AWS Tobacco Settlement Retrieval
dc.contributor.author | Sitaula, Anamol | en |
dc.contributor.author | Mekap, Abhinandan | en |
dc.contributor.author | Kanuri, Aditya | en |
dc.contributor.author | Bossart, Douglas | en |
dc.contributor.author | Pokharel, Nishan | en |
dc.contributor.author | Ray, Rahul | en |
dc.date.accessioned | 2020-05-14T14:48:51Z | en |
dc.date.available | 2020-05-14T14:48:51Z | en |
dc.date.issued | 2020-05-14 | en |
dc.description.abstract | The Tobacco Industry is one of the largest and most influential industries. It has spent hundreds of millions of dollars on advertising and marketing tactics to ensure dominance and control in the economy. This is especially evident in tobacco settlement cases, where the enormous power and influence of the Tobacco Industry have allowed it to develop key strategies and tactics for trials and settlements over the past century. Our client, Dr. Townsend, is currently researching the tactics and inner workings of the Tobacco Industry over the past few decades to expose its marketing and legal strategies as well as the key players who have been influential in the Industry. Dr. Townsend is utilizing the “Truth Tobacco Industry Documents”, a library of documents created and facilitated by the UCSF Library for research purposes. Our project is meant to further enable researchers specializing in business, public health, law, or computer science, who will benefit from easier access to tobacco-settlement-related documents with enhanced search capabilities, extending the work of the Fall 2019 CS5604 Information Retrieval teams. We studied the 14 million tobacco-related documents from UCSF. We improved the indexing of the roughly 8,000 depositions to support line-wise as well as page-wise indexing. We modified and updated existing Python scripts to output the results in the required JSON format, and then pushed the documents into ElasticSearch. We also created a second tobacco index and added another 3 million tobacco files to it. All testing and evaluation work was done using Python scripts. We used the existing Kibana tool for the visual representation of the data. | en |
dc.description.notes | The TobaccoSettlementReport files contain the full report, which covers the design, implementation, and related manuals, explains the work we completed, and lays out the steps needed both to use the system and to expand upon the current implementation. The files have been uploaded in both .docx and .pdf format. The TobaccoSettlementPresentation files are an abridged version of the report covering the main aspects of the project. The files have been uploaded in both .pptx and .pdf format. The TobaccoSettlementSupplement.tar file contains two folders named data and scripts. The scripts folder contains three scripts: file2jsonDepth.py, which is the code to index a document line-wise; ingestion_scripts.sh, which is the script used to ingest batches of documents into ElasticSearch; and metadata_to_json_fast_line.py, which indexes the raw files line-wise and formats them as JSON for ingestion into ElasticSearch. The data folder contains two files: a test file named ffxf0028, which is the test output of running the line-wise algorithm, and linewisedep10.json, which shows 5 deposition documents parsed properly. | en |
dc.identifier.uri | http://hdl.handle.net/10919/98265 | en |
dc.language.iso | en_US | en |
dc.publisher | Virginia Tech | en |
dc.rights | Creative Commons CC0 1.0 Universal Public Domain Dedication | en |
dc.rights.uri | http://creativecommons.org/publicdomain/zero/1.0/ | en |
dc.subject | Tobacco Industry | en |
dc.subject | Deposition Documents | en |
dc.subject | Tobacco Settlement Documents | en |
dc.subject | Depositions | en |
dc.subject | ElasticSearch | en |
dc.subject | Kibana | en |
dc.title | AWS Tobacco Settlement Retrieval | en |
dc.type | Presentation | en |
dc.type | Report | en |
dc.type | Other | en |
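The abstract above describes indexing depositions line-wise and page-wise, emitting JSON, and pushing the result into ElasticSearch. The following is a minimal sketch of that general idea, not the project's actual scripts. The function names, the field names (`doc_id`, `page`, `line`, `text`), the index name `depositions_linewise`, the fixed lines-per-page value, and the sample text are all assumptions made for illustration. The output uses the newline-delimited format accepted by the Elasticsearch bulk API, where each document is preceded by an action line.

```python
import json


def linewise_records(doc_id, text, lines_per_page=25):
    """Yield one record per non-empty line, tagged with page and line numbers.

    Assumes a fixed number of lines per page (hypothetical; real transcripts
    may need page markers parsed from the text instead).
    """
    for i, line in enumerate(text.splitlines()):
        if not line.strip():
            continue  # blank lines still count toward line numbering
        yield {
            "doc_id": doc_id,
            "page": i // lines_per_page + 1,
            "line": i % lines_per_page + 1,
            "text": line.strip(),
        }


def to_bulk_ndjson(records, index_name):
    """Serialize records as bulk-API NDJSON: an action line, then the source."""
    out = []
    for rec in records:
        rec_id = f"{rec['doc_id']}-{rec['page']}-{rec['line']}"
        out.append(json.dumps({"index": {"_index": index_name, "_id": rec_id}}))
        out.append(json.dumps(rec))
    return "\n".join(out) + "\n"  # the bulk API requires a trailing newline


# Made-up sample text, two lines per page to keep the example small.
sample = "Q. Please state your name.\nA. Jane Roe.\n\nQ. Where do you work?"
records = list(linewise_records("example-doc", sample, lines_per_page=2))
payload = to_bulk_ndjson(records, "depositions_linewise")
print(payload)
```

Keeping page and line numbers on every record lets a search hit be cited by its position in the transcript; a page-wise index could be built the same way by grouping records on `page` before serializing.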
Files
Original bundle
1 - 5 of 5
- Name:
- TobaccoSettlementPresentation.pdf
- Size:
- 1.07 MB
- Format:
- Adobe Portable Document Format
- Name:
- TobaccoSettlementReport.pdf
- Size:
- 1.87 MB
- Format:
- Adobe Portable Document Format
License bundle
1 - 1 of 1
- Name:
- license.txt
- Size:
- 1.5 KB
- Format:
- Item-specific license agreed upon to submission