AWS Tobacco Settlement Retrieval

dc.contributor.author | Sitaula, Anamol | en
dc.contributor.author | Mekap, Abhinandan | en
dc.contributor.author | Kanuri, Aditya | en
dc.contributor.author | Bossart, Douglas | en
dc.contributor.author | Pokharel, Nishan | en
dc.contributor.author | Ray, Rahul | en
dc.date.accessioned | 2020-05-14T14:48:51Z | en
dc.date.available | 2020-05-14T14:48:51Z | en
dc.date.issued | 2020-05-14 | en
dc.description.abstract | The Tobacco Industry is one of the largest and most influential industries. It has spent hundreds of millions of dollars on advertising and marketing tactics to ensure dominance and control in the economy. This is especially evident in tobacco settlement cases, where the enormous power and influence of the Tobacco Industry have allowed it to develop key strategies and tactics for trials and settlements over the past century. Our client, Dr. Townsend, is researching the tactics and inner workings of the Tobacco Industry over the past few decades to expose its marketing and legal strategies as well as the key players who have been influential in the Industry. Dr. Townsend is using the “Truth Tobacco Industry Documents”, a library of documents created and facilitated by the UCSF Library for research purposes. Our project is meant to further enable researchers specializing in business, public health, law, or computer science, who will benefit from easier access to tobacco-settlement-related documents with enhanced search capabilities. It extends the work of the Fall 2019 CS5604 Information Retrieval teams. We studied the 14 million tobacco-related documents from UCSF. We improved the indexing of the roughly 8000 depositions to support line-wise as well as page-wise indexing. We modified and updated existing Python scripts to output the results in the required JSON format, and then pushed the documents into ElasticSearch. We also created a second tobacco index and added another 3 million tobacco files to it. All testing and evaluation work was done using Python scripts. We used the existing Kibana tool for visual representation of the data. | en
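The line-wise and page-wise indexing mentioned in the abstract can be illustrated with a minimal sketch. It is not the project's actual script: the function names, the record fields (doc_id, page, line, text), the document ID, and the assumption that pages in the raw text are separated by a form-feed character are all hypothetical choices made for illustration.

```python
import json


def split_pages(raw_text):
    """Split raw deposition text into pages.

    Assumption: pages are separated by a form-feed character ("\f").
    """
    return raw_text.split("\f")


def pagewise_records(doc_id, pages):
    """Yield one JSON-ready record per page (page numbers are 1-based)."""
    for page_no, page_text in enumerate(pages, start=1):
        yield {"doc_id": doc_id, "page": page_no, "text": page_text.strip()}


def linewise_records(doc_id, pages):
    """Yield one JSON-ready record per non-blank line.

    Line numbers are 1-based and count physical lines on the page,
    so a skipped blank line still advances the counter.
    """
    for page_no, page_text in enumerate(pages, start=1):
        for line_no, line in enumerate(page_text.splitlines(), start=1):
            if line.strip():
                yield {
                    "doc_id": doc_id,
                    "page": page_no,
                    "line": line_no,
                    "text": line.strip(),
                }


if __name__ == "__main__":
    # Hypothetical two-page transcript used only for demonstration.
    raw = "Q. State your name.\nA. John Doe.\fQ. Where do you work?\n\nA. At the plant."
    for record in linewise_records("example0001", split_pages(raw)):
        print(json.dumps(record))
```

Keeping the page and line numbers in each record would let a search hit be traced back to the page and line in the original deposition.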
dc.description.notes | The TobaccoSettlementReport files contain the full report, covering the design, implementation, and related manuals. They explain the work we completed and the steps needed both to use the system and to expand upon the current implementation. The files have been uploaded in both .docx and .pdf format. The TobaccoSettlementPresentation files are an abridged version of the report covering the main aspects of the project. The files have been uploaded in both .pptx and .pdf format. The TobaccoSettlementSupplement.tar file contains two folders named data and scripts. The scripts folder contains three scripts: file2jsonDepth.py, the code to index a document line-wise; ingestion_scripts.sh, the script used to ingest batches of documents into ElasticSearch; and metadata_to_json_fast_line.py, which indexes the raw files line-wise and formats them as JSON for ingestion into ElasticSearch. The data folder contains two files: ffxf0028, a test file that is the output of running the line-wise algorithm, and linewisedep10.json, which shows 5 deposition documents parsed properly. | en
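As a rough illustration of the batch-ingestion step described in the notes, the sketch below turns line-wise records into the newline-delimited JSON body that the Elasticsearch bulk API expects: one action line followed by one document line per record. It is not taken from ingestion_scripts.sh. The function name, the index name depositions_linewise, the record fields, and the _id scheme are assumptions made for this example, and the code only builds the payload without sending it.

```python
import json


def to_bulk_ndjson(records, index_name):
    """Build a bulk-API request body from line-wise records.

    Each record contributes two lines: an "index" action naming the
    target index and document ID, then the document source itself.
    The ID scheme (doc_id_page_line) is a hypothetical choice.
    """
    lines = []
    for rec in records:
        doc_id = f'{rec["doc_id"]}_{rec["page"]}_{rec["line"]}'
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(rec))
    # The bulk body must end with a newline; an empty batch yields no body.
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    # Hypothetical records used only for demonstration.
    sample = [
        {"doc_id": "example0001", "page": 1, "line": 1, "text": "Q. State your name."},
        {"doc_id": "example0001", "page": 1, "line": 2, "text": "A. John Doe."},
    ]
    print(to_bulk_ndjson(sample, "depositions_linewise"), end="")
```

A payload like this would typically be posted to the cluster's _bulk endpoint with the Content-Type header set to application/x-ndjson. Deterministic IDs mean that re-running a batch overwrites documents instead of duplicating them.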
dc.identifier.uri | http://hdl.handle.net/10919/98265 | en
dc.language.iso | en_US | en
dc.publisher | Virginia Tech | en
dc.rights | Creative Commons CC0 1.0 Universal Public Domain Dedication | en
dc.rights.uri | http://creativecommons.org/publicdomain/zero/1.0/ | en
dc.subject | Tobacco Industry | en
dc.subject | Deposition Documents | en
dc.subject | Tobacco Settlement Documents | en
dc.subject | Depositions | en
dc.subject | ElasticSearch | en
dc.subject | Kibana | en
dc.title | AWS Tobacco Settlement Retrieval | en
dc.type | Presentation | en
dc.type | Report | en
dc.type | Other | en

Files

Original bundle
Name:
TobaccoSettlementSupplement.tar
Size:
4.48 MB
Format:
Unknown data format
Name:
TobaccoSettlementPresentation.pptx
Size:
1.92 MB
Format:
Microsoft Powerpoint XML
Name:
TobaccoSettlementPresentation.pdf
Size:
1.07 MB
Format:
Adobe Portable Document Format
Name:
TobaccoSettlementReport.docx
Size:
5.92 MB
Format:
Microsoft Word XML
Name:
TobaccoSettlementReport.pdf
Size:
1.87 MB
Format:
Adobe Portable Document Format
License bundle
Name:
license.txt
Size:
1.5 KB
Format:
Item-specific license agreed upon to submission