Collection Management Tobacco Settlement Documents (CMT) CS5604 Fall 2019

dc.contributor.author: Muhundan, Sushmetha (en)
dc.contributor.author: Bendelac, Alon (en)
dc.contributor.author: Zhao, Yan (en)
dc.contributor.author: Svetovidov, Andrei (en)
dc.contributor.author: Biswas, Debasmita (en)
dc.contributor.author: Marin Thomas, Ashin (en)
dc.date.accessioned: 2020-01-15T02:55:02Z (en)
dc.date.available: 2020-01-15T02:55:02Z (en)
dc.date.issued: 2019-12-11 (en)
dc.description.abstract: Consumption of tobacco causes health issues, both mental and physical. Despite this widely known fact, tobacco companies have sustained a large market presence over the past century through a variety of successful marketing strategies. This report documents the work of the Collection Management Tobacco Settlement Documents (CMT) team, the data ingestion team for the tobacco documents. We work with an archive of tobacco documents that were produced during litigation between the United States and seven major tobacco industry organizations. Our aim is to process these documents and assist Dr. David M. Townsend, an assistant professor at the Virginia Polytechnic Institute and State University (Virginia Tech) Pamplin College of Business, in his research on the marketing strategies of the tobacco companies. The team is part of a larger initiative to build a state-of-the-art information retrieval and analysis system. We handle over 14 million tobacco settlement documents as part of this project. Our tasks include extracting both the data and the metadata from these documents. We serve the Elasticsearch (ELS) team and the Text Analytics and Machine Learning (TML) team by providing tobacco settlement data in formats that they can process and feed into the information retrieval system. We have successfully processed both the metadata and the document texts into a usable format. For the metadata, this involved collaborating with the ELS and TML teams to agree on a suitable format; we then retrieved the metadata from a MySQL database and converted it into JSON for Elasticsearch ingestion. For the document texts, this involved lemmatization, tokenization, and text cleaning. Python scripts were used to query the database and output the results in the required format. We have supplied the entire cleaned dataset, both data and metadata, to the ELS and TML teams.
We also worked closely with Dr. Townsend to understand his research needs, so that we could advise the Front-end and Kibana (FEK) team on features that can be used for visualizations. In this way, the information retrieval system we build adds more value for our client. (en)
dc.description.notes: CMTpresentation.pdf: PDF version of the final presentation; CMTpresentation.pptx: PowerPoint version of the final presentation; CMTreport.pdf: PDF version of the final report; CMTreportOverleaf.zip: Archive of Overleaf project of report; CMTcodebase.zip: Archive of software developed (en)
dc.description.sponsorship: IMLS LG-37-19-0078-19 (en)
dc.identifier.uri: http://hdl.handle.net/10919/96437 (en)
dc.language.iso: en_US (en)
dc.publisher: Virginia Tech (en)
dc.rights: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States (en)
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/3.0/us/ (en)
dc.subject: Information storage and retrieval (en)
dc.subject: Tobacco Settlement Documents (en)
dc.subject: Data pre-processing (en)
dc.subject: Lemmatization (en)
dc.subject: Tokenization (en)
dc.subject: CS5604 (en)
dc.subject: Metadata extraction (en)
dc.subject: Ceph (en)
dc.subject: Python (en)
dc.subject: Data cleaning (en)
dc.title: Collection Management Tobacco Settlement Documents (CMT) CS5604 Fall 2019 (en)
dc.type: Presentation (en)
dc.type: Report (en)
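
The abstract mentions two processing steps: converting metadata rows into JSON for Elasticsearch ingestion, and cleaning and tokenizing document text. The following is only a minimal sketch of what such steps might look like, not the team's actual code (that is in CMTcodebase.zip). It assumes the metadata rows have already been fetched from the database as Python dictionaries and that the target is the newline-delimited format accepted by the Elasticsearch bulk API. The function names, the index name `tobacco`, and the field names are hypothetical. Lemmatization is left out because it needs a third-party library.

```python
import json
import re


def rows_to_bulk_ndjson(rows, index_name, id_field):
    """Render metadata rows as bulk-API NDJSON: one action line, then one document line."""
    lines = []
    for row in rows:
        # Drop empty fields so missing values are not indexed as nulls.
        doc = {key: value for key, value in row.items() if value not in (None, "")}
        action = {"index": {"_index": index_name, "_id": str(row[id_field])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    if not lines:
        return ""
    # The bulk API expects the payload to end with a newline.
    return "\n".join(lines) + "\n"


# Lowercase alphanumeric runs, optionally followed by an apostrophe suffix ("company's").
_TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def clean_and_tokenize(text):
    """Lowercase the text and keep only word-like tokens, discarding punctuation."""
    return _TOKEN_RE.findall(text.lower())


if __name__ == "__main__":
    # Hypothetical rows standing in for the result of a database query.
    sample_rows = [
        {"id": 1, "title": "Marketing memo", "author": None},
        {"id": 2, "title": "Research report", "author": "Unknown"},
    ]
    print(rows_to_bulk_ndjson(sample_rows, "tobacco", "id"), end="")
    print(clean_and_tokenize("The  Company's MARKETING plan, 1998!"))
```

Keeping the conversion separate from the database query means the same function can be reused regardless of how the rows are fetched, and the tokenizer can be replaced by a library tokenizer and lemmatizer without touching the metadata path.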

Files

Original bundle
Name:
CMTcodebase.zip
Size:
1.17 MB
Format:
Name:
CMTpresentation.pdf
Size:
3.67 MB
Format:
Adobe Portable Document Format
Name:
CMTpresentation.pptx
Size:
3.37 MB
Format:
Microsoft Powerpoint XML
Name:
CMTreport.pdf
Size:
2.44 MB
Format:
Adobe Portable Document Format
Name:
CMTreportOverleaf.zip
Size:
3.25 MB
Format:
License bundle
Name:
license.txt
Size:
1.5 KB
Format:
Item-specific license agreed upon to submission