Collection Management Tobacco Settlement Documents (CMT) CS5604 Fall 2019

Muhundan, Sushmethaa; Bendelac, Alon; Zhao, Yan; Svetovidov, Andrei; Biswas, Debasmita; Marin Thomas, Ashin

dc.contributor.author	Muhundan, Sushmethaa	en
dc.contributor.author	Bendelac, Alon	en
dc.contributor.author	Zhao, Yan	en
dc.contributor.author	Svetovidov, Andrei	en
dc.contributor.author	Biswas, Debasmita	en
dc.contributor.author	Marin Thomas, Ashin	en
dc.date.accessioned	2020-01-15T02:55:02Z	en
dc.date.available	2020-01-15T02:55:02Z	en
dc.date.issued	2019-12-11	en
dc.description.abstract	Tobacco consumption causes both mental and physical health problems. Despite this widely known fact, tobacco companies have sustained a large market presence over the past century through a variety of successful marketing strategies. This report documents the work of the Collection Management Tobacco Settlement Documents (CMT) team, the data ingestion team for the tobacco documents. We work with an archive of tobacco documents produced during litigation between the United States and seven major tobacco industry organizations. Our aim is to process these documents and assist Dr. David M. Townsend, an assistant professor in the Pamplin College of Business at Virginia Polytechnic Institute and State University (Virginia Tech), in his research on the marketing strategies of the tobacco companies. The team is part of a larger initiative to build a state-of-the-art information retrieval and analysis system. We handle over 14 million tobacco settlement documents as part of this project. Our tasks include extracting both the data and the metadata from these documents. We serve the Elasticsearch (ELS) team and the Text Analytics and Machine Learning (TML) team by providing tobacco settlement data in formats that let them process the data and feed it into the information retrieval system. We have successfully processed both the metadata and the document texts into a usable format. For the metadata, this involved collaborating with the above-mentioned teams to agree on a suitable format; we retrieved the metadata from a MySQL database and converted it into JSON for Elasticsearch ingestion. For the document text, this involved lemmatization, tokenization, and text cleaning. We have supplied the entire cleaned dataset, both data and metadata, to the ELS and TML teams. Python scripts were used to query the database and output the results in the required format.
We also worked closely with Dr. Townsend to understand his research needs, so that we could advise the Front-end and Kibana (FEK) team on features suitable for visualizations. In this way, the information retrieval system we build adds more value for our client.	en
dc.description.notes	CMTpresentation.pdf: PDF version of the final presentation; CMTpresentation.pptx: PowerPoint version of the final presentation; CMTreport.pdf: PDF version of the final report; CMTreportOverleaf.zip: Archive of Overleaf project of report; CMTcodebase.zip: Archive of software developed	en
dc.description.sponsorship	IMLS LG-37-19-0078-19	en
dc.identifier.uri	http://hdl.handle.net/10919/96437	en
dc.language.iso	en_US	en
dc.publisher	Virginia Tech	en
dc.rights	Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States	en
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/3.0/us/	en
dc.subject	Information storage and retrieval	en
dc.subject	Tobacco Settlement Documents	en
dc.subject	Data pre-processing	en
dc.subject	Lemmatization	en
dc.subject	Tokenization	en
dc.subject	CS5604	en
dc.subject	Metadata extraction	en
dc.subject	Ceph	en
dc.subject	Python	en
dc.subject	Data cleaning	en
dc.title	Collection Management Tobacco Settlement Documents (CMT) CS5604 Fall 2019	en
dc.type	Presentation	en
dc.type	Report	en
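
Illustrative sketch (not taken from CMTcodebase.zip): the abstract describes converting metadata rows into JSON for Elasticsearch ingestion and cleaning and tokenizing document text. The Python below shows one way those two steps might look. The function names `clean_and_tokenize` and `to_bulk_lines`, the index name `tobacco_metadata`, and the `id` field are all hypothetical. The MySQL query is not shown, so rows are assumed to arrive as plain dictionaries. Lemmatization is omitted because it would need an NLP library that the record does not name.

```python
import json
import re


def clean_and_tokenize(text):
    """Lowercase the text, replace non-alphanumeric characters with
    spaces, and split on whitespace. Lemmatization is not performed."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()


def to_bulk_lines(rows, index_name="tobacco_metadata"):
    """Yield newline-delimited JSON lines in the Elasticsearch bulk
    format: one action line followed by one document line per row.

    `rows` is assumed to be an iterable of dicts, each with an "id" key
    (a hypothetical field name, not confirmed by the source).
    """
    for row in rows:
        action = {"index": {"_index": index_name, "_id": row["id"]}}
        yield json.dumps(action)
        yield json.dumps(row)


if __name__ == "__main__":
    sample_rows = [{"id": "doc-1", "title": "Marketing memo"}]
    print("\n".join(to_bulk_lines(sample_rows)))
    print(clean_and_tokenize("Marketing memo, 1985!"))
```

The bulk format pairs each document with an action line, which lets many documents be sent in a single request; whether the team used the bulk endpoint or another ingestion route is not stated in this record.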