Tobacco Settlement Documents


Tobacco companies have run some of the most effective marketing campaigns of the past century. It is well documented that tobacco causes both mental and physical health problems, and yet these companies have found ways to remain among the largest businesses. The goal of our project is to assist Dr. Townsend in his research into Big Tobacco's strategies.

We do this by taking a subset of the fourteen million documents that tobacco companies have released online and presenting the data in a form that can be analyzed. The project is hosted on a virtual machine provided to the team by Dr. Fox and the VT Computer Science department. The pipeline begins by gathering the documents from the web and converting them to a usable text format, then feeds them to a Doc2Vec-based machine learning tool built with Gensim. Using a pre-trained model, we then cluster the resulting data so that it can be presented in a usable manner. Dr. Townsend and other researchers can then use this system to further their research.
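The text-preparation step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `TaggedText` record, the `tokenize` and `prepare_corpus` helpers, and the sample document IDs are all hypothetical. The record simply mirrors the words-plus-tags shape that Gensim's `TaggedDocument` uses, so that the output could plausibly be handed to a Doc2Vec trainer.

```python
import re
from typing import Dict, List, NamedTuple


class TaggedText(NamedTuple):
    """Hypothetical stand-in for a (words, tags) training record."""
    words: List[str]
    tags: List[str]


# Keep runs of letters and digits, optionally followed by an apostrophe suffix.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def tokenize(raw_text: str) -> List[str]:
    """Lowercase the text and keep alphanumeric tokens, dropping punctuation noise."""
    return TOKEN_RE.findall(raw_text.lower())


def prepare_corpus(documents: Dict[str, str]) -> List[TaggedText]:
    """Pair each document's tokens with its ID, skipping documents with no usable text."""
    corpus = []
    for doc_id, raw_text in documents.items():
        words = tokenize(raw_text)
        if words:
            corpus.append(TaggedText(words=words, tags=[doc_id]))
    return corpus


# Hypothetical sample input; real input would be the converted document text.
sample = {
    "doc-001": "Marketing plan: Youth-oriented   ads, 1985.",
    "doc-002": "   ",
}
corpus = prepare_corpus(sample)
```

Documents that yield no tokens, such as blank or unreadable scans, are dropped here so they do not produce empty training records. Whether the real pipeline filters them the same way is an assumption.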

This submission includes a report on how to use and maintain the system, so that Dr. Townsend can use it as he sees fit and future developers can understand how it works. The system is composed of several components, including a Gensim Doc2Vec model and Gensim's fast approximate nearest neighbor similarity package, which is used to cluster the data. Everything is stored and set up on the virtual machine provided by the CS department, so it should be accessible as long as the user is connected to the campus Wi-Fi. Through this project our team learned a great deal about working with a client, working with new technologies, and tracking and presenting progress to others.
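To illustrate the idea behind similarity lookup over document vectors, here is a small sketch using exact, brute-force cosine similarity. It is not the approximate nearest neighbor index from Gensim that the system uses; an approximate index trades some accuracy for speed on large collections, while this version compares the query against every vector. The function names, document IDs, and two-dimensional vectors are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    query_id: str, vectors: Dict[str, List[float]], topn: int = 2
) -> List[Tuple[str, float]]:
    """Return the topn other documents ranked by cosine similarity to the query."""
    query = vectors[query_id]
    scored = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in vectors.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical toy vectors; real document vectors would come from the trained model.
vectors = {
    "doc-a": [1.0, 0.0],
    "doc-b": [0.9, 0.1],
    "doc-c": [0.0, 1.0],
}
neighbors = most_similar("doc-a", vectors, topn=2)
```

With these toy vectors, `doc-b` ranks first for `doc-a` because the two point in nearly the same direction, while `doc-c` is orthogonal and scores zero.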



tobacco settlement documents, Doc2Vec, clustering