Text Analytics and Machine Learning (TML) CS5604 Fall 2019

Mansur, Rifat Sabbir; Mandke, Prathamesh; Gong, Jiaying; Bharadwaj, Sandhya M.; Juvekar, Adheesh Sunil; Chougule, Sharvari

dc.contributor.author	Mansur, Rifat Sabbir	en
dc.contributor.author	Mandke, Prathamesh	en
dc.contributor.author	Gong, Jiaying	en
dc.contributor.author	Bharadwaj, Sandhya M.	en
dc.contributor.author	Juvekar, Adheesh Sunil	en
dc.contributor.author	Chougule, Sharvari	en
dc.date.accessioned	2019-12-29T02:42:21Z	en
dc.date.available	2019-12-29T02:42:21Z	en
dc.date.issued	2019-12-29	en
dc.description.abstract	In order to use the burgeoning amount of data for knowledge discovery, it is becoming increasingly important to build efficient and intelligent information retrieval systems. The challenge in information retrieval lies not only in fetching the documents relevant to a query but also in ranking them in order of relevance. The large size of the corpora, as well as the variety in the content and format of the information, poses additional challenges in the retrieval process. This calls for the use of text analytics and machine learning techniques to analyze and extract insights from the data to build an efficient retrieval system that enhances the overall user experience. With this background, the goal of the Text Analytics and Machine Learning team is to suitably augment the document indexing and demonstrate a qualitative improvement in the document retrieval. Further, we also plan to make use of document browsing and viewing logs to provide meaningful recommendations to the user. The goal of the class is to build an end-to-end information retrieval system for two document corpora, viz., Electronic Theses & Dissertations (ETDs) and Tobacco Settlement Records (TSRs). The ETDs are a collection of over 33,000 thesis and dissertation documents in VTechWorks at Virginia Tech. The challenge in building a retrieval system around this corpus lies in the distinct nature of ETDs as opposed to other well-studied document formats such as conference/journal publications and web pages. The TSR corpus consists of over 14M records covering formats ranging from letters and memos to image-based advertisements. We seek to understand the nature of both these corpora as well as the information need patterns of the users in order to augment the index-based search with domain-specific information using machine-learning-based methods.
Extending prior experiments, we investigated reasons for the unbalanced nature of the clusters from the previous iterations of the K-Means algorithm on the tobacco data. In addition, we explored and present preliminary results of running Agglomerative Clustering on a small subset of the tobacco data. We also explored different pre-trained models for detecting sentiment. We identified a package, empath, that showed better results in identifying emotions in the tobacco deposition documents. Furthermore, we implemented text summarization based on both Latent Semantic Analysis and the Luhn algorithm on the tobacco (article) data (38,038 documents). We also implemented text summarization on a sample ETD chapter dataset.	en
dc.description.notes	# TMLreport.pdf = The report covering our overall motivation, design, procedures, evaluations, etc. It was written in LaTeX using Overleaf. # TMLreportOverleaf.zip = The Overleaf source files for our report, which can be used to make future changes to it. # TMLpresentationPDF.pdf = Our final presentation slides in PDF format, covering our overall exploration and findings. # TMLpresentationSlides.pptx = The editable PowerPoint file of our presentation slides. # TMLcodeClustering.zip = All the source code for clustering. # TMLcodeTextSummarization = All the source code for text summarization. # TMLcodeNER = All the source code and sample output for named-entity recognition (NER). # TMLcodeSentimentAnalysis = All the source code for sentiment analysis.	en
dc.description.sponsorship	IMLS LG-37-19-0078-19	en
dc.description.sponsorship	Dr. David M. Townsend	en
dc.identifier.uri	http://hdl.handle.net/10919/96226	en
dc.language.iso	en_US	en
dc.publisher	Virginia Tech	en
dc.rights	Creative Commons Attribution-ShareAlike 3.0 United States	en
dc.rights.uri	http://creativecommons.org/licenses/by-sa/3.0/us/	en
dc.subject	clustering	en
dc.subject	text summarization	en
dc.subject	sentiment analysis	en
dc.subject	recommender system	en
dc.subject	named-entity recognition	en
dc.subject	electronic thesis and dissertation	en
dc.subject	tobacco documents	en
dc.subject	search optimization	en
dc.title	Text Analytics and Machine Learning (TML) CS5604 Fall 2019	en
dc.type	Presentation	en
dc.type	Report	en
dc.type	Software	en