Text Analytics and Machine Learning (TML) CS5604 Fall 2019

dc.contributor.author: Mansur, Rifat Sabbir [en]
dc.contributor.author: Mandke, Prathamesh [en]
dc.contributor.author: Gong, Jiaying [en]
dc.contributor.author: Bharadwaj, Sandhya M. [en]
dc.contributor.author: Juvekar, Adheesh Sunil [en]
dc.contributor.author: Chougule, Sharvari [en]
dc.date.accessioned: 2019-12-29T02:42:21Z [en]
dc.date.available: 2019-12-29T02:42:21Z [en]
dc.date.issued: 2019-12-29 [en]
dc.description.abstract: In order to use the burgeoning amount of data for knowledge discovery, it is becoming increasingly important to build efficient and intelligent information retrieval systems. The challenge in information retrieval lies not only in fetching the documents relevant to a query but also in ranking them in order of relevance. The large size of the corpora, as well as the variety in the content and format of the information, poses additional challenges in the retrieval process. This calls for the use of text analytics and machine learning techniques to analyze and extract insights from the data to build an efficient retrieval system that enhances the overall user experience. With this background, the goal of the Text Analytics and Machine Learning team is to suitably augment the document indexing and demonstrate a qualitative improvement in the document retrieval. Further, we also plan to make use of document browsing and viewing logs to provide meaningful recommendations to the user. The goal of the class is to build an end-to-end information retrieval system for two document corpora, viz., Electronic Theses & Dissertations (ETDs) and Tobacco Settlement Records (TSRs). The ETDs are a collection of over 33,000 thesis and dissertation documents in VTechWorks at Virginia Tech. The challenge in building a retrieval system around this corpus lies in the distinct nature of ETDs as opposed to other well-studied document formats such as conference/journal publications and web pages. The TSR corpus consists of over 14M records covering formats ranging from letters and memos to image-based advertisements. We seek to understand the nature of both these corpora as well as the information need patterns of the users in order to augment the index-based search with domain-specific information using machine-learning-based methods.
Extending prior experiments, we investigate reasons for the unbalanced nature of the clusters from the previous iterations of the K-Means algorithm on the tobacco data. In addition, we explore and present preliminary results of running Agglomerative Clustering on a small subset of the tobacco data. We also explored different pre-trained models for detecting sentiment and identified a package, empath, that shows better results in identifying emotions in the tobacco deposition documents. We further implemented text summarization based on both Latent Semantic Analysis and the Luhn algorithm on the tobacco (article) data (38,038 documents). We also implemented text summarization on a sample ETD chapter dataset. [en]
dc.description.notes:
# TMLreport.pdf = This is the report of our overall motivation, design, procedures, evaluations, etc. The report was created using LaTeX via Overleaf.
# TMLreportOverleaf.zip = These are the Overleaf source files of our report. They can be used to make future changes to the report.
# TMLpresentationPDF.pdf = These are our final presentation slides in PDF format. They contain our overall exploration and findings.
# TMLpresentationSlides.pptx = This is the editable PowerPoint file for our presentation slides.
# TMLcodeClustering.zip = This contains all the source code for clustering.
# TMLcodeTextSummarization = This contains all the source code for text summarization.
# TMLcodeNER = This contains all the source code and sample output for named-entity recognition (NER).
# TMLcodeSentimentAnalysis = This contains all the source code for sentiment analysis. [en]
dc.description.sponsorship: IMLS LG-37-19-0078-19 [en]
dc.description.sponsorship: Dr. David M. Townsend [en]
dc.identifier.uri: http://hdl.handle.net/10919/96226 [en]
dc.language.iso: en_US [en]
dc.publisher: Virginia Tech [en]
dc.rights: Creative Commons Attribution-ShareAlike 3.0 United States [en]
dc.rights.uri: http://creativecommons.org/licenses/by-sa/3.0/us/ [en]
dc.subject: clustering [en]
dc.subject: text summarization [en]
dc.subject: sentiment analysis [en]
dc.subject: recommender system [en]
dc.subject: named-entity recognition [en]
dc.subject: electronic thesis and dissertation [en]
dc.subject: tobacco documents [en]
dc.subject: search optimization [en]
dc.title: Text Analytics and Machine Learning (TML) CS5604 Fall 2019 [en]
dc.type: Presentation [en]
dc.type: Report [en]
dc.type: Software [en]

Files

Original bundle
Now showing 1 - 5 of 8
Name: TMLcodeClustering.zip
Size: 746.45 KB
Name: TMLcodeTextSummarization.zip
Size: 6.11 KB
Name: TMLcodeNER.zip
Size: 11.05 MB
Name: TMLcodeSentimentAnalysis.zip
Size: 1.75 KB
Name: TMLreport.pdf
Size: 2.31 MB
Format: Adobe Portable Document Format
License bundle
Now showing 1 - 1 of 1
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission