Text Analytics and Machine Learning (TML) CS5604 Fall 2019
dc.contributor.author | Mansur, Rifat Sabbir | en |
dc.contributor.author | Mandke, Prathamesh | en |
dc.contributor.author | Gong, Jiaying | en |
dc.contributor.author | Bharadwaj, Sandhya M. | en |
dc.contributor.author | Juvekar, Adheesh Sunil | en |
dc.contributor.author | Chougule, Sharvari | en |
dc.date.accessioned | 2019-12-29T02:42:21Z | en |
dc.date.available | 2019-12-29T02:42:21Z | en |
dc.date.issued | 2019-12-29 | en |
dc.description.abstract | In order to use the burgeoning amount of data for knowledge discovery, it is becoming increasingly important to build efficient and intelligent information retrieval systems. The challenge in information retrieval lies not only in fetching the documents relevant to a query but also in ranking them in order of relevance. The large size of the corpora, as well as the variety in the content and format of the information, poses additional challenges in the retrieval process. This calls for the use of text analytics and machine learning techniques to analyze and extract insights from the data to build an efficient retrieval system that enhances the overall user experience. With this background, the goal of the Text Analytics and Machine Learning team is to suitably augment the document indexing and demonstrate a qualitative improvement in the document retrieval. Further, we also plan to make use of document browsing and viewing logs to provide meaningful recommendations to the user. The goal of the class is to build an end-to-end information retrieval system for two document corpora, viz., Electronic Theses & Dissertations (ETDs) and Tobacco Settlement Records (TSRs). The ETDs are a collection of over 33,000 thesis and dissertation documents in VTechWorks at Virginia Tech. The challenge in building a retrieval system around this corpus lies in the distinct nature of ETDs as opposed to other well-studied document formats such as conference/journal publications and web pages. The TSR corpus consists of over 14M records covering formats ranging from letters and memos to image-based advertisements. We seek to understand the nature of both these corpora as well as the information need patterns of the users in order to augment the index-based search with domain-specific information using machine-learning-based methods.
Extending prior experiments, we investigate reasons for the unbalanced nature of the clusters from the previous iterations of the K-Means algorithm on the tobacco data. In addition, we explore and present preliminary results of running Agglomerative Clustering on a small subset of the tobacco data. We also explored different pre-trained models for detecting sentiment. We identified a package, empath, that shows better results in identifying emotions in the tobacco deposition documents. Further, we implemented text summarization based on both Latent Semantic Analysis and the Luhn Algorithm on the tobacco (article) data (38,038 documents). We also implemented text summarization on a sample ETD chapter dataset. | en |
dc.description.notes | # TMLreport.pdf = This is the report of our overall motivation, design, procedures, evaluations, etc. The report was created using LaTeX via Overleaf. # TMLreportOverleaf.zip = These are the Overleaf source files of our report. They can be used to make future changes to our report. # TMLpresentationPDF.pdf = These are our final presentation slides in PDF format. They contain our overall exploration and findings. # TMLpresentationSlides.pptx = This is the editable PowerPoint file of our presentation slides. # TMLcodeClustering.zip = This contains all the source code for clustering. # TMLcodeTextSummarization = This contains all the source code for text summarization. # TMLcodeNER = This contains all the source code and sample output for named-entity recognition (NER). # TMLcodeSentimentAnalysis = This contains all the source code for sentiment analysis. | en |
dc.description.sponsorship | IMLS LG-37-19-0078-19 | en |
dc.description.sponsorship | Dr. David M. Townsend | en |
dc.identifier.uri | http://hdl.handle.net/10919/96226 | en |
dc.language.iso | en_US | en |
dc.publisher | Virginia Tech | en |
dc.rights | Creative Commons Attribution-ShareAlike 3.0 United States | en |
dc.rights.uri | http://creativecommons.org/licenses/by-sa/3.0/us/ | en |
dc.subject | clustering | en |
dc.subject | text summarization | en |
dc.subject | sentiment analysis | en |
dc.subject | recommender system | en |
dc.subject | named-entity recognition | en |
dc.subject | electronic thesis and dissertation | en |
dc.subject | tobacco documents | en |
dc.subject | search optimization | en |
dc.title | Text Analytics and Machine Learning (TML) CS5604 Fall 2019 | en |
dc.type | Presentation | en |
dc.type | Report | en |
dc.type | Software | en |