Clustering and Topic Analysis in CS 5604 Information Retrieval Fall 2016

dc.contributor.author: Bartolome, Abigail [en]
dc.contributor.author: Islam, M. D. [en]
dc.contributor.author: Vundekode, Soumya [en]
dc.date.accessioned: 2016-12-18T02:22:28Z [en]
dc.date.available: 2016-12-18T02:22:28Z [en]
dc.date.issued: 2016-12-08 [en]
dc.description.abstract: The IDEAL (Integrated Digital Event Archiving and Library) and Global Event and Trend Archive Research (GETAR) projects aim to build a robust Information Retrieval (IR) system by retrieving tweets and webpages from social media and the World Wide Web, and indexing them to be easily retrieved and analyzed. The project has been divided into different segments - Classification (CLA), Collection Management (tweets - CMT and webpages - CMW), Clustering and Topic Analysis (CTA), SOLR, and Front-End (FE). In building IR systems, documents are scored for relevance. To assist in determining a document’s relevance to a query, it is useful to know what topics are associated with the documents and what other documents relate to it. We, as the CTA team, used topic analysis and clustering techniques to aid in building this IR system. Our contributions were useful in scoring which documents are most relevant to a user’s query. We ran clustering and topic analysis algorithms on collections of tweets and webpages to identify the most discussed topics and grouped them into clusters along with their respective probabilities. We also labeled the topics and clusters, aiming for intuitive labels. The report and presentation cover the background, requirements, design and implementation of our contributions to this project. We evaluated the quality of our methodologies and describe improvements or future work that could be done to extend our project. Furthermore, we include a user manual and a developer manual to assist in any future work that may come from our efforts. [en]
dc.description.notes: CTA_presentation.pdf: final presentation in PDF; CTA_presentation.pptx: final presentation in PPT; CTA_FinalReport.pdf: final report in PDF; CTA_FinalReport.docx: final report in docx format; CTA_source_code.zip: source code required for the completion of the project; CTA_results.zip: final clusters and final topics for real world events [en]
dc.description.sponsorship: NSF: IIS-1319578 [en]
dc.description.sponsorship: NSF: IIS-1619028 [en]
dc.identifier.uri: http://hdl.handle.net/10919/73712 [en]
dc.language.iso: en_US [en]
dc.publisher: Virginia Tech [en]
dc.rights: Creative Commons Attribution 3.0 United States [en]
dc.rights.uri: http://creativecommons.org/licenses/by/3.0/us/ [en]
dc.subject: information retrieval [en]
dc.subject: clustering [en]
dc.subject: topic analysis [en]
dc.title: Clustering and Topic Analysis in CS 5604 Information Retrieval Fall 2016 [en]
dc.type: Dataset [en]
dc.type: Presentation [en]
dc.type: Report [en]
dc.type: Software [en]

Files

Original bundle
Name: CTA_results.zip
Size: 51.19 KB
Name: CTA_source_code.zip
Size: 4.57 MB
Name: CTA_presentation.pdf
Size: 857.22 KB
Format: Adobe Portable Document Format
Name: CTA_presentation.pptx
Size: 1.72 MB
Format: Microsoft Powerpoint XML
Name: CTA_FinalReport.pdf
Size: 1.22 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission