Show simple item record

dc.contributor.author: Li, Liyan
dc.contributor.author: Lyu, Kehan
dc.contributor.author: Sun, Guoxin
dc.date.accessioned: 2018-05-11T07:15:13Z
dc.date.available: 2018-05-11T07:15:13Z
dc.date.issued: 2018-05-02
dc.identifier.uri: http://hdl.handle.net/10919/83219
dc.description.abstract: The goal of the GETAR project is to devise interactive, integrated digital library/archive systems coupled with linked and expert-curated web-page/tweet collections. In this class team project, we designed a URL analysis system that takes a tweet collection as input and uses Hadoop and Spark to extract short URLs. We expanded the short URLs, fetched the web pages at the corresponding long URLs, and used the WayBack CDX Server API to attempt to restore the most likely snapshot. We then conducted a systematic URL analysis for different types of events. We analyzed nine tweet collections in four categories: Nature, Health, Man-made, and Particular Event. Each collection contains tweets from 2013-2017 related to a specific keyword. For each collection, we analyzed several characteristics of the URLs, the top-k domains of the URLs, the URL retrieve rate, and the URL retrieve rate as boosted by the WayBack CDX Server API. We provide several visualizations of the results from these nine tweet collections. We have refined this project so that it is easy to build on; see Section 5 (Developer Manual) in the final report for details. (en_US)
dc.description.sponsorship: NSF grant IIS-1619028 (en_US)
dc.language.iso: en_US (en_US)
dc.publisher: Virginia Tech (en_US)
dc.rights: Attribution-NonCommercial 3.0 United States (*)
dc.rights.uri: http://creativecommons.org/licenses/by-nc/3.0/us/ (*)
dc.subject: Digital Library (en_US)
dc.subject: Web-page (en_US)
dc.subject: Data Mining (en_US)
dc.subject: Hadoop (en_US)
dc.subject: Scala (en_US)
dc.subject: Tweet Collections (en_US)
dc.subject: URLs (en_US)
dc.title: Tweet URL Analysis (en_US)
dc.type: Dataset (en_US)
dc.type: Presentation (en_US)
dc.type: Report (en_US)
dc.type: Software (en_US)
dc.type: Other (en_US)
dc.description.notes: TUA_source_code.zip: source code for this project. TUA_final_report.pdf and TUA_final_report.zip (report in LaTeX): final report in PDF and LaTeX formats. TUA_final_presentation.pdf and TUA_final_presentation.pptx: final presentation in PDF and .pptx formats. (en_US)
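
The abstract mentions using the WayBack CDX Server API to recover a snapshot for a long URL. The following is a minimal Python sketch of that one step, not the project's own code (the record lists Scala and Hadoop as subjects). It assumes the public CDX endpoint at web.archive.org and its JSON output, whose first row is a header containing `timestamp` and `original`. The function names `build_cdx_query` and `closest_snapshot` are hypothetical, and treating the capture nearest in time to the tweet as the "most likely" snapshot is an assumption made here for illustration.

```python
from datetime import datetime
from urllib.parse import urlencode

# Public CDX endpoint of the Internet Archive (assumed; not stated in the record).
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(long_url, limit=50):
    """Build a CDX query URL asking for successful captures of long_url as JSON."""
    params = {
        "url": long_url,
        "output": "json",
        "filter": "statuscode:200",  # keep only captures that returned HTTP 200
        "limit": limit,
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(rows, tweet_time):
    """Return the replay URL of the capture nearest to tweet_time, or None.

    rows is the parsed JSON response: a header row followed by data rows.
    Picking the nearest capture is an illustrative assumption.
    """
    if len(rows) < 2:
        return None  # no header or no captures
    header, records = rows[0], rows[1:]
    ts_idx = header.index("timestamp")
    orig_idx = header.index("original")

    def distance(record):
        # CDX timestamps are 14-digit strings such as 20160601000000.
        snap_time = datetime.strptime(record[ts_idx], "%Y%m%d%H%M%S")
        return abs((snap_time - tweet_time).total_seconds())

    best = min(records, key=distance)
    return "http://web.archive.org/web/{}/{}".format(best[ts_idx], best[orig_idx])
```

In use, the query URL would be fetched with any HTTP client, the body parsed as JSON, and the resulting rows passed to `closest_snapshot` together with the tweet's creation time.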


Files in this item


This item appears in the following Collection(s)


License: Attribution-NonCommercial 3.0 United States