Tweet URL Analysis

dc.contributor.author: Li, Liyan [en]
dc.contributor.author: Lyu, Kehan [en]
dc.contributor.author: Sun, Guoxin [en]
dc.date.accessioned: 2018-05-11T07:15:13Z [en]
dc.date.available: 2018-05-11T07:15:13Z [en]
dc.date.issued: 2018-05-02 [en]
dc.description.abstract: The goal of the GETAR project is to devise interactive, integrated digital library/archive systems coupled with linked and expert-curated web-page/tweet collections. In this class team project, we designed a URL analysis system that takes a tweet collection as input and uses Hadoop and Spark to extract short URLs. We expanded the short URLs, fetched the web page at each corresponding long URL, and used the WayBack CDX Server API to attempt to restore the most likely snapshot. We then conducted a systematic URL analysis for different types of events. We analyzed nine tweet collections in four categories: Nature, Health, Man-made, and Particular Event. Each tweet collection contains tweets from 2013-2017 related to a specific keyword. For each collection, we analyzed several characteristics of the URLs, the top-k domains of the URLs, the URL retrieval rate, and the URL retrieval rate boosted by using the WayBack CDX Server API. We provide several visualizations of the results from these nine tweet collections. We have refined this project so that it is easy to build on; see section 5 (Developer Manual) in the final report for details. [en]
dc.description.notes: TUA_source_code.zip - source code for this project; [TUA_final_report.pdf, TUA_final_report.zip (report in LaTeX)] - final reports in both PDF and LaTeX formats; [TUA_final_presentation.pdf, TUA_final_presentation.pptx] - final presentations in both PDF and .pptx formats. [en]
dc.description.sponsorship: NSF grant IIS-1619028 [en]
dc.identifier.uri: http://hdl.handle.net/10919/83219 [en]
dc.language.iso: en_US [en]
dc.publisher: Virginia Tech [en]
dc.rights: Creative Commons Attribution-NonCommercial 3.0 United States [en]
dc.rights.uri: http://creativecommons.org/licenses/by-nc/3.0/us/ [en]
dc.subject: Digital Library [en]
dc.subject: Web-page [en]
dc.subject: Data Mining [en]
dc.subject: Hadoop [en]
dc.subject: Scala [en]
dc.subject: Tweet Collections [en]
dc.subject: URLs [en]
dc.title: Tweet URL Analysis [en]
dc.type: Dataset [en]
dc.type: Presentation [en]
dc.type: Report [en]
dc.type: Software [en]
dc.type: Other [en]
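
The abstract describes extracting URLs from tweets, ranking their domains, and querying the WayBack CDX Server API for archived snapshots. The following is a minimal Python sketch of those steps, not the project's actual code (the bundled source uses Hadoop and Spark). All function names here are hypothetical, and picking the first capture with status 200 is an assumption, since the record does not say how the "most likely snapshot" was selected. Short-URL expansion and page fetching need network access and are left out.

```python
import re
from collections import Counter
from urllib.parse import urlencode, urlsplit

URL_PATTERN = re.compile(r"https?://[^\s]+")
# Public CDX endpoint of the Internet Archive's Wayback Machine.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def extract_urls(tweet_text):
    """Return every http(s) URL in a tweet, with trailing punctuation removed."""
    return [u.rstrip(".,;:!?)\"'") for u in URL_PATTERN.findall(tweet_text)]


def top_k_domains(urls, k):
    """Count URLs per host name and return the k most common (domain, count) pairs."""
    hosts = (urlsplit(u).netloc.lower() for u in urls)
    return Counter(h for h in hosts if h).most_common(k)


def cdx_query_url(long_url, limit=1):
    """Build a CDX query for captures of long_url that returned HTTP 200."""
    params = {
        "url": long_url,
        "output": "json",            # JSON rows; the first row is a header
        "filter": "statuscode:200",  # assumption: only successful captures are useful
        "limit": limit,
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def snapshot_url(cdx_rows):
    """Turn a parsed CDX JSON response into a replay URL, or None if no capture exists."""
    if len(cdx_rows) < 2:  # empty response, or header row only
        return None
    record = dict(zip(cdx_rows[0], cdx_rows[1]))
    return "http://web.archive.org/web/{timestamp}/{original}".format(**record)
```

In use, the string returned by `cdx_query_url` would be requested over HTTP, the JSON body parsed into a list of rows, and that list passed to `snapshot_url`. The share of URLs for which this returns a replay URL after a direct fetch has failed corresponds to the boosted retrieval rate the abstract mentions.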

Files

Original bundle
Name: TUA_source_code.zip; Size: 271.87 KB
Name: TUA_final_report.pdf; Size: 3.63 MB; Format: Adobe Portable Document Format
Name: TUA_final_report.zip; Size: 4.01 MB
Name: TUA_final_presentation.pdf; Size: 1.94 MB; Format: Adobe Portable Document Format
Name: TUA_final_presentation.pptx; Size: 6.53 MB; Format: Microsoft Powerpoint XML

License bundle
Name: license.txt; Size: 1.5 KB; Format: Item-specific license agreed upon to submission