Tweet URL Analysis

The goal of the GETAR project is to devise interactive, integrated digital library/archive systems coupled with linked, expert-curated collections of web pages and tweets. In this class team project, we designed a URL analysis system that takes a tweet collection as input and uses Hadoop and Spark to extract short URLs. We expanded each short URL, fetched the web page at the corresponding long URL, and used the Wayback CDX Server API to attempt to restore the most likely snapshot. We then conducted a systematic URL analysis across different types of events. We analyzed nine tweet collections in four categories: Nature, Health, Man-made, and Particular Event. Each collection contains tweets from 2013-2017 related to a specific keyword. For each collection, we analyzed several characteristics of the URLs, the top-k domains of the URLs, the URL retrieval rate, and the improvement in retrieval rate obtained with the Wayback CDX Server API. We provide several visualizations of the results from these nine collections. We have refined the project so that it is easy to build on; see Section 5 (Developer Manual) of the final report for details.
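
The steps described above (extracting URLs from tweet text, ranking domains, and querying the Wayback CDX Server for a snapshot) can be illustrated with a minimal single-machine sketch. This is not the project's Hadoop/Spark code; the function names, the URL regular expression, and the sample inputs are assumptions made for illustration, and the sketch only builds the CDX query URL rather than calling the service.

```python
import re
from collections import Counter
from urllib.parse import urlencode, urlparse

# Assumed pattern: any http(s) URL up to the next whitespace or quote character.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(tweet_text):
    """Return the URLs found in one tweet, with trailing punctuation stripped."""
    return [u.rstrip(".,;:!?)") for u in URL_PATTERN.findall(tweet_text)]


def top_k_domains(urls, k):
    """Count host names across (already expanded) URLs and return the k most common."""
    domains = [urlparse(u).netloc.lower() for u in urls]
    return Counter(d for d in domains if d).most_common(k)


def cdx_query_url(long_url, year_from, year_to):
    """Build a Wayback CDX Server query for one successful capture of long_url.

    The year range mirrors the 2013-2017 span of the collections; restricting
    to HTTP 200 captures and a single row is an illustrative choice.
    """
    params = {
        "url": long_url,
        "output": "json",
        "from": str(year_from),
        "to": str(year_to),
        "filter": "statuscode:200",
        "limit": "1",
    }
    return "http://web.archive.org/cdx/search/cdx?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical tweets and expanded URLs, for demonstration only.
    tweets = [
        "Flood warning http://t.co/abc123, stay safe",
        "See https://t.co/xyz and http://t.co/abc123.",
    ]
    short_urls = [u for t in tweets for u in extract_urls(t)]
    print(short_urls)

    expanded = [
        "http://www.example.com/a",
        "https://www.example.com/b",
        "http://news.example.org/c",
    ]
    print(top_k_domains(expanded, 1))
    print(cdx_query_url("http://example.com/page", 2013, 2017))
```

Short-URL expansion itself (following redirects to the long URL) requires network access and is left out here; in a distributed setting the same per-tweet functions could be applied inside a map step.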

Keywords

Digital Library, Web-page, Data Mining, Hadoop, Scala, Tweet Collections, URLs

Persistent link

http://hdl.handle.net/10919/83219

Collections

CS4624: Multimedia, Hypertext, and Information Access
