Building CTRnet Digital Library Services using Archive-It and LucidWorks Big Data Software

Chitturi, Kiran

dc.contributor.author	Chitturi, Kiran	en
dc.contributor.committeechair	Fox, Edward A.	en
dc.contributor.committeemember	Sheetz, Steven D.	en
dc.contributor.committeemember	Yao, Danfeng (Daphne)	en
dc.contributor.department	Computer Science	en
dc.date.accessioned	2014-03-28T08:00:17Z	en
dc.date.available	2014-03-28T08:00:17Z	en
dc.date.issued	2014-03-27	en
dc.description.abstract	When a crisis occurs, information flows rapidly on the Web through social media, blogs, and news articles. The shared information captures the reactions, impacts, and responses from the government as well as the public. Later, researchers, scholars, students, and others seek information about earlier events, sometimes for cross-event analysis or comparison. Very few integrated systems both collect and permanently archive information about an event and provide access to that crisis information. In this thesis, we describe the CTRnet Digital Library and Archive, which aims to permanently archive crisis event information by using Archive-It services and then provide access to the archived information by using LucidWorks Big Data software. Through the LucidWorks Big Data (LWBD) software, we take advantage of text extraction, clustering, similarity, annotation, and indexing services and build digital libraries with the generated metadata, which helps the system stakeholders locate information about an event. In this study, we collected data for 46 crisis events using Archive-It. We built a CTRnet DL prototype and its services for the "Boston Marathon Bombing" collection by using the components of LucidWorks Big Data. Running LucidWorks Big Data on a 30-node Hadoop cluster accelerates sub-workflow processing and also provides fault-tolerant execution. The LWBD sub-workflows "ingest" and "extract" processed the textual data present in the WARC files. The other sub-workflows, "kmeans", "simdoc", and "annotate", helped group the search results, delete duplicates, and provide metadata for additional facets in the CTRnet DL prototype, respectively.	en
dc.description.degree	Master of Science	en
dc.format.medium	ETD	en
dc.identifier.other	vt_gsexam:1340	en
dc.identifier.uri	http://hdl.handle.net/10919/46865	en
dc.publisher	Virginia Tech	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	Digital Library Services	en
dc.subject	CTRnet	en
dc.subject	Internet Archive	en
dc.subject	LucidWorks	en
dc.subject	Big Data	en
dc.subject	Crises	en
dc.subject	Archive-It	en
dc.title	Building CTRnet Digital Library Services using Archive-It and LucidWorks Big Data Software	en
dc.type	Thesis	en
thesis.degree.discipline	Computer Science and Applications	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	masters	en
thesis.degree.name	Master of Science	en

Files

Original bundle

Name: Chitturi_K_T_2014.pdf
Size: 2.4 MB
Format: Adobe Portable Document Format

Name: Chitturi_K_T_2014_support_1.pdf
Size: 79.42 KB
Format: Adobe Portable Document Format
Description: Supporting documents

Collections

Masters Theses