Collection Management Webpages - Fall 2016 CS5604

Dao, Tung; Wakeley, Christopher; Weigang, Liu

dc.contributor.author	Dao, Tung	en
dc.contributor.author	Wakeley, Christopher	en
dc.contributor.author	Weigang, Liu	en
dc.date.accessioned	2017-03-25T03:04:18Z	en
dc.date.available	2017-03-25T03:04:18Z	en
dc.date.issued	2017-03-23	en
dc.description.abstract	The Collection Management Webpages (CMW) team is responsible for collecting, processing, and storing webpages from different sources, including tweets from multiple collections and contributors, such as those related to events and trends studied in local projects like IDEAL/GETAR, and webpage archives collected by Pranav Nakate, Mohamed Farag, and others. Based on these webpage sources, we divide our work into three deliverable and manageable tasks. The first task is to fetch the webpages mentioned in the tweets collected by the Collection Management Tweets (CMT) team. Those webpages are then stored in WARC files, processed, and loaded into HBase. The second task is to run focused crawls for all of the events mentioned in IDEAL/GETAR to collect relevant webpages. As in the first task, we then store the webpages in WARC files, process them, and load them into HBase. We also plan to complete a third task, which is similar to the first two except that the webpages come from archives collected by people previously involved in the project. Since these tasks are time-consuming and sensitive to real-time processing requirements, it is essential that our approach be incremental, meaning that webpages are incrementally collected, processed, and stored in HBase. We have conducted multiple experiments for the first, second, and third tasks, on our local machines as well as on the cluster. For the second task, we manually collected a number of seed URLs for events, namely “South China Sea Disputes”, “USA President Election 2016”, and “South Korean President Protest”, to train the focused event crawler, and then ran the trained model on a small number of URLs, some randomly generated and some manually collected. Encouragingly, these experiments ran successfully; however, we still have to scale up the experimental data so that the experiments can be run systematically on the cluster. 
The two main components to be further improved and tested are the HBase data connector and handler, and the focused event crawler. While focusing on our own tasks, the CMW team works closely with the other teams whose inputs and outputs depend on ours. For example, the front-end (FE) team might use our results for their front-end content. We held discussions with the Classification (CLA) team to reach agreement on filtering and noise-reduction tasks. We also made sure that we would get URLs in the right format from the Collection Management Tweets (CMT) team. In addition, two other teams, Clustering and Topic Analysis (CTA) and SOLR, will use our team’s outputs for topic analysis and indexing, respectively. For instance, based on the SOLR team’s requests and consensus, we have finalized a schema (i.e., specific fields of information) for each webpage to be collected and stored. In this final report, we present the CMW team’s overall results and progress. Essentially, this report is a revised version of our three interim reports, based on comments from Dr. Fox and the peer reviewers. In addition to these revisions, we continue to report our ongoing work, challenges, processes, evaluations, and plans.	en
dc.description.notes	This submission includes the following files: 1- CS5604Fall2016_CMW_Report (in Word and PDF format): the final report describing the team's overall work and findings. 2- CS5604Fall2016_CMW_Presentation (in PowerPoint and PDF format): the final presentation the team gave to the class. 3- CS5604Fall2016_CMW_Software.zip: contains scripts that 3.1- fetch webpages in HTML and save them into WARC files; 3.2- save webpages into HBase; 3.3- run the event focus crawler (efc) to collect webpages. 4- CS5604Fall2016_CMW_efcData.zip: contains data generated by the efc.	en
dc.description.sponsorship	NSF IIS-1319578 and 1619028	en
dc.identifier.uri	http://hdl.handle.net/10919/76675	en
dc.language.iso	en_US	en
dc.publisher	Virginia Tech	en
dc.rights	Creative Commons Attribution 3.0 United States	en
dc.rights.uri	http://creativecommons.org/licenses/by/3.0/us/	en
dc.subject	Information Retrieval	en
dc.subject	Web Crawling	en
dc.subject	Webpage Collection	en
dc.subject	Focused Crawler	en
dc.subject	WARC	en
dc.title	Collection Management Webpages - Fall 2016 CS5604	en
dc.type	Dataset	en
dc.type	Presentation	en
dc.type	Report	en
dc.type	Software	en