dc.contributor.author: Dao, Tung (en)
dc.contributor.author: Wakeley, Christopher (en)
dc.contributor.author: Weigang, Liu (en)
dc.description.abstract: The Collection Management Webpages (CMW) team is responsible for collecting, processing, and storing webpages from different sources, including tweets from multiple collections and contributors (such as those related to events and trends studied in local projects like IDEAL/GETAR) and webpage archives collected by Pranav Nakate, Mohamed Farag, and others. Based on these webpage sources, we divide our work into the following three deliverable, manageable tasks. The first task is to fetch the webpages mentioned in the tweets collected by the Collection Management Tweets (CMT) team. Those webpages are then stored in WARC files, processed, and loaded into HBase. The second task is to run focused crawls for all of the events mentioned in IDEAL/GETAR to collect relevant webpages. As in the first task, we then store the webpages in WARC files, process them, and load them into HBase. The third task is similar to the first two, except that the webpages come from archives collected by people previously involved in the project. Since these tasks are time-consuming and sensitive to real-time processing requirements, our approach must be incremental: webpages need to be incrementally collected, processed, and stored in HBase. We have conducted multiple experiments for all three tasks, on our local machines as well as on the cluster. For the second task, we manually collected a number of seed URLs for events, namely “South China Sea Disputes”, “USA President Election 2016”, and “South Korean President Protest”, to train the focused event crawler, and then ran the trained model on a small number of URLs, some randomly generated and some manually collected. Encouragingly, these experiments ran successfully; however, we still have to scale up the experimental data so that it can be run systematically on the cluster.
The two main components to be further improved and tested are the HBase data connector and handler, and the focused event crawler. While focusing on our own tasks, the CMW team works closely with other teams whose inputs and outputs depend on our team. For example, the front-end (FE) team might use our results for their front-end content. We discussed filtering and noise-reduction tasks with the Classification (CLA) team and reached some agreements. We also made sure that we would get URLs in the right format from the CMT team. In addition, the other two teams, Clustering and Topic Analysis (CTA) and SOLR, will use our team’s outputs for topic analysis and indexing, respectively. For instance, based on the SOLR team’s requests and consensus, we have finalized a schema (i.e., specific fields of information) for a webpage to be collected and stored. This final report presents the CMW team’s overall results and progress. Essentially, it is a revised version of our three interim reports, based on comments from Dr. Fox and peer reviewers. In addition to these revisions, we continue to report our ongoing work, challenges, processes, evaluations, and plans. (en)
dc.description.sponsorship: NSF IIS-1319578 and 1619028 (en)
dc.publisher: Virginia Tech (en)
dc.rights: Creative Commons Attribution 3.0 United States (en)
dc.subject: Information Retrieval (en)
dc.subject: Web Crawling (en)
dc.subject: Webpage Collection (en)
dc.subject: Focused Crawler (en)
dc.title: Collection Management Webpages - Fall 2016 CS5604 (en)
dc.description.notes: This submission includes the following files:
1- CS5604Fall2016_CMW_Report (in Word and PDF format): the final report describing the team's overall work and findings.
2- CS5604Fall2016_CMW_Presentation (in PowerPoint and PDF format): the final presentation the team presented before the class.
3- contains scripts that:
3.1- fetch webpages in HTML and save them into WARC
3.2- save webpages into HBase
3.3- run event focus crawler (efc) to collect webpages
4- contains data generated by the efc. (en)

