Collection Management Webpages

Eagan, Mackenzie; Liang, Xiao; Michael, Louis; Patil, Supritha

dc.contributor.author	Eagan, Mackenzie	en
dc.contributor.author	Liang, Xiao	en
dc.contributor.author	Michael, Louis	en
dc.contributor.author	Patil, Supritha	en
dc.date.accessioned	2017-12-26T15:45:39Z	en
dc.date.available	2017-12-26T15:45:39Z	en
dc.date.issued	2017-12-25	en
dc.description.abstract	The Collection Management Webpages team is responsible for collecting, processing, and storing webpages from different sources. Our team familiarized ourselves with the tools and data required to produce the specified output, which was used by other teams in this class (Fall 2017 CS 5604). Input includes URLs generated by the Event Focused Crawler (EFC), URLs obtained from tweets by the Collection Management Tweets team, and webpage content from Web Archive (WARC) files from the Internet Archive or other sources. Our team fetches raw HTML from the obtained URLs and extracts HTML from WARC files. From this raw data, we obtain metadata about the corresponding webpage. The raw data is also cleaned and processed for other teams' consumption. This processing is accomplished using various Python libraries. The cleaned information is made available in a variety of formats, including tokens, stemmed or lemmatized text, and text tagged with parts of speech. Both the raw and processed webpage data are stored in HBase and intermediately in HDFS (Hadoop Distributed File System). Our team successfully executed all individual portions of our proposed process. We successfully ran the EFC and obtained URLs from these runs. Using these URLs, we created WARC files. We obtained the raw HTML, extracted metadata from it, and cleaned and processed the webpage information before uploading it to HBase. We iteratively expanded the functionality of our cleaning and processing scripts in order to provide more relevant information to other groups. We processed and cleaned information from WARC files provided by the instructor in a similar manner. We have acquired webpage data from URLs obtained by the Collection Management Tweets (CMT) team. At this time, however, there is no end-to-end process in place.
Due to the volume of data our team has been dealing with, we explored various methods for parallelizing and speeding up our processes. Our team used the PySpark library for obtaining information from URLs and the multiprocessing library in Python for processing information stored in WARC files.	en
dc.description.notes	A breakdown of the attached files: final-report-cmw.pdf - A full-length report detailing the efforts of the Collection Management Webpages (CMW) team in CS5604, as a PDF document. final-report-cmw.zip - A zip file of all of the relevant resources used to create the report, including our LaTeX and bibliography files, as well as images used in the report. SupportingFilesAndScritps.zip - The relevant files that were developed as a part of our efforts, including cleaning scripts and example tab-separated value (TSV) files. FinalPresentation.pdf - A PDF version of the presentation the group gave at the conclusion of the semester. FinalPresentation.pptx - A PowerPoint version of the presentation, provided so that it can be edited by teams working to expand on this project.	en
dc.description.sponsorship	National Science Foundation	en
dc.description.sponsorship	NSF Grant IIS-1619028	en
dc.identifier.uri	http://hdl.handle.net/10919/81428	en
dc.language.iso	en_US	en
dc.publisher	Virginia Polytechnic Institute and State University	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	Collections Management Webpages	en
dc.subject	Webpages	en
dc.subject	Web Crawling	en
dc.subject	Crawling	en
dc.subject	Hadoop	en
dc.subject	HDFS	en
dc.subject	HBase	en
dc.subject	WARC	en
dc.subject	Information Storage and Retrieval	en
dc.title	Collection Management Webpages	en
dc.type	Dataset	en
dc.type	Presentation	en
dc.type	Report	en
dc.type	Software	en
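
The cleaning and tokenizing step described in the abstract can be illustrated with a minimal sketch. This is not the team's script (those are in SupportingFilesAndScritps.zip); it assumes only the Python standard library, and the names VisibleTextExtractor, clean_html, and tokenize are hypothetical. Stemming, lemmatization, and part-of-speech tagging would need an NLP library and are left out here.

```python
import re
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>.

    Hypothetical helper for illustration; not taken from the project's code.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join fragments and collapse all runs of whitespace to single spaces.
        return " ".join(" ".join(self._chunks).split())


def clean_html(raw_html):
    """Return the visible text of a raw HTML string."""
    parser = VisibleTextExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.text()


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


if __name__ == "__main__":
    raw = (
        "<html><head><title>Storm Report</title>"
        "<style>p {color: red}</style></head>"
        "<body><p>Flooding hit the coast.</p>"
        "<script>var x = 1;</script></body></html>"
    )
    cleaned = clean_html(raw)
    print(cleaned)
    print(tokenize(cleaned))
```

In a pipeline like the one described, the raw HTML string would come from a fetched URL or a WARC record, and the cleaned text and tokens would be written out (for example as TSV rows) before loading into HBase; those stages are outside this sketch.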
