Collection Management Webpages
dc.contributor.author | Eagan, Mackenzie | en |
dc.contributor.author | Liang, Xiao | en |
dc.contributor.author | Michael, Louis | en |
dc.contributor.author | Patil, Supritha | en |
dc.date.accessioned | 2017-12-26T15:45:39Z | en |
dc.date.available | 2017-12-26T15:45:39Z | en |
dc.date.issued | 2017-12-25 | en |
dc.description.abstract | The Collection Management Webpages team is responsible for collecting, processing, and storing webpages from different sources. Our team familiarized ourselves with the tools and data required to produce the specified output that was used by other teams in this class (Fall 2017 CS 5604). Input includes URLs generated by the Event Focused Crawler (EFC), URLs obtained from tweets by the Collection Management Tweets team, and webpage content from Web Archive (WARC) files from the Internet Archive or other sources. Our team fetches raw HTML from the obtained URLs and extracts HTML from WARC files. From this raw data, we obtain metadata about the corresponding webpage. The raw data is also cleaned and processed for other teams' consumption. This processing is accomplished using various Python libraries. The cleaned information is made available in a variety of formats, including tokens, stemmed or lemmatized text, and text tagged with parts of speech. Both the raw and processed webpage data are stored in HBase and intermediately in HDFS (Hadoop Distributed File System). Our team successfully executed all individual portions of our proposed process. We successfully ran the EFC and obtained URLs from these runs. Using these URLs, we created WARC files. We obtained the raw HTML, extracted metadata from it, and cleaned and processed the webpage information before uploading it to HBase. We iteratively expanded the functionality of our cleaning and processing scripts in order to provide more relevant information to other groups. We processed and cleaned information from WARC files provided by the instructor in a similar manner. We have acquired webpage data from URLs obtained by the Collection Management Tweets (CMT) team. At this time, however, there is no end-to-end process in place.
Due to the volume of data our team has been dealing with, we explored various methods for parallelizing and speeding up our processes. Our team used the PySpark library for obtaining information from URLs and the multiprocessing library in Python for processing information stored in WARC files. | en |
dc.description.notes | A breakdown of the attached files: final-report-cmw.pdf - A full-length report detailing the efforts of the Collection Management Webpages (CMW) team in CS 5604, as a PDF document. final-report-cmw.zip - A zip file of all of the relevant resources used to create the report, including our LaTeX and bibliography files, as well as images used in the report. SupportingFilesAndScritps.zip - The relevant files that were developed as part of our efforts, including cleaning scripts and example tab-separated value (TSV) files. FinalPresentation.pdf - A PDF version of the presentation the group gave at the conclusion of the semester. FinalPresentation.pptx - A PowerPoint version of the presentation, provided so that it can be edited by teams working to expand on this project. | en |
dc.description.sponsorship | National Science Foundation | en |
dc.description.sponsorship | NSF Grant IIS-1619028 | en |
dc.identifier.uri | http://hdl.handle.net/10919/81428 | en |
dc.language.iso | en_US | en |
dc.publisher | Virginia Polytechnic Institute and State University | en |
dc.rights | In Copyright | en |
dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | en |
dc.subject | Collections Management Webpages | en |
dc.subject | Webpages | en |
dc.subject | Web Crawling | en |
dc.subject | Crawling | en |
dc.subject | Hadoop | en |
dc.subject | HDFS | en |
dc.subject | HBase | en |
dc.subject | WARC | en |
dc.subject | Information Storage and Retrieval | en |
dc.title | Collection Management Webpages | en |
dc.type | Dataset | en |
dc.type | Presentation | en |
dc.type | Report | en |
dc.type | Software | en |
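
The cleaning step described in the abstract (raw HTML in, tokens out, with work spread across processes) can be illustrated with a minimal sketch. This is not the team's script: the names `clean_html` and `clean_many` are hypothetical, the Python standard library stands in for the unspecified "various Python libraries", and stemming, lemmatization, and part-of-speech tagging are left out.

```python
import re
from html.parser import HTMLParser
from multiprocessing import Pool


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def clean_html(raw_html):
    """Hypothetical helper: strip markup and return lowercase word tokens."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(parser.parts)
    return re.findall(r"[a-z0-9]+", text.lower())


def clean_many(pages, workers=2):
    """Hypothetical helper: clean several pages in parallel worker processes."""
    with Pool(processes=workers) as pool:
        return pool.map(clean_html, pages)
```

On platforms that start worker processes by spawning, `clean_many` should be called from under an `if __name__ == "__main__":` guard. A real pipeline would also need to read records out of WARC files and write results to HDFS or HBase, which this sketch does not attempt.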
Files
Original bundle
License bundle
- Name: license.txt
- Size: 1.5 KB
- Format: Item-specific license agreed upon to submission