dc.contributor.author: Eagan, Mackenzie
dc.contributor.author: Liang, Xiao
dc.contributor.author: Michael, Louis
dc.contributor.author: Patil, Supritha
dc.date.accessioned: 2017-12-26T15:45:39Z
dc.date.available: 2017-12-26T15:45:39Z
dc.date.issued: 2017-12-25
dc.identifier.uri: http://hdl.handle.net/10919/81428
dc.description.abstract: The Collection Management Webpages team is responsible for collecting, processing, and storing webpages from different sources. Our team familiarized itself with the tools and data needed to produce the output used by other teams in this class (Fall 2017 CS 5604). Input includes URLs generated by the Event Focused Crawler (EFC), URLs obtained from tweets by the Collection Management Tweets (CMT) team, and webpage content from Web Archive (WARC) files from the Internet Archive or other sources. Our team fetches raw HTML from the obtained URLs and extracts HTML from WARC files. From this raw data, we obtain metadata about the corresponding webpage. The raw data is also cleaned and processed for other teams' consumption. This processing is accomplished using various Python libraries. The cleaned information is made available in a variety of formats, including tokens, stemmed or lemmatized text, and text tagged with parts of speech. Both the raw and processed webpage data are stored in HBase, with HDFS (Hadoop Distributed File System) used for intermediate storage. Our team successfully executed all individual portions of our proposed process. We ran the EFC and obtained URLs from these runs. Using these URLs, we created WARC files. We obtained the raw HTML, extracted metadata from it, and cleaned and processed the webpage information before uploading it to HBase. We iteratively expanded the functionality of our cleaning and processing scripts in order to provide more relevant information to other groups. We processed and cleaned information from WARC files provided by the instructor in a similar manner. We have also acquired webpage data from URLs obtained by the CMT team. At this time, however, there is no end-to-end process in place.
Due to the volume of data our team has been dealing with, we explored various methods for parallelizing and speeding up our processes. Our team used the PySpark library for obtaining information from URLs and the multiprocessing library in Python for processing information stored in WARC files. [en_US]
dc.description.sponsorship: National Science Foundation [en_US]
dc.description.sponsorship: NSF Grant IIS-1619028 [en_US]
dc.language.iso: en_US [en_US]
dc.publisher: Virginia Polytechnic Institute and State University [en_US]
dc.subject: Collections Management Webpages [en_US]
dc.subject: Webpages [en_US]
dc.subject: Web Crawling [en_US]
dc.subject: Crawling [en_US]
dc.subject: Hadoop [en_US]
dc.subject: HDFS [en_US]
dc.subject: HBase [en_US]
dc.subject: WARC [en_US]
dc.subject: Information Storage and Retrieval [en_US]
dc.title: Collection Management Webpages [en_US]
dc.type: Dataset [en_US]
dc.type: Presentation [en_US]
dc.type: Report [en_US]
dc.type: Software [en_US]
dc.description.notes: A breakdown of the attached files: final-report-cmw.pdf - A full-length report detailing the efforts of the Collection Management Webpages (CMW) team in CS 5604, as a PDF document. final-report-cmw.zip - A zip file of all of the relevant resources used to create the report, including our LaTeX and bibliography files, as well as images used in the report. SupportingFilesAndScritps.zip - The relevant files that were developed as a part of our efforts, including cleaning scripts and example tab-separated value (TSV) files. FinalPresentation.pdf - A PDF version of the presentation the group gave at the conclusion of the semester. FinalPresentation.pptx - A PowerPoint version of the presentation, provided so that it can be edited by teams working to expand on this project. [en_US]
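
The abstract describes cleaning raw HTML into tokens and using Python's multiprocessing library to speed up processing. A minimal sketch of that general idea follows. It is not the team's actual scripts (those are in the attached zip file), and the abstract does not name the cleaning libraries used, so this sketch assumes only the Python standard library. All function and class names here (clean_html, tokenize, process_page, process_pages, _TextExtractor) are hypothetical, and stemming, lemmatization, part-of-speech tagging, WARC parsing, and HBase upload are deliberately left out.

```python
import re
from html.parser import HTMLParser
from multiprocessing import Pool


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def clean_html(raw_html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def process_page(raw_html):
    """Clean one page and return its tokens (hypothetical pipeline step)."""
    return tokenize(clean_html(raw_html))


def process_pages(pages, workers=2):
    """Process many pages in parallel with a pool of worker processes."""
    with Pool(workers) as pool:
        return pool.map(process_page, pages)
```

When run as a script, process_pages should be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely on platforms that spawn rather than fork.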

