Collection Management Webpages

dc.contributor.author: Eagan, Mackenzie [en]
dc.contributor.author: Liang, Xiao [en]
dc.contributor.author: Michael, Louis [en]
dc.contributor.author: Patil, Supritha [en]
dc.date.accessioned: 2017-12-26T15:45:39Z [en]
dc.date.available: 2017-12-26T15:45:39Z [en]
dc.date.issued: 2017-12-25 [en]
dc.description.abstract: The Collection Management Webpages team is responsible for collecting, processing, and storing webpages from different sources. Our team familiarized ourselves with the tools and data needed to produce the specified output, which was used by other teams in this class (Fall 2017 CS 5604). Input includes URLs generated by the Event Focused Crawler (EFC), URLs obtained from tweets by the Collection Management Tweets (CMT) team, and webpage content from Web Archive (WARC) files from the Internet Archive or other sources. Our team fetches raw HTML from the obtained URLs and extracts HTML from WARC files. From this raw data, we obtain metadata about the corresponding webpage. The raw data is also cleaned and processed for other teams' consumption. This processing is accomplished using various Python libraries. The cleaned information is made available in a variety of formats, including tokens, stemmed or lemmatized text, and text tagged with parts of speech. Both the raw and processed webpage data are stored in HBase, with HDFS (Hadoop Distributed File System) used for intermediate storage. Our team successfully executed all individual portions of our proposed process. We ran the EFC and obtained URLs from these runs. Using these URLs, we created WARC files. We obtained the raw HTML, extracted metadata from it, and cleaned and processed the webpage information before uploading it to HBase. We iteratively expanded the functionality of our cleaning and processing scripts in order to provide more relevant information to other groups. We processed and cleaned information from WARC files provided by the instructor in a similar manner. We have also acquired webpage data from URLs obtained by the CMT team. At this time, however, no end-to-end process is in place.
Due to the volume of data our team has been dealing with, we explored various methods for parallelizing and speeding up our processes. Our team used the PySpark library for obtaining information from URLs and the multiprocessing library in Python for processing information stored in WARC files. [en]
dc.description.notes: A breakdown of the attached files: final-report-cmw.pdf - A full-length report detailing the efforts of the Collection Management Webpages (CMW) team in CS 5604, as a PDF document. final-report-cmw.zip - A zip file of all of the relevant resources used to create the report, including our LaTeX and bibliography files, as well as images used in the report. SupportingFilesAndScripts.zip - The relevant files that were developed as a part of our efforts, including cleaning scripts and example tab-separated value (TSV) files. FinalPresentation.pdf - A PDF version of the presentation the group gave at the conclusion of the semester. FinalPresentation.pptx - A PowerPoint version of the presentation, provided so that it can be edited by teams working to expand on this project. [en]
dc.description.sponsorship: National Science Foundation [en]
dc.description.sponsorship: NSF Grant IIS-1619028 [en]
dc.identifier.uri: http://hdl.handle.net/10919/81428 [en]
dc.language.iso: en_US [en]
dc.publisher: Virginia Polytechnic Institute and State University [en]
dc.rights: In Copyright [en]
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/ [en]
dc.subject: Collections Management Webpages [en]
dc.subject: Webpages [en]
dc.subject: Web Crawling [en]
dc.subject: Crawling [en]
dc.subject: Hadoop [en]
dc.subject: HDFS [en]
dc.subject: HBase [en]
dc.subject: WARC [en]
dc.subject: Information Storage and Retrieval [en]
dc.title: Collection Management Webpages [en]
dc.type: Dataset [en]
dc.type: Presentation [en]
dc.type: Report [en]
dc.type: Software [en]
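
The cleaning step described in the abstract (raw HTML in, tokens out, parallelized with Python's multiprocessing library) can be illustrated with a minimal sketch. This is not the team's script, which is in SupportingFilesAndScripts.zip; it assumes only the Python standard library, and the names `clean_html` and `clean_many` are hypothetical. Stemming, lemmatization, and part-of-speech tagging are left out because the record does not say which libraries were used for them.

```python
import re
from html.parser import HTMLParser
from multiprocessing import Pool


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def clean_html(raw_html):
    """Return lowercase alphanumeric tokens from one page's visible text."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(parser.chunks)
    return re.findall(r"[a-z0-9]+", text.lower())


def clean_many(pages, workers=2):
    """Tokenize many pages in parallel (hypothetical helper, process pool)."""
    with Pool(processes=workers) as pool:
        return pool.map(clean_html, pages)
```

When run as a script, `clean_many` should be called under an `if __name__ == "__main__":` guard so that worker processes can import `clean_html` safely. The regular expression keeps ASCII letters and digits only, which is a simplification; real webpage text would likely need Unicode-aware tokenization.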

Files

Original bundle
Name:
SupportingFilesAndScripts.zip
Size:
5.79 MB
Format:
Name:
FinalPresentation.pdf
Size:
196.01 KB
Format:
Adobe Portable Document Format
Name:
FinalPresentation.pptx
Size:
363.13 KB
Format:
Microsoft Powerpoint XML
Name:
final-report-cmw.pdf
Size:
500.39 KB
Format:
Adobe Portable Document Format
Name:
final-report-cmw.zip
Size:
322 KB
Format:

License bundle
Name:
license.txt
Size:
1.5 KB
Format:
Item-specific license agreed upon to submission