CS 5604: Information Storage and Retrieval - Webpages (WP) Team

dc.contributor.author: Barry-Straume, Jostein [en]
dc.contributor.author: Vives, Cristian [en]
dc.contributor.author: Fan, Wentao [en]
dc.contributor.author: Tan, Peng [en]
dc.contributor.author: Zhang, Shuaicheng [en]
dc.contributor.author: Hu, Yang [en]
dc.contributor.author: Wilson, Tishauna [en]
dc.date.accessioned: 2020-12-18T16:31:57Z [en]
dc.date.available: 2020-12-18T16:31:57Z [en]
dc.date.issued: 2020-12-18 [en]
dc.description.abstract: The first major goal of this project is to build a state-of-the-art information retrieval engine for searching webpages and for opening up access to existing and new webpage collections resulting from Digital Library Research Laboratory (DLRL) projects relating to eventsarchive.org. The task of the Webpages (WP) team was to make any archived webpage accessible and indexed. The webpages can be obtained through event-focused crawlers or from collections of data, such as WARC files containing webpages or sets of tweets that contain embedded URLs. Toward completion of the project, the WP team worked on four major tasks: (1) making the contents of WARC files searchable through ElasticSearch; (2) making the cleaned contents of WARC files searchable through ElasticSearch; (3) running an event-focused crawler that produces WARC files; and (4) making additional extracted or derived information (e.g., dates, classes) searchable. The foundation of the software is a Docker container cluster employing Airflow, a Reasoner, and Kubernetes. The raw data of the given webpage collections is stored using the Network File System (NFS), while Ceph provides persistent storage for the Docker containers. Retrieval and analysis of the webpage collections are carried out with ElasticSearch, and visualization with Kibana. These two technologies form an Elastic Stack application, which serves as the vehicle with which the WP team indexes, maps, and stores the processed data and model outputs for the webpage collections. The software was co-designed by the seven members of the WP team, all Virginia Tech graduate students in the same computer science class, CS 5604: Information Storage and Retrieval. The course is taught by Professor Edward A. Fox, who structures the class so that his students work in a “mock” business development setting.
In other words, the academic project submitted by the WP team can, for all intents and purposes, be viewed as a microcosm of software development within a corporate structure. This submission focuses on the work of the WP team, which creates and administers Docker containers so that the various services are tested and deployed as a whole. These services pertain solely to the ingestion, cleansing, analysis, extraction, classification, and indexing of webpages and their content. [en]
dc.description.notes: WPpresentation.pdf: A PDF version of the final presentation for the 2020 Web pages Team. WPpresentation.pptx: A PowerPoint version of the final presentation for the 2020 Web pages Team. WPreport.pdf: The finalized report of the 2020 Web pages team. WPreport.zip: The Overleaf document of the 2020 Web pages team. [en]
dc.identifier.uri: http://hdl.handle.net/10919/101538 [en]
dc.language.iso: en_US [en]
dc.publisher: Virginia Tech [en]
dc.rights: Attribution-NonCommercial-NoDerivatives 4.0 International [en]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/4.0/ [en]
dc.subject: Natural Language Processing [en]
dc.subject: Information Retrieval [en]
dc.subject: Information Storage [en]
dc.subject: Webpage Indexing [en]
dc.subject: Text Classification [en]
dc.subject: Text Summarization [en]
dc.subject: Webpage Archiving [en]
dc.title: CS 5604: Information Storage and Retrieval - Webpages (WP) Team [en]
dc.type: Presentation [en]
dc.type: Report [en]

Files

Original bundle
Name: WPreport.zip; Size: 9.3 MB
Name: WPreport.pdf; Size: 3.25 MB; Format: Adobe Portable Document Format
Name: WPpresentation.pdf; Size: 5.47 MB; Format: Adobe Portable Document Format
Name: WPpresentation.pptx; Size: 10.68 MB; Format: Microsoft Powerpoint XML

License bundle
Name: license.txt; Size: 1.5 KB; Format: Item-specific license agreed upon to submission