English Wikipedia on Hadoop Cluster
dc.contributor.author | Stulga, Steven | en |
dc.date.accessioned | 2016-05-07T22:28:32Z | en |
dc.date.available | 2016-05-07T22:28:32Z | en |
dc.date.issued | 2016-05-04 | en |
dc.description | CS 4624 Multimedia/Hypertext/Information Retrieval Final Project. Files submitted: CS4624WikipediaHadoopReport.docx - Final Report in DOCX; CS4624WikipediaHadoopReport.pdf - Final Report in PDF; CS4624WikipediaHadoopPresentation.pptx - Final Presentation in PPTX; CS4624WikipediaHadoopPresentation.pdf - Final Presentation in PDF; wikipedia_hadoop.zip - Project files and data | en |
dc.description.abstract | Developing and testing big data software requires a large dataset. The full English Wikipedia dataset serves well for testing and benchmarking purposes. Loading this dataset onto a system such as an Apache Hadoop cluster and indexing it into Apache Solr allows researchers and developers at Virginia Tech to benchmark configurations and big data analytics software. This project covers importing the full English Wikipedia into an Apache Hadoop cluster and indexing it with Apache Solr so that it can be searched. A prototype was designed and implemented: a small subset of the Wikipedia data was unpacked and imported into Apache Hadoop's HDFS. The entire Wikipedia dataset was also downloaded onto a Hadoop cluster at Virginia Tech, and a portion of it was converted from XML to Avro and imported into HDFS on the cluster. Future work is to finish unpacking the full dataset and repeat the steps carried out with the prototype system for all of Wikipedia. Unpacking the remaining data, converting it to Avro, and importing it into HDFS can be done with minimal adjustments to the script written for this job. Run continuously, this job would take an estimated 30 hours to complete. | en |
dc.description.sponsorship | NSF IIS - 1319578: III: Small: Integrated Digital Event Archiving and Library (IDEAL) | en |
dc.description.sponsorship | Shivam Maharshi | en |
dc.description.sponsorship | Sunshin Lee | en |
dc.description.sponsorship | Edward Fox | en |
dc.identifier.uri | http://hdl.handle.net/10919/70932 | en |
dc.language.iso | en_US | en |
dc.rights | Creative Commons Attribution 3.0 United States | en |
dc.rights.uri | http://creativecommons.org/licenses/by/3.0/us/ | en |
dc.subject | Wikipedia | en |
dc.subject | Hadoop Cluster | en |
dc.subject | Solr | en |
dc.subject | XML | en |
dc.subject | Avro | en |
dc.subject | Apache | en |
dc.title | English Wikipedia on Hadoop Cluster | en |
dc.type | Dataset | en |
dc.type | Presentation | en |
dc.type | Software | en |
dc.type | Technical report | en |
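The abstract's unpack / convert-to-Avro / import pipeline centers on parsing the MediaWiki XML dump into per-page records. The sketch below shows that parsing step only; it is illustrative, not the project's actual script. The record layout is an assumed stand-in for the project's Avro schema, the serialization itself (e.g. with the avro or fastavro library) is not shown, and real MediaWiki dumps qualify elements with an XML namespace, omitted here for brevity.

```python
# Illustrative sketch: extract per-page records from MediaWiki-style XML,
# as a precursor to Avro serialization. Not the project's actual code.
import xml.etree.ElementTree as ET

# Tiny stand-in for a Wikipedia dump chunk (namespace omitted for brevity).
SAMPLE = """<mediawiki>
  <page>
    <title>Apache Hadoop</title>
    <revision><text>Hadoop is a distributed processing framework.</text></revision>
  </page>
</mediawiki>"""

def pages_to_records(xml_text):
    """Yield one dict per <page>; each dict is a record that an Avro
    writer (not shown) could serialize against a matching schema."""
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        yield {
            "title": page.findtext("title"),
            "text": page.findtext("revision/text"),
        }

records = list(pages_to_records(SAMPLE))
```

A full-dump version would stream with `ET.iterparse` rather than `fromstring`, since the complete English Wikipedia XML is far too large to hold in memory.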
Files
Original bundle
1 - 5 of 5
- Name: CS4624WikipediaHadoopPresentation.pdf
- Size: 240.72 KB
- Format: Adobe Portable Document Format
- Description: Report Presentation PDF

- Name: CS4624WikipediaHadoopPresentation.pptx
- Size: 288.53 KB
- Format: Microsoft PowerPoint XML
- Description: Report Presentation PPTX

- Name: CS4624WikipediaHadoopReport.docx
- Size: 653.36 KB
- Format: Microsoft Word XML
- Description: Final Report DOCX
- Name: CS4624WikipediaHadoopReport.pdf
- Size: 967.21 KB
- Format: Adobe Portable Document Format
- Description: Final Report PDF
License bundle
1 - 1 of 1
- Name: license.txt
- Size: 1.5 KB
- Format:
- Description: Item-specific license agreed upon to submission