English Wikipedia on Hadoop Cluster

dc.contributor.author: Stulga, Steven [en]
dc.date.accessioned: 2016-05-07T22:28:32Z [en]
dc.date.available: 2016-05-07T22:28:32Z [en]
dc.date.issued: 2016-05-04 [en]
dc.description: CS 4624 Multimedia/Hypertext/Information Retrieval Final Project. Files submitted: CS4624WikipediaHadoopReport.docx - Final Report in DOCX; CS4624WikipediaHadoopReport.pdf - Final Report in PDF; CS4624WikipediaHadoopPresentation.pptx - Final Presentation in PPTX; CS4624WikipediaHadoopPresentation.pdf - Final Presentation in PDF; wikipedia_hadoop.zip - Project files and data [en]
dc.description.abstract: Developing and testing big data software requires a big dataset. The full English Wikipedia dataset would serve well for testing and benchmarking purposes. Loading this dataset onto a system such as an Apache Hadoop cluster, and indexing it into Apache Solr, would allow researchers and developers at Virginia Tech to benchmark configurations and big data analytics software. This project imports the full English Wikipedia into an Apache Hadoop cluster and indexes it with Apache Solr so that it can be searched. A prototype was designed and implemented. A small subset of the Wikipedia data was unpacked and imported into Apache Hadoop’s HDFS. The entire Wikipedia dataset was also downloaded onto a Hadoop cluster at Virginia Tech. A portion of the dataset was converted from XML to Avro and imported into HDFS on the cluster. Future work would be to finish unpacking the full dataset and repeat the steps carried out with the prototype system for all of Wikipedia. Unpacking the remaining data, converting it to Avro, and importing it into HDFS can be done with minimal adjustments to the script written for this job. Run continuously, this job would take an estimated 30 hours to complete. [en]
dc.description.sponsorship: NSF IIS - 1319578: III: Small: Integrated Digital Event Archiving and Library (IDEAL) [en]
dc.description.sponsorship: Shivam Maharshi [en]
dc.description.sponsorship: Sunshin Lee [en]
dc.description.sponsorship: Edward Fox [en]
dc.identifier.uri: http://hdl.handle.net/10919/70932 [en]
dc.language.iso: en_US [en]
dc.rights: Creative Commons Attribution 3.0 United States [en]
dc.rights.uri: http://creativecommons.org/licenses/by/3.0/us/ [en]
dc.subject: Wikipedia [en]
dc.subject: Hadoop Cluster [en]
dc.subject: Solr [en]
dc.subject: XML [en]
dc.subject: Avro [en]
dc.subject: Apache [en]
dc.title: English Wikipedia on Hadoop Cluster [en]
dc.type: Dataset [en]
dc.type: Presentation [en]
dc.type: Software [en]
dc.type: Technical report [en]
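
The abstract describes converting the Wikipedia XML dump to Avro before importing it into HDFS. The project's own script is not reproduced on this page, so the following is only an illustrative Python sketch of what the first step of such a conversion might look like: streaming page records out of the dump XML without loading the whole file. The function names, the record fields, and the tiny sample document (including its IDs) are assumptions made for illustration. To stay within the standard library, the sketch emits JSON lines; a real pipeline would hand each record to an Avro writer instead.

```python
import io
import json
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield one dict per <page> element, parsing the stream incrementally."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        record = {"title": None, "id": None, "text": ""}
        # elem.iter() walks in document order, so the first <id> seen is
        # the page id; later <id> elements (revision ids) are ignored.
        for node in elem.iter():
            name = local_name(node.tag)
            if name == "title" and record["title"] is None:
                record["title"] = node.text
            elif name == "id" and record["id"] is None:
                record["id"] = int(node.text)
            elif name == "text":
                record["text"] = node.text or ""
        yield record
        elem.clear()  # drop the page's children to limit memory growth


# Hypothetical one-page sample standing in for the real dump file.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Apache Hadoop</title>
    <ns>0</ns>
    <id>12</id>
    <revision>
      <id>345</id>
      <text>Hadoop is a framework.</text>
    </revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE.encode("utf-8"))))
for page in pages:
    print(json.dumps(page))  # one JSON record per line
```

Because the parser works on a stream, the same function could in principle read from a decompressing file object (for example one opened with Python's `bz2.open`) rather than an in-memory sample.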

Files

Original bundle
Name: wikipedia_hadoop.zip
Size: 352.38 MB
Format: (not specified)
Description: Project materials and prototype

Name: CS4624WikipediaHadoopPresentation.pdf
Size: 240.72 KB
Format: Adobe Portable Document Format
Description: Report Presentation PDF

Name: CS4624WikipediaHadoopPresentation.pptx
Size: 288.53 KB
Format: Microsoft Powerpoint XML
Description: Report Presentation PPTX

Name: CS4624WikipediaHadoopReport.docx
Size: 653.36 KB
Format: Microsoft Word XML
Description: Final Report DOCX

Name: CS4624WikipediaHadoopReport.pdf
Size: 967.21 KB
Format: Adobe Portable Document Format
Description: Final Report PDF
License bundle
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission
Description: (not specified)