Show simple item record

dc.contributor.author    Ma, Yufeng
dc.contributor.author    Nan, Dong
dc.date.accessioned    2016-05-07T19:51:14Z
dc.date.available    2016-05-07T19:51:14Z
dc.date.issued    2016-05-04
dc.identifier.uri    http://hdl.handle.net/10919/70930
dc.description    This submission describes the work of the Collection Management team in the IDEAL project, whose main goal was designing and developing a distributed search engine. It includes the project term reports, the final presentation slides, and the source code and datasets developed. Our team's main responsibilities were to perform incremental update from the MySQL database to HBase, to conduct noise reduction on raw tweets and webpages, and to perform named entity recognition on the cleaned data.    en_US
dc.description.abstract    The collection management portion of the information retrieval system has three major tasks. The first is to perform incremental update of the new data flowing from the tweet MySQL database to HDFS and then to HBase. The second is to clean the raw tweets arriving in HBase: duplicated URLs should be discarded, and noise reduction must be conducted. The third is to perform Named Entity Recognition (NER) on the cleaned tweets and webpages, extracting information such as person, organization, and location names. Building on the existing data flow from the tweet MySQL database to HBase in the IDEAL system, we developed a Sqoop script to import new tweets from MySQL to HDFS; a Pig script then transfers them into HBase. For the raw tweets in HBase, we run a noise reduction module that removes non-ASCII characters and extracts hashtags, mentions, and URLs from the tweet text. Similar procedures were applied to the raw webpage records provided by the GRAs for this project. The cleaned data for the 6 small collections have been uploaded into HBase with pre-defined schemas documented in this report, so that the other teams, such as classification and clustering, can consume our cleaned data. Beyond what has been done so far, it is desirable to perform NER, which extracts structured information such as person, organization, and location names from unstructured text; due to time limitations, this must be relegated to future work. Also needed is automating the webpage crawling and cleaning processes, which are essential after incremental update: URLs extracted from tweets in HBase would first be expanded, invalid URLs removed, the corresponding webpages crawled, and the useful information extracted from them stored into HBase.    en_US
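The noise-reduction step described in the abstract (stripping non-ASCII characters and extracting hashtags, mentions, and URLs from tweet text) can be sketched as follows. This is a minimal illustrative sketch, not the project's actual module: the function name `clean_tweet`, the regular expressions, and the returned field names are assumptions.

```python
import re

# Illustrative patterns for the entities the abstract says are extracted.
URL_RE = re.compile(r'https?://\S+')
HASHTAG_RE = re.compile(r'#\w+')
MENTION_RE = re.compile(r'@\w+')

def clean_tweet(raw: str) -> dict:
    """Return cleaned ASCII text plus extracted hashtags, mentions, and URLs."""
    urls = URL_RE.findall(raw)
    hashtags = HASHTAG_RE.findall(raw)
    mentions = MENTION_RE.findall(raw)
    # Remove URLs first so hashtag/mention patterns never match inside a URL.
    text = URL_RE.sub('', raw)
    text = HASHTAG_RE.sub('', text)
    text = MENTION_RE.sub('', text)
    # Drop non-ASCII characters, then normalize whitespace.
    text = text.encode('ascii', 'ignore').decode('ascii')
    text = ' '.join(text.split())
    return {'text': text, 'hashtags': hashtags, 'mentions': mentions, 'urls': urls}
```

The ordering of the substitutions matters: removing URLs before hashtags prevents a `#fragment` inside a URL from being misread as a hashtag.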
dc.description.sponsorship    NSF grant IIS-1319578, III: Small: Integrated Digital Event Archiving and Library (IDEAL)    en_US
dc.language.iso    en_US    en_US
dc.rights    CC0 1.0 Universal    *
dc.rights.uri    http://creativecommons.org/publicdomain/zero/1.0/    *
dc.subject    Information Retrieval    en_US
dc.subject    Noise Reduction    en_US
dc.subject    Incremental Update    en_US
dc.subject    Named Entity Recognition    en_US
dc.subject    Tweet    en_US
dc.subject    Webpage    en_US
dc.title    Collection Management for IDEAL    en_US
dc.type    Presentation    en_US
dc.type    Software    en_US
dc.type    Technical report    en_US



License: CC0 1.0 Universal