Collection Management for IDEAL

Ma, Yufeng; Nan, Dong

dc.contributor.author	Ma, Yufeng	en
dc.contributor.author	Nan, Dong	en
dc.date.accessioned	2016-05-07T19:51:14Z	en
dc.date.available	2016-05-07T19:51:14Z	en
dc.date.issued	2016-05-04	en
dc.description	This submission describes the work of the Collection Management team on the IDEAL project, whose main goal is to design and develop a distributed search engine. It includes the project term reports, the final presentation slides, and the source code and dataset the team developed. The team's main responsibilities were to incrementally update HBase from the MySQL database, reduce noise in raw tweets and webpages, and perform named entity recognition on the cleaned data.	en
dc.description.abstract	The collection management portion of the information retrieval system has three major tasks. The first is to incrementally update the flow of new data from the tweet MySQL database to HDFS and then to HBase. The second is to clean the raw tweets arriving in HBase: duplicate URLs should be discarded and noise should be reduced. The third is to perform Named Entity Recognition (NER) on the cleaned tweets and webpages, extracting information such as person, organization, and location names. For the first task, building on the existing data flow from the tweet MySQL database to HBase in the IDEAL system, we developed a Sqoop script that imports new tweets from MySQL into HDFS; a Pig script then transfers them into HBase. For the second task, we run a noise reduction module on the raw tweets in HBase that removes non-ASCII characters and extracts hashtags, mentions, and URLs from the tweet text. Similar procedures were applied to the raw webpage records provided by the GRAs for this project. All the cleaned data for the 6 small collections have been uploaded into HBase using the predefined schemas documented in this report, so that other teams, such as classification and clustering, can consume them. Beyond what has been done so far, it is desirable to perform NER, which extracts structured information such as person, organization, and location names from unstructured text; due to time limitations, this is left to future work. Also needed is automation of the webpage crawling and cleaning processes, which are essential after each incremental update. This would first expand the URLs extracted from tweets in HBase, then remove invalid URLs and crawl the corresponding webpages, and finally store the useful information extracted from the webpages in HBase.	en
dc.description.sponsorship	NSF grant IIS - 1319578, III: Small: Integrated Digital Event Archiving and Library (IDEAL)	en
dc.identifier.uri	http://hdl.handle.net/10919/70930	en
dc.language.iso	en_US	en
dc.rights	Creative Commons CC0 1.0 Universal Public Domain Dedication	en
dc.rights.uri	http://creativecommons.org/publicdomain/zero/1.0/	en
dc.subject	Information Retrieval	en
dc.subject	Noise Reduction	en
dc.subject	Incremental Update	en
dc.subject	Named Entity Recognition	en
dc.subject	Tweet	en
dc.subject	Webpage	en
dc.title	Collection Management for IDEAL	en
dc.type	Presentation	en
dc.type	Software	en
dc.type	Technical report	en
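
The noise reduction step described in the abstract (removing non-ASCII characters and extracting hashtags, mentions, and URLs from tweet text) might look roughly like the sketch below. This is not the team's module: the function name clean_tweet, the regular expressions, the output field names, and the choice to strip hashtags and mentions from the cleaned text are all assumptions made for illustration. The actual rules and HBase schema are documented in the report itself.

```python
import re

# Hypothetical patterns; the report's actual extraction rules are not given here.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def clean_tweet(text):
    """Return ASCII-only cleaned text plus the hashtags, mentions, and URLs found."""
    # Drop every non-ASCII character, as the abstract describes.
    ascii_text = text.encode("ascii", errors="ignore").decode("ascii")

    # Pull out URLs first so that a '#fragment' inside a URL
    # is not mistaken for a hashtag.
    urls = URL_RE.findall(ascii_text)
    without_urls = URL_RE.sub(" ", ascii_text)

    hashtags = HASHTAG_RE.findall(without_urls)
    mentions = MENTION_RE.findall(without_urls)

    # Assumption: the cleaned text omits the extracted tokens entirely.
    stripped = MENTION_RE.sub(" ", HASHTAG_RE.sub(" ", without_urls))
    clean_text = " ".join(stripped.split())  # collapse runs of whitespace

    return {
        "clean_text": clean_text,
        "hashtags": hashtags,
        "mentions": mentions,
        "urls": urls,
    }
```

A call such as clean_tweet("Flooding in Texas #flood @NWS http://t.co/abc123") returns a dictionary whose four fields could each map to a column in a cleaned-tweet table; the column layout here is illustrative only.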