Collection Management for IDEAL

The collection management portion of the information retrieval system has three major tasks. The first is incremental update: moving new data from the tweet MySQL database to HDFS and then into HBase. The second is cleaning the raw tweets that arrive in HBase, which includes discarding duplicated URLs and reducing noise. The third is Named Entity Recognition (NER) on the cleaned tweets and webpages, which extracts information such as person, organization, and location names.

First, building on the existing data flow from the tweet MySQL database to HBase in the IDEAL system, we developed a Sqoop script to import new tweets from MySQL into HDFS. A Pig script then transfers them into HBase. Next, we run a noise reduction module on the raw tweets in HBase; it removes non-ASCII characters and extracts hashtags, mentions, and URLs from the tweet text. Similar procedures were applied to the raw webpage records provided by the GRAs for this project. The cleaned data for all 6 small collections have been uploaded into HBase using the pre-defined schemas documented in this report, so that other teams, such as classification and clustering, can consume our cleaned data.
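
The noise reduction step described above can be illustrated with a minimal Python sketch. It is not the team's actual module: the function name `clean_tweet`, the regular expressions, and the output field names are assumptions made for illustration, and the real HBase schema is the one documented in the report.

```python
import re

# Hypothetical patterns; the project's actual rules may differ.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def clean_tweet(text):
    """Extract URLs, hashtags, and mentions, and drop non-ASCII characters."""
    # Pull URLs out first so their contents are not mistaken for tags.
    urls = URL_RE.findall(text)
    without_urls = URL_RE.sub(" ", text)

    # Hashtags and mentions are returned without their leading marker.
    hashtags = HASHTAG_RE.findall(without_urls)
    mentions = MENTION_RE.findall(without_urls)

    # Remove non-ASCII characters, then collapse runs of whitespace.
    ascii_only = without_urls.encode("ascii", "ignore").decode("ascii")
    clean_text = " ".join(ascii_only.split())

    return {
        "clean_text": clean_text,
        "hashtags": hashtags,
        "mentions": mentions,
        "urls": urls,
    }
```

In this sketch the hashtags and mentions stay in the cleaned text and are also returned as separate fields; whether they should be stripped from the text is a design choice the source does not specify.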

Beyond what has been done so far, it is desirable to perform NER, which extracts structured information such as person, organization, and location names from unstructured text. Due to time limitations, this is left to future work. Also needed is automation of the webpage crawling and cleaning processes, which must run after each incremental update. That process would first expand the URLs extracted from tweets in HBase, remove invalid URLs, and then crawl the corresponding webpages. Finally, the useful information extracted from the webpages would be stored in HBase.
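
As a rough illustration of the invalid URL removal step in the proposed crawling pipeline, the following Python sketch filters a list of URLs and discards duplicates. The function name `valid_unique_urls` and the validity rule (an http or https scheme plus a non-empty host) are assumptions for illustration only. URL expansion and the crawl itself require network access and are not shown.

```python
from urllib.parse import urlsplit


def valid_unique_urls(urls):
    """Keep well-formed http(s) URLs, dropping duplicates and preserving order."""
    seen = set()
    kept = []
    for url in urls:
        try:
            parts = urlsplit(url.strip())
        except ValueError:
            # urlsplit rejects some malformed inputs, e.g. a broken IPv6 host.
            continue
        # Assumed validity rule: http or https scheme and a non-empty host.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        normalized = parts.geturl()
        if normalized in seen:
            continue
        seen.add(normalized)
        kept.append(normalized)
    return kept
```

A call such as `valid_unique_urls(["http://t.co/abc", "http://t.co/abc", "not a url"])` would keep only the first entry.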

This submission describes the work of the Collection Management team within the IDEAL project, whose main goal is to design and develop a distributed search engine. It includes the project term reports, the final presentation slides, and the source code and dataset developed. The main responsibilities of our team were to perform incremental update from the MySQL database to HBase, conduct noise reduction on raw tweets and webpages, and finally perform named entity recognition on the cleaned data.
Information Retrieval, Noise Reduction, Incremental Update, Named Entity Recognition, Tweet, Webpage