Collection Management for IDEAL
The collection management portion of the information retrieval system has three major tasks. The first task is to perform incremental update of the new data flow from the tweet MySQL database to HDFS and then to HBase. Secondly, for the raw tweets coming into HBase, we are supposed to clean them. Duplicated URLs should be discarded. Also important is to conduct noise reduction. Finally, for the cleaned tweets and webpages, we should do Named Entity Recognition (NER), from which we extract out the information like person, organization, and location names.
First, based on existing data flow from the tweet MySQL database to HBase in the IDEAL system, we developed a Sqoop script to import new tweets from MySQL to HDFS. Then another Pig script is run to transfer them into HBase. Afterwards, for raw tweets in HBase, we run a noise reduction module to remove non-ASCII characters, extract hashtags, mentions and URLs from tweet text. Similar procedures were also performed for raw webpage records provided by the GRAs for this project. All the cleaned data for the 6 small collections have been uploaded into HBase with pre-defined schemas documented in this report. Then all the other teams like classification and clustering can consume our cleaned data.
Besides what has been done so far, it is desirable to do NER, which tries to extract structured information such as person, organization and location from unstructured text. But due to time limitations, this must be relegated to future work. Also needed is automating the webpage crawling and cleaning processes, which are essential after incremental update. That would expand URLs extracted from tweets in HBase first, and then crawl the corresponding webpages after invalid URL removal. Finally, extracted useful information in webpages would be stored into HBase.