Collection Management for IDEAL

dc.contributor.author: Ma, Yufeng
dc.contributor.author: Nan, Dong
dc.date.accessioned: 2016-05-07T19:51:14Z
dc.date.available: 2016-05-07T19:51:14Z
dc.date.issued: 2016-05-04
dc.description: This submission describes the work of the Collection Management team within the IDEAL project, whose overall goal is to design and develop a distributed search engine. It includes the project term report, the final presentation slides, and the source code and dataset developed. Our team's main responsibilities were to perform incremental updates from the MySQL database to HBase, reduce noise in the raw tweets and webpages, and perform named entity recognition on the cleaned data.
dc.description.abstract: The collection management portion of the information retrieval system has three major tasks. The first is to perform incremental updates of new data flowing from the tweet MySQL database to HDFS and then into HBase. Second, the raw tweets arriving in HBase must be cleaned: duplicate URLs are discarded and noise reduction is applied. Finally, Named Entity Recognition (NER) is to be run on the cleaned tweets and webpages to extract information such as person, organization, and location names. Building on the existing data flow from the tweet MySQL database to HBase in the IDEAL system, we developed a Sqoop script that imports new tweets from MySQL into HDFS; a Pig script then transfers them into HBase. For the raw tweets in HBase, we run a noise reduction module that removes non-ASCII characters and extracts hashtags, mentions, and URLs from the tweet text. Similar procedures were applied to the raw webpage records provided by the GRAs for this project. The cleaned data for the six small collections have been uploaded into HBase with the pre-defined schemas documented in this report, so the other teams, such as classification and clustering, can consume them. Beyond the work completed so far, it remains desirable to perform NER, which extracts structured information such as person, organization, and location names from unstructured text; due to time limitations this is left as future work. Also needed is automation of the webpage crawling and cleaning processes that follow each incremental update: URLs extracted from tweets in HBase would first be expanded, the corresponding webpages crawled after invalid URLs are removed, and the useful information extracted from those webpages stored in HBase.
dc.description.sponsorship: NSF grant IIS-1319578, III: Small: Integrated Digital Event Archiving and Library (IDEAL)
dc.identifier.uri: http://hdl.handle.net/10919/70930
dc.language.iso: en_US
dc.rights: Creative Commons CC0 1.0 Universal Public Domain Dedication
dc.rights.uri: http://creativecommons.org/publicdomain/zero/1.0/
dc.subject: Information Retrieval
dc.subject: Noise Reduction
dc.subject: Incremental Update
dc.subject: Named Entity Recognition
dc.subject: Tweet
dc.subject: Webpage
dc.title: Collection Management for IDEAL
dc.type: Presentation
dc.type: Software
dc.type: Technical report
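
The noise reduction step described in the abstract (stripping non-ASCII characters and pulling hashtags, mentions, and URLs out of the tweet text) is implemented in the attached source archive rather than reproduced in this record. Purely as a rough illustration of that kind of cleaning, a minimal Python sketch might look as follows; the regular expressions, function name, and output fields are assumptions, not the team's actual code:

    import re

    # Assumed patterns for the entities the abstract says are extracted.
    HASHTAG_RE = re.compile(r'#\w+')
    MENTION_RE = re.compile(r'@\w+')
    URL_RE = re.compile(r'https?://\S+')

    def clean_tweet(text):
        """Extract hashtags, mentions, and URLs, then return ASCII-only text
        with those tokens and redundant whitespace removed."""
        hashtags = HASHTAG_RE.findall(text)
        mentions = MENTION_RE.findall(text)
        urls = URL_RE.findall(text)

        cleaned = URL_RE.sub(' ', text)
        cleaned = HASHTAG_RE.sub(' ', cleaned)
        cleaned = MENTION_RE.sub(' ', cleaned)
        # Drop non-ASCII characters, as the abstract describes.
        cleaned = cleaned.encode('ascii', 'ignore').decode('ascii')
        cleaned = re.sub(r'\s+', ' ', cleaned).strip()

        return {'clean_text': cleaned, 'hashtags': hashtags,
                'mentions': mentions, 'urls': urls}

For example, clean_tweet('Flooding near the café @VT_News #hurricane http://t.co/abc') would return the ASCII-only text 'Flooding near the caf' together with the extracted hashtag, mention, and URL lists.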

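The abstract also states that the cleaned data were uploaded into HBase under pre-defined schemas documented in the attached report, while the raw data flow itself uses a Sqoop import followed by a Pig script (neither reproduced here). Purely to illustrate how cleaned fields map onto HBase columns, here is a small Python sketch using the happybase Thrift client; the table name, column family, and row-key convention are placeholders, not the schema the team actually used:

    import happybase

    # Placeholder names for illustration; the real table and column-family
    # layout is documented in CollectionManagementReport.pdf.
    connection = happybase.Connection('hbase-thrift-host')  # HBase Thrift gateway
    table = connection.table('ideal-tweets-clean')

    def store_clean_tweet(tweet_id, record):
        """Write one cleaned tweet (e.g., the output of clean_tweet above)
        into a 'clean_tweet' column family keyed by tweet ID."""
        table.put(tweet_id.encode('utf-8'), {
            b'clean_tweet:text': record['clean_text'].encode('utf-8'),
            b'clean_tweet:hashtags': ';'.join(record['hashtags']).encode('utf-8'),
            b'clean_tweet:mentions': ';'.join(record['mentions']).encode('utf-8'),
            b'clean_tweet:urls': ';'.join(record['urls']).encode('utf-8'),
        })
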
Files

Original bundle (5 files)

Name: CollectionManagementCode&Data.zip
Size: 188.91 MB
Description: Source Code and Data

Name: CollectionManagementPresentation.pptx
Size: 10.61 MB
Format: Microsoft PowerPoint XML
Description: Presentation Slides (PowerPoint)

Name: CollectionManagementPresentation.pdf
Size: 1.51 MB
Format: Adobe Portable Document Format
Description: Presentation Slides (PDF)

Name: CollectionManagementReport.docx
Size: 6.83 MB
Format: Microsoft Word XML
Description: Term Report (Word)

Name: CollectionManagementReport.pdf
Size: 2.52 MB
Format: Adobe Portable Document Format
Description: Term Report (PDF)

License bundle (1 file)

Name: license.txt
Size: 1.5 KB
Description: Item-specific license agreed to upon submission