Reducing Noise for IDEAL

Wang, Xiangwen; Chandrasekar, Prashant

dc.contributor.author	Wang, Xiangwen	en
dc.contributor.author	Chandrasekar, Prashant	en
dc.date.accessioned	2015-05-15T04:06:13Z	en
dc.date.available	2015-05-15T04:06:13Z	en
dc.date.issued	2015-05-12	en
dc.description.abstract	The corpora for which we are building an information retrieval system consist of tweets and web pages (extracted from URL links that might be included in the tweets) that have been selected based on rudimentary string matching provided by the Twitter API. As a result, the corpora are inherently noisy and contain a lot of irrelevant information. The noise includes non-English documents, off-topic articles, and extraneous content within documents, such as stop words, whitespace characters, non-alphanumeric characters, icons, broken links, HTML/XML tags, script code, and CSS style sheets. In our attempt to build an efficient information retrieval system for events, through Solr, we are devising a matching system for the corpora by adding various facets and other properties to serve as dimensions for each document. These dimensions function as additional criteria that will enhance the matching, and thereby the retrieval mechanism, of Solr. They are metadata from classification, clustering, named entities, topic modeling, and social graph scores implemented by other teams in the class. Each of these initiatives must be precise if it is to improve the matching and retrieval system, and the quality of each team's work depends, directly or indirectly, on the quality of the data provided to it. Noisy data will skew the results, and each team would otherwise need to perform additional tasks to remove it before executing its core functionality. It is our role and responsibility to remove irrelevant content, or “noisy data”, from the corpora. For both tweets and web pages, we kept and cleaned only entries written in English and discarded the rest. For tweets, we first extracted user handle information, URLs, and hashtags. We then cleaned up the tweet text by removing non-ASCII character sequences and standardized the text using case folding, stemming, and stop word removal. 
For the scope of this project, we considered cleaning only HTML-formatted web pages and entries in plain text file format. All other entries (or documents), such as videos and images, were discarded. For the “valid” entries, we extracted the URLs within the web pages to enumerate the outgoing links. Using the Python package readability, we were able to strip advertisement, header, and footer content. We were able to organize the remaining content and extract the article text using another Python package, beautifulsoup4. We completed the cleanup by standardizing the text through non-ASCII character removal, stemming, stop word removal, and case folding. As a result, 14 tweet collections and 9 web page collections were cleaned and indexed into Solr for retrieval.	en
dc.description.sponsorship	NSF grant IIS - 1319578, III: Small: Integrated Digital Event Archiving and Library (IDEAL)	en
dc.identifier.uri	http://hdl.handle.net/10919/52340	en
dc.language.iso	en_US	en
dc.rights	Creative Commons CC0 1.0 Universal Public Domain Dedication	en
dc.rights.uri	http://creativecommons.org/publicdomain/zero/1.0/	en
dc.subject	Information Retrieval	en
dc.subject	CS5604	en
dc.subject	Noise Reduction	en
dc.subject	Natural Language Processing	en
dc.title	Reducing Noise for IDEAL	en
dc.type	Presentation	en
dc.type	Software	en
dc.type	Technical report	en
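
The tweet-cleaning steps named in the abstract (extracting handles, URLs, and hashtags, then removing non-ASCII sequences, case folding, stop word removal, and stemming) can be illustrated with a minimal sketch. This is not the authors' code. The function names, the regular expressions, the tiny stop word list, and the crude suffix-stripping stemmer are all assumptions made for illustration, as is the choice to delete the extracted handles, URLs, and hashtags from the text and to filter stop words before stemming.

```python
import re

# Hypothetical, deliberately tiny stop word list; the report's actual list is not given.
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "at"}

# Assumed patterns; real tweets may need more careful matching.
URL_RE = re.compile(r"https?://\S+")
HANDLE_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")
TOKEN_RE = re.compile(r"[a-z0-9]+")


def naive_stem(token):
    """Crude suffix stripping, standing in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def clean_tweet(text):
    """Return extracted handles, URLs, and hashtags plus the standardized text."""
    # Pull out URLs first so that '@' or '#' inside a URL is not misread.
    urls = URL_RE.findall(text)
    text = URL_RE.sub(" ", text)
    handles = HANDLE_RE.findall(text)
    hashtags = HASHTAG_RE.findall(text)
    # Assumption: extracted items are removed from the text that gets indexed.
    text = HANDLE_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    # Drop non-ASCII character sequences, then case-fold and tokenize.
    text = NON_ASCII_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text.lower())
    # Remove stop words, then stem what remains.
    tokens = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    return {
        "handles": handles,
        "urls": urls,
        "hashtags": hashtags,
        "text": " ".join(tokens),
    }


if __name__ == "__main__":
    print(clean_tweet("RT @user: Flooding reported in Houston http://t.co/abc #flood"))
```

A production pipeline would likely swap `naive_stem` for an established stemmer and use a fuller stop word list; the sketch only shows how the stages fit together.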