Common Crawl Mining

Date
2017-05-10
Publisher
Virginia Tech
Abstract

The main goal of the Common Crawl Mining system is to improve Eastman Chemical Company’s ability to use timely knowledge of public concerns to inform key business decisions. The system provides information that is valuable for consumer chemical product marketing and strategy development. Eastman wanted insight into the current chemical landscape: information about trends in, and sentiment toward, chemicals over time benefits its marketing and strategy departments. In particular, Eastman wanted to be able to drill down to a particular time period and see what people were writing about certain keywords.

This project provides such information through a search interface. The interface accepts chemical names and search keywords as input and responds with a list of web page records that match those keywords. Each returned record includes the probable publication date of the page, a score relating the page to the given keywords, and the body text extracted from the page. Although sentiment analysis was one of the project's stretch goals, the current iteration of the Common Crawl Mining system does not provide it. Extending the system to perform sentiment analysis would be relatively straightforward, given appropriate training data.
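As an illustration of the kind of request such an interface might send, the sketch below builds an Elasticsearch query body that matches a chemical name plus keywords and optionally restricts results to a publication date range. It is not taken from the project's source code: the function name `build_search_body` and the field names `text`, `url`, and `publication_date` are assumptions made for this example.

```python
def build_search_body(chemical, keywords, start_date=None, end_date=None, size=10):
    """Build an Elasticsearch query DSL body as a plain dict.

    The field names ("text", "url", "publication_date") are hypothetical;
    the real index mapping may differ.
    """
    # Combine the chemical name and keywords into one full-text query string.
    terms = " ".join([chemical] + list(keywords))
    query = {"bool": {"must": [{"match": {"text": terms}}]}}

    # Optionally narrow the results to a time period (drill-down by date).
    date_range = {}
    if start_date:
        date_range["gte"] = start_date
    if end_date:
        date_range["lte"] = end_date
    if date_range:
        query["bool"]["filter"] = [{"range": {"publication_date": date_range}}]

    return {
        "size": size,
        "query": query,
        "_source": ["url", "publication_date", "text"],
    }


if __name__ == "__main__":
    import json

    body = build_search_body("bisphenol A", ["safety"], "2016-01-01", "2016-12-31")
    print(json.dumps(body, indent=2))
```

Elasticsearch attaches a relevance score (`_score`) to each hit, which would correspond to the per-record score described above. The date filter only narrows the result set and does not affect that score.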

The final Common Crawl Mining system is a search engine implemented using Elasticsearch. Relevant records are identified by first analyzing Common Crawl for Web Archive (WARC) files that have a high frequency of records from interesting domains. Records with publication dates are then ingested into the search engine. Once the records have been indexed by Elasticsearch, users are able to execute searches which return a list of relevant records. Each record contains the URL, text, and publication date of the associated webpage.
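The WARC file selection step described above can be sketched as a simple counting pass. The example below is a minimal illustration, not the project's actual implementation: it assumes the record URLs for each WARC file are already available (for instance from an index listing), and the names `is_interesting` and `rank_warc_files` are hypothetical.

```python
from urllib.parse import urlsplit


def is_interesting(url, interesting_domains):
    """Return True if the URL's host is an interesting domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in interesting_domains)


def rank_warc_files(urls_by_warc, interesting_domains, min_hits=1):
    """Rank WARC files by how many of their records come from interesting domains.

    urls_by_warc maps a WARC file name to the list of record URLs it contains.
    Returns (warc_name, hit_count, hit_fraction) tuples, best candidates first.
    """
    scored = []
    for warc_name, urls in urls_by_warc.items():
        if not urls:
            continue  # skip empty listings to avoid dividing by zero
        hits = sum(1 for u in urls if is_interesting(u, interesting_domains))
        if hits >= min_hits:
            scored.append((warc_name, hits, hits / len(urls)))
    # Sort by hit count, then by fraction of interesting records, descending.
    scored.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return scored


if __name__ == "__main__":
    # Made-up file names and URLs, for illustration only.
    sample = {
        "a.warc.gz": ["http://news.example.com/1", "http://other.test/2"],
        "b.warc.gz": ["http://example.com/1", "http://www.example.com/2"],
        "c.warc.gz": ["http://other.test/3"],
    }
    for row in rank_warc_files(sample, {"example.com"}):
        print(row)
```

Only the highest-ranked files would then need to be downloaded and parsed, which keeps ingestion focused on archives likely to contain relevant pages.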

Included in this submission are Microsoft Office and PDF versions of the Common Crawl Mining project's final presentation and final report. The final presentation outlines the project's history. The final report outlines the progress made on the project and includes a developer's and user's manual for the system. This submission also includes a compressed folder which contains all of the source code associated with the Common Crawl Mining project.

Keywords
Common Crawl, Elasticsearch, Content Mining, Eastman Chemical Company