Common Crawl Mining

dc.contributor.author: Dean, Tommy (en)
dc.contributor.author: Pasha, Ali (en)
dc.contributor.author: Clarke, Brian (en)
dc.contributor.author: Butenhoff, Casey J. (en)
dc.date.accessioned: 2017-05-14T18:58:47Z (en)
dc.date.available: 2017-05-14T18:58:47Z (en)
dc.date.issued: 2017-05-10 (en)
dc.description.abstract: The main goal behind the Common Crawl Mining system is to improve Eastman Chemical Company’s ability to use timely knowledge of public concerns to inform key business decisions. It provides information to Eastman Chemical Company that is valuable for consumer chemical product marketing and strategy development. Eastman desired a system that provides insight into the current chemical landscape. Information about trends and sentiment towards chemicals over time is beneficial to their marketing and strategy departments. They wanted to be able to drill down to a particular time period and look at what people were writing about certain keywords. This project provides such information through a search interface. The interface accepts chemical names and search term keywords as input and responds with a list of web page records that match those keywords. Each record returned includes the probable publication date of the page, a score relating the page to the given keywords, and the body text extracted from the page. Although sentiment analysis was one of the stretch goals of the project, the current iteration of the Common Crawl Mining system does not provide it. It would be relatively straightforward to extend the system to perform sentiment analysis, given the appropriate training data. The final Common Crawl Mining system is a search engine implemented using Elasticsearch. Relevant records are identified by first analyzing Common Crawl for Web Archive (WARC) files that have a high frequency of records from interesting domains. Records with publication dates are then ingested into the search engine. Once the records have been indexed by Elasticsearch, users can execute searches that return a list of relevant records. Each record contains the URL, text, and publication date of the associated web page. Included in this submission are Microsoft Office and PDF versions of the Common Crawl Mining project's final presentation and final report. The final presentation outlines the project's history. The final report outlines the progress made on the project and includes a developer's and user's manual for the system. This submission also includes a compressed folder which contains all of the source code associated with the Common Crawl Mining project. (en)
dc.description.notes: ccm_source_code.zip: All of the source code associated with the Common Crawl Mining project in a compressed ZIP folder. ccm_final_presentation.pptx: The final Common Crawl Mining project presentation in Microsoft PowerPoint format. ccm_final_presentation.pdf: The final Common Crawl Mining project presentation in PDF format. ccm_final_report.docx: The final Common Crawl Mining project report in Microsoft Word format. ccm_final_report.pdf: The final Common Crawl Mining project report in PDF format. Note: Some page numbers in the report do not match those listed in its table of contents, but the mismatches should be easy to work around. (en)
dc.description.sponsorship: Eastman Chemical Company (en)
dc.identifier.uri: http://hdl.handle.net/10919/77629 (en)
dc.language.iso: en_US (en)
dc.publisher: Virginia Tech (en)
dc.rights: Creative Commons CC0 1.0 Universal Public Domain Dedication (en)
dc.rights.uri: http://creativecommons.org/publicdomain/zero/1.0/ (en)
dc.subject: Common Crawl (en)
dc.subject: Elasticsearch (en)
dc.subject: Content Mining (en)
dc.subject: Eastman Chemical Company (en)
dc.title: Common Crawl Mining (en)
dc.type: Presentation (en)
dc.type: Report (en)
dc.type: Software (en)
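
The abstract describes a search that takes chemical names and keywords, can be narrowed to a time period, and returns records carrying a URL, body text, and publication date. The sketch below shows one way such a request body could be assembled in the Elasticsearch query DSL. It is an illustration only, not the project's actual code: the function name `build_search_body` and the field names `url`, `text`, and `publication_date` are assumptions, and the example keywords are hypothetical.

```python
import json
from typing import Dict, List, Optional


def build_search_body(
    keywords: List[str],
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
    size: int = 10,
) -> Dict:
    """Build an Elasticsearch query DSL body for a keyword search.

    Field names ("url", "text", "publication_date") are assumed for
    illustration; the real index mapping may differ.
    """
    if not keywords:
        raise ValueError("at least one keyword is required")

    # Full-text match on the page body; "and" requires every term to appear.
    bool_query: Dict = {
        "must": [
            {"match": {"text": {"query": " ".join(keywords), "operator": "and"}}}
        ]
    }

    # Optional drill-down to a time period via a range filter on the date.
    date_range: Dict[str, str] = {}
    if start_date:
        date_range["gte"] = start_date
    if end_date:
        date_range["lte"] = end_date
    if date_range:
        bool_query["filter"] = [{"range": {"publication_date": date_range}}]

    return {
        "size": size,
        "_source": ["url", "text", "publication_date"],
        "query": {"bool": bool_query},
    }


if __name__ == "__main__":
    # Hypothetical keywords and dates, chosen only to show the shape of the body.
    body = build_search_body(["bisphenol", "safety"], "2016-01-01", "2016-12-31")
    print(json.dumps(body, indent=2))
```

Each hit returned by Elasticsearch for such a body carries a relevance score in its `_score` field alongside the requested source fields, which corresponds to the per-record score mentioned in the abstract.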

Files

Original bundle
Name:
ccm_source_code.zip
Size:
3.33 MB
Format:
Name:
ccm_final_presentation.pdf
Size:
136.35 KB
Format:
Adobe Portable Document Format
Name:
ccm_final_presentation.pptx
Size:
72.7 KB
Format:
Microsoft Powerpoint XML
Name:
ccm_final_report.docx
Size:
3.27 MB
Format:
Microsoft Word XML
Name:
ccm_final_report.pdf
Size:
3.12 MB
Format:
Adobe Portable Document Format
License bundle
Name:
license.txt
Size:
1.5 KB
Format:
Item-specific license agreed upon to submission