Rdoc2vec CS4624 Project for Spring 2017

Date

2017-04-28

Publisher

Virginia Tech

Abstract

This submission includes the deliverables for the capstone project Rdoc2vec, created by Jake Clark, Austin Cooke, Steven Rolph, and Stephen Sherrard for their client, Eastman Chemical Corporation. Doc2Vec is a machine learning model that builds a vector space whose elements are the words from one or more bodies of text. When a set of documents is analyzed, every word that occurs in them is placed into the vector space. The distance between two vectors indicates how similar the corresponding words are: words that appear in similar contexts lie close together. Researchers have used this algorithm for document analysis, primarily through the Gensim Python library. Our client would like to use this approach in a language better suited to their business; much of their software is statistical and written in R. Our job therefore had the following components: become familiar with Doc2Vec and R, develop Rdoc2vec, and apply it to parse documents, create a vector space, and run tests.

First, to become familiar with R, we spent a few weeks on tutorials, including those in the Lynda library provided by Virginia Tech. Once we were comfortable with the language, we studied two of the dominant algorithms used by Doc2Vec, Distributed Bag-of-Words (DBOW) and Distributed Memory (DM). With both algorithms understood, we felt prepared to begin development.

Second, we developed a class structure similar to that of Gensim. Using this as a skeleton, we developed a parser whose output is used to train the model. The parser takes a list of documents stored on the local system, computes how often each word occurs, and passes these word frequencies along the pipeline. The next step was to create a neural network for training the model.
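The frequency-counting step of the parser can be illustrated with a minimal sketch. It is written in Python rather than R, it is not the project's code, and it makes several assumptions: the documents are already loaded as strings instead of being read from files, the helper name `word_frequencies` is hypothetical, and the tokenization rule (lowercase runs of letters and apostrophes) is chosen only for illustration.

```python
import re
from collections import Counter


def word_frequencies(documents):
    """Count how often each lowercased word occurs across all documents.

    Hypothetical helper for illustration; the real parser reads files
    from disk and may tokenize differently.
    """
    counts = Counter()
    for text in documents:
        # Assumed tokenization: runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


# Two tiny in-memory "documents" standing in for files on disk.
docs = ["The district covers the coast.", "The coast is long."]
freqs = word_frequencies(docs)
print(freqs.most_common(2))
```

A mapping from word to count is a convenient form to pass along a pipeline, since later stages can use it to build a vocabulary or to discard rare words.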
We elected to use nnet, the neural network library bundled with R. A neural network takes an initial input vector as a parameter. For our purposes, it made sense to use a “one-hot” vector, which has a single nonzero entry. This can cut down on later calculation because only the weights attached to that entry contribute to the hidden layer. The input is then multiplied by a set of weights to produce the hidden layer, a step handled by the nnet library. The hidden-layer values are in turn multiplied by a second set of weights to produce the output layer. After creating functions that called the nnet library, we began work on testing. In the meantime, we began designing our own neural network implementation. Writing one from scratch would address our main concern with nnet, which is optimization: because nnet is a black box that we cannot modify, we cannot be sure that it is optimized for our application. Since “one-hot” inputs are not its typical use case, it is likely that our library could be made faster. We were not able to finish and test our own neural network, so it is left for future groups to work on.

Finally, we tested the library. We created a Web scraper to collect articles from Wikipedia, specifically articles on the congressional districts of several states. This gave us document sets that can be quite large when several states are combined, or smaller when individual states are analyzed. We performed tests on these datasets and kept the results with our code.
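The saving offered by a one-hot input can be shown with a small sketch. It is written in Python, does not call nnet, and omits activation functions and training entirely; the layer sizes, the random weights, and the helper names `one_hot` and `matvec` are assumptions made for illustration. It shows only that multiplying a one-hot vector by the input weight matrix gives the same result as reading out a single row of that matrix.

```python
import random


def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at the given index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def matvec(vec, weights):
    """Multiply a row vector of length n by an n x m weight matrix."""
    cols = len(weights[0])
    return [sum(vec[i] * weights[i][j] for i in range(len(vec)))
            for j in range(cols)]


random.seed(0)
vocab_size, hidden_size = 5, 3  # illustrative sizes only

# Input-to-hidden and hidden-to-output weights, randomly initialized.
w_in = [[random.uniform(-1, 1) for _ in range(hidden_size)]
        for _ in range(vocab_size)]
w_out = [[random.uniform(-1, 1) for _ in range(vocab_size)]
         for _ in range(hidden_size)]

x = one_hot(2, vocab_size)       # word number 2 in the vocabulary
hidden_full = matvec(x, w_in)    # full multiplication
hidden_lookup = w_in[2]          # shortcut: read out row 2 directly
output = matvec(hidden_full, w_out)

print(hidden_full == hidden_lookup)
```

A general-purpose library that always performs the full multiplication does work proportional to the vocabulary size for each input, while the row lookup does not, which is one plausible place for a custom implementation to gain speed.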

Keywords

R, Doc2Vec, Machine Learning, Software, Open Source
