
    Rdoc2vec CS4624 Project for Spring 2017

    View/Open
    r_doc2vec-master.zip (300.0 KB) - Downloads: 86
    Rdoc2vecPresentation.pdf (1.779 MB) - Downloads: 224
    Rdoc2vecReport.pdf (1.039 MB) - Downloads: 9906
    Rdoc2vecReport.docx (1.302 MB) - Downloads: 167
    Rdoc2vecPresentation.pptx (2.312 MB) - Downloads: 92
    Date
    2017-04-28
    Author
    Cooke, Austin
    Clark, Jake
    Rolph, Steven
    Sherrard, Stephen
    Abstract
    This submission includes the deliverables for the capstone project Rdoc2vec, created by Jake Clark, Austin Cooke, Steven Rolph, and Stephen Sherrard for their client, Eastman Chemical Corporation. Doc2Vec is a machine learning model that builds a vector space from one or more groupings of text: every word that occurs in the analyzed documents is placed into the space, and the distance between two vectors indicates how similar the words are, so words that appear in similar contexts end up close together. Researchers have used this algorithm for document analysis, primarily through the Gensim Python library. Eastman Chemical Corporation would like to use the same approach in a language better suited to their business model; much of their software is statistical and written in R. Our job therefore had the following components: become familiar with Doc2Vec and R, develop Rdoc2vec, and apply it to parse documents, build a vector space, and run tests.

    First, to become familiar with the language, we spent a few weeks on tutorials, including the Lynda library provided by Virginia Tech. Once comfortable with R, we studied the two dominant training algorithms, Distributed Bag of Words (DBOW) and Distributed Memory (DM), after which we were prepared to begin development.

    Second, we developed a class structure similar to Gensim's. Using this skeleton, we wrote a parsing algorithm to drive model training: the parser takes a list of documents stored on the system, analyzes them, computes the frequency of each word, and passes those frequencies along the pipeline. The next step was a neural network for training the model. We elected to use the nnet package, a neural network library that ships with R. A neural network takes an input vector as its starting point; for our purposes it made sense to use a “one-hot” vector, in which exactly one element is 1 and the rest are 0. This cuts down on later calculation because only one input element is nonzero, so multiplying the input by the weight matrix amounts to selecting a single row of it. The input is multiplied by a set of weights to produce the hidden layer, which nnet handles, and the hidden-layer values are multiplied by a second set of weights to produce the output layer.

    After writing functions that call nnet, we began work on testing. In the meantime, we also started designing our own neural network implementation. Building the network ourselves would address the main drawback of nnet, optimization: nnet is a black box we cannot modify, so we cannot be sure it is optimized for our application, and because one-hot vectors are not its default use case, there is likely room to improve speed in a library of our own. We were not able to finish and test this network, so it is left for future groups to work on.
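    As a rough illustration of the training step just described, the sketch below feeds one-hot input vectors through nnet's single hidden layer with a softmax output and reads the input-to-hidden weights back out as word vectors. It is not taken from the project deliverables; the vocabulary, training pairs, and hidden-layer size are invented for illustration.

        library(nnet)

        ## Toy vocabulary; in Rdoc2vec this would come from the parser.
        vocab <- c("chemical", "polymer", "plant", "process", "safety")
        V <- length(vocab)

        ## Build a one-hot vector: exactly one element is 1, the rest are 0.
        one_hot <- function(word) {
          v <- numeric(V)
          v[match(word, vocab)] <- 1
          v
        }

        ## Invented (input word, context word) pairs standing in for a parsed corpus.
        pairs <- data.frame(input   = c("chemical", "polymer", "plant", "process"),
                            context = c("polymer",  "process", "safety", "plant"),
                            stringsAsFactors = FALSE)

        x <- t(sapply(pairs$input,   one_hot))  # one-hot inputs, one row per pair
        y <- t(sapply(pairs$context, one_hot))  # one-hot targets for the softmax output

        ## Ten hidden units give ten-dimensional vectors; softmax predicts the context word.
        set.seed(1)
        model <- nnet(x, y, size = 10, softmax = TRUE, maxit = 200, trace = FALSE)

        ## nnet stores weights unit by unit with the bias first, so the first
        ## (V + 1) * 10 values are the input-to-hidden weights; dropping the bias
        ## row leaves one ten-dimensional vector per vocabulary word.
        embeddings <- matrix(model$wts[1:((V + 1) * 10)], nrow = V + 1)[-1, ]
        rownames(embeddings) <- vocab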
    Finally, we began testing. We created a Web scraper that collected a number of Wikipedia articles, specifically pages on the congressional districts of several states, giving us document sets that are quite large when several states are combined and smaller when individual states are analyzed. We performed tests on these datasets, and the results are kept with our code.
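    For a sense of what such a scraper can look like, here is a minimal sketch using the rvest package; the package choice and the example district pages are assumptions for illustration and may differ from what the project actually used (the project's own code is in r_doc2vec-master.zip).

        library(rvest)

        ## Fetch one Wikipedia article and keep only its paragraph text.
        scrape_article_text <- function(url) {
          page  <- read_html(url)
          paras <- html_text(html_nodes(page, "p"))
          paste(paras, collapse = " ")
        }

        ## Example congressional-district pages, chosen only to illustrate the idea.
        urls <- c("https://en.wikipedia.org/wiki/Virginia%27s_1st_congressional_district",
                  "https://en.wikipedia.org/wiki/Virginia%27s_2nd_congressional_district")

        corpus <- vapply(urls, scrape_article_text, character(1))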
    URI
    http://hdl.handle.net/10919/77622
    Collections
    • CS4624: Multimedia, Hypertext, and Information Access [229]
