Evaluation of Word and Paragraph Embeddings and Analogical Reasoning as an Alternative to Term Frequency-Inverse Document Frequency-based Classification in Support of Biocuration

dc.contributor.author: Sullivan, Daniel Edward [en]
dc.contributor.committeechair: Wattam, Alice Rebecca [en]
dc.contributor.committeemember: Bevan, David R. [en]
dc.contributor.committeemember: Hoops, Stefan [en]
dc.contributor.committeemember: Marathe, Madhav Vishnu [en]
dc.contributor.department: Animal and Poultry Sciences [en]
dc.date.accessioned: 2017-11-30T07:00:20Z [en]
dc.date.available: 2017-11-30T07:00:20Z [en]
dc.date.issued: 2016-06-07 [en]
dc.description.abstract: This research addresses the question: can unsupervised learning generate a representation that improves on the commonly used term frequency-inverse document frequency (TF-IDF) representation by capturing semantic relations? The analysis measures the quality of sentence classification using TF-IDF representations and finds a practical upper limit to precision and recall in a biomedical text classification task (F1-score of 0.85). Arguably, one could use ontologies to supplement TF-IDF, but ontologies are sparse in coverage and costly to create. This prompts a related question: can unsupervised learning capture semantic relations at least as well as existing ontologies, and thus supplement existing sparse ontologies? A shallow neural network implementing the Skip-Gram algorithm is used to generate semantic vectors from a corpus of approximately 2.4 billion words. The ability to capture meaning is assessed by comparing the generated semantic vectors with MeSH. Results indicate that semantic vectors trained by unsupervised methods capture comparable levels of semantic features in some cases, such as amino acids (92% of similarity represented in MeSH), but perform substantially worse in more expansive topics, such as pathogenic bacteria (37.8% of similarity represented in MeSH). Possible explanations for this difference in performance are proposed, along with a method to combine manually curated ontologies with semantic vector spaces to produce a more comprehensive representation than either alone. Semantic vectors are also used as representations for paragraphs, which, when used for classification, achieve an F1-score of 0.92. The results of classification and analogical reasoning tasks are promising, but a formal model of semantic vectors, subject to the constraints of known linguistic phenomena, is needed.
This research includes initial steps for developing a formal model of semantic vectors based on a combination of linear algebra and fuzzy set theory, subject to the semantic molecularism linguistic model. This research is novel in its analysis of semantic vectors applied to the biomedical domain, its analysis of different performance characteristics in biomedical analogical reasoning tasks, its comparison of the semantic relations captured by vectors and by MeSH, and its initial development of a formal model of semantic vectors. [en]
dc.description.degree: Ph. D. [en]
dc.format.medium: ETD [en]
dc.identifier.other: vt_gsexam:7665 [en]
dc.identifier.uri: http://hdl.handle.net/10919/80572 [en]
dc.publisher: Virginia Tech [en]
dc.rights: In Copyright [en]
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/ [en]
dc.subject: text mining [en]
dc.subject: Machine learning [en]
dc.subject: biocuration [en]
dc.subject: linguistics [en]
dc.subject: natural language processing [en]
dc.title: Evaluation of Word and Paragraph Embeddings and Analogical Reasoning as an Alternative to Term Frequency-Inverse Document Frequency-based Classification in Support of Biocuration [en]
dc.type: Dissertation [en]
thesis.degree.discipline: Genetics, Bioinformatics, and Computational Biology [en]
thesis.degree.grantor: Virginia Polytechnic Institute and State University [en]
thesis.degree.level: doctoral [en]
thesis.degree.name: Ph. D. [en]

Files

Original bundle
Name:
Sullivan_DE_D_2016.pdf
Size:
3.72 MB
Format:
Adobe Portable Document Format