Automatic Lexicon Generation for Unsupervised Part-of-Speech Tagging Using Only Unannotated Text

dc.contributor.author: Pereira, Dennis V. [en]
dc.contributor.committeechair: Egyhazy, Csaba J. [en]
dc.contributor.committeemember: Belli, Gabriella M. [en]
dc.contributor.committeemember: Frakes, William B. [en]
dc.contributor.department: Computer Science [en]
dc.date.accessioned: 2011-08-06T16:06:15Z [en]
dc.date.adate: 2004-09-02 [en]
dc.date.available: 2011-08-06T16:06:15Z [en]
dc.date.issued: 1999-05-07 [en]
dc.date.rdate: 2004-09-02 [en]
dc.date.sdate: 2004-08-24 [en]
dc.description.abstract: With the growing number of textual resources available, the ability to understand them becomes critical. An essential first step in understanding these sources is the ability to identify the parts of speech in each sentence. The goal of this research is to propose, improve, and implement an algorithm capable of finding terms (words in a corpus) that are used in similar ways--a term categorizer. Such a term categorizer can be used to find the terms belonging to a particular part of speech, e.g., nouns in a corpus, and generate a lexicon. The proposed work does not depend on any external sources of information, such as dictionaries, and it shows a significant improvement (~30%) over an existing method of categorization. More importantly, the proposed algorithm can be applied as a component of an unsupervised part-of-speech tagger, making it truly unsupervised, requiring only unannotated text. The algorithm is discussed in detail, along with its background and its performance. Experimentation shows that the proposed algorithm performs within 3% of the baseline, the Penn-TreeBank Lexicon. [en]
dc.description.degree: Master of Science [en]
dc.format.medium: ETD [en]
dc.identifier.other: etd-08242004-012316 [en]
dc.identifier.sourceurl: http://scholar.lib.vt.edu/theses/available/etd-08242004-012316 [en]
dc.identifier.uri: http://hdl.handle.net/10919/10094 [en]
dc.publisher: Virginia Tech [en]
dc.relation.haspart: 08242004_Dennis_Pereira_ETD.pdf [en]
dc.rights: In Copyright [en]
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/ [en]
dc.subject: lexicon generation [en]
dc.subject: part-of-speech [en]
dc.subject: term categorization [en]
dc.subject: lexicon [en]
dc.subject: automatic [en]
dc.title: Automatic Lexicon Generation for Unsupervised Part-of-Speech Tagging Using Only Unannotated Text [en]
dc.type: Thesis [en]
thesis.degree.discipline: Computer Science [en]
thesis.degree.grantor: Virginia Polytechnic Institute and State University [en]
thesis.degree.level: masters [en]
thesis.degree.name: Master of Science [en]

Files

Original bundle
Name: 08242004_Dennis_Pereira_ETD.pdf
Size: 1.52 MB
Format: Adobe Portable Document Format