Geo-Locating Tweets with Latent Location Information

dc.contributor.author: Lee, Sunshin [en]
dc.contributor.committeechair: Fox, Edward A. [en]
dc.contributor.committeemember: Fan, Weiguo [en]
dc.contributor.committeemember: Lee, Hwajung [en]
dc.contributor.committeemember: Ehrich, Roger W. [en]
dc.contributor.committeemember: Sforza, Peter M. [en]
dc.contributor.department: Computer Science [en]
dc.date.accessioned: 2017-02-14T09:00:35Z [en]
dc.date.available: 2017-02-14T09:00:35Z [en]
dc.date.issued: 2017-02-13 [en]
dc.description.abstract: As part of our work on the NSF-funded Integrated Digital Event Archiving and Library (IDEAL) project and the Global Event and Trend Archive Research (GETAR) project, we collected over 1.4 billion tweets using over 1,000 keywords, key phrases, mentions, or hashtags, starting from 2009. Since many tweets talk about events (with useful location information), such as natural disasters, emergencies, and accidents, it is important to geo-locate those tweets whenever possible. Because of location ambiguity, finding a tweet's location is often challenging. Many distinct places have the same geoname, e.g., "Greenville" matches 50 different locations in the U.S.A. Frequently, explicit location information in tweets, such as mentioned geonames, is insufficient because tweets are often brief and incomplete. They carry only a small fraction of the full location information of an event due to the 140-character limit. Location indicative words (LIWs) may include latent location information. For example, "Water main break near White House" does not contain any geonames, but it is related to the location "1600 Pennsylvania Ave NW, Washington, DC 20500 USA", indicated by the key phrase "White House". To disambiguate tweet locations, we first extracted geospatial named entities (geonames) and predicted implicit state (e.g., Virginia or California) information from entities using machine learning algorithms including Support Vector Machine (SVM), Naive Bayes (NB), and Random Forest (RF). Implicit state information helps reduce ambiguity. We also studied how location information of events is expressed in tweets and how latent location indicative information can help to geo-locate tweets. We then used a machine learning (ML) approach to predict the implicit state using geonames and LIWs. We conducted experiments with tweets (e.g., about potholes), and found significant improvement in disambiguating tweet locations using an ML algorithm along with the Stanford NER.
Adding state information predicted by our classifiers increased the likelihood of finding the state-level geo-location unambiguously by up to 80%. We also studied over 6 million tweets (three mid-size and two large collections about water main breaks, sinkholes, potholes, car crashes, and car accidents), covering 17 months. We found that up to 91.1% of tweets have at least one type of location information (geo-coordinates or geonames) or LIWs. We also demonstrated that in most cases adding LIWs helps geo-locate tweets with less ambiguity using a geo-coding API. Finally, we conducted additional experiments with the five different tweet collections, and found significant improvement in disambiguating tweet locations using an ML approach with geonames and all LIWs that are present in tweet texts as features. [en]
dc.description.abstractgeneral: As part of our work on the projects “Integrated Digital Event Archiving and Library (IDEAL)” and “Global Event and Trend Archive Research (GETAR),” funded by NSF, we collected over 1.4 billion tweets using over 1,000 keywords, key phrases, mentions, or hashtags, starting from 2009. Since many tweets talk about events (with useful location information), such as natural disasters, emergencies, and accidents, it is important to geo-locate those tweets whenever possible. Because of location ambiguity, finding a tweet’s location is often challenging. Many distinct places have the same geoname, e.g., “Greenville” matches 50 different locations in the U.S.A. Frequently, explicit location information in tweets, such as mentioned geonames, is insufficient because tweets are often brief and incomplete. They carry only a small fraction of the full location information of an event due to the 140-character limit. Location indicative words (LIWs) may include latent location information. For example, “Water main break near White House” does not contain any geonames, but it is related to the location “1600 Pennsylvania Ave NW, Washington, DC 20500 USA,” indicated by the key phrase “White House.” To disambiguate tweet locations, we first extracted geonames, and then predicted implicit state (e.g., Virginia or California) information from entities using machine learning (ML) algorithms (wherein computers learn from examples what state is appropriate). Implicit state information helps reduce ambiguity. We also studied how location information of events is expressed in tweets and how latent location indicative information can help to geo-locate tweets. We then used an ML approach to predict the implicit state using geonames and LIWs. We conducted experiments with tweets (e.g., about potholes), and found significant improvement in disambiguating tweet locations using an ML algorithm along with the Stanford Named Entity Recognizer.
Adding state information predicted by our classifiers increased the ability to find the state-level geo-location unambiguously by up to 80%. We also studied over 6 million tweets (in three mid-size and two large collections, about water main breaks, sinkholes, potholes, car crashes, and car accidents), covering 17 months. We found that up to 91.1% of tweets have at least one type of location information (geo-coordinates or geonames) or LIWs. We also demonstrated that in most cases adding LIWs helps geo-locate tweets with less ambiguity using a geo-coding Web application (that converts addresses into geographic coordinates). Finally, we conducted additional experiments with the five different tweet collections, and found significant improvement in disambiguating tweet locations using an ML approach wherein the features considered are the geonames and all LIWs that are present in the tweet texts. [en]
dc.description.degree: Ph. D. [en]
dc.format.medium: ETD [en]
dc.identifier.other: vt_gsexam:9573 [en]
dc.identifier.uri: http://hdl.handle.net/10919/75022 [en]
dc.publisher: Virginia Tech [en]
dc.rights: In Copyright [en]
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/ [en]
dc.subject: Classification [en]
dc.subject: Events [en]
dc.subject: Geo-coding [en]
dc.subject: Geo-locating [en]
dc.subject: Geo-parsing [en]
dc.subject: Google Geo-coding API [en]
dc.subject: Hadoop cluster [en]
dc.subject: Integrated Digital Event Archiving and Library (IDEAL) [en]
dc.subject: Location Indicative Words (LIWs) [en]
dc.subject: Machine learning [en]
dc.subject: Naïve Bayes [en]
dc.subject: Named Entity Recognition [en]
dc.subject: Natural Language Processing [en]
dc.title: Geo-Locating Tweets with Latent Location Information [en]
dc.type: Dissertation [en]
thesis.degree.discipline: Computer Science and Applications [en]
thesis.degree.grantor: Virginia Polytechnic Institute and State University [en]
thesis.degree.level: doctoral [en]
thesis.degree.name: Ph. D. [en]
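
The abstract in this record describes predicting an implicit state from geonames and LIWs with classifiers such as Naive Bayes, then using that state to reduce geoname ambiguity. The following is only a minimal sketch of that general idea, not the dissertation's implementation: the training examples, the `GAZETTEER` subset, and the helper names `train_nb`, `predict_state`, and `disambiguate` are all hypothetical and chosen for illustration, and a hand-written multinomial Naive Bayes with add-one smoothing stands in for whatever tooling the author used.

```python
import math
from collections import Counter, defaultdict

# Toy, made-up training examples: (tokens from a tweet, state label).
# Tokens stand in for extracted geonames and location indicative words (LIWs).
TRAIN = [
    (["greenville", "clemson", "i-85"], "SC"),
    (["greenville", "spartanburg", "pothole"], "SC"),
    (["greenville", "ecu", "pitt", "county"], "NC"),
    (["greenville", "raleigh", "sinkhole"], "NC"),
    (["white", "house", "pennsylvania", "ave"], "DC"),
]

# Toy gazetteer: geoname -> candidate states (illustrative subset only).
GAZETTEER = {"greenville": ["SC", "NC", "TX", "MS"]}


def train_nb(examples):
    """Collect class counts, per-class token counts, and the vocabulary."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, state in examples:
        class_counts[state] += 1
        token_counts[state].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab


def predict_state(tokens, model):
    """Return the state with the highest log-posterior (add-one smoothing)."""
    class_counts, token_counts, vocab = model
    total = sum(class_counts.values())
    best_state, best_score = None, -math.inf
    for state, n in class_counts.items():
        score = math.log(n / total)
        denom = sum(token_counts[state].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # ignore tokens never seen in training
                score += math.log((token_counts[state][tok] + 1) / denom)
        if score > best_score:
            best_state, best_score = state, score
    return best_state


def disambiguate(geoname, tokens, model):
    """Keep only gazetteer candidates that match the predicted state."""
    candidates = GAZETTEER.get(geoname, [])
    state = predict_state(tokens, model)
    matches = [c for c in candidates if c == state]
    return matches or candidates  # fall back to all candidates if no match


model = train_nb(TRAIN)
print(disambiguate("greenville", ["greenville", "clemson", "pothole"], model))
```

In this toy setup the co-occurring tokens narrow "greenville" from four candidate states to one. When the predicted state is not among the gazetteer candidates, the sketch returns every candidate rather than guessing.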

Files

Original bundle
Name: Lee_S_D_2017.pdf
Size: 12.16 MB
Format: Adobe Portable Document Format