Geo-Locating Tweets with Latent Location Information

Lee, Sunshin

dc.contributor.author	Lee, Sunshin	en
dc.contributor.committeechair	Fox, Edward A.	en
dc.contributor.committeemember	Fan, Weiguo	en
dc.contributor.committeemember	Lee, Hwajung	en
dc.contributor.committeemember	Ehrich, Roger W.	en
dc.contributor.committeemember	Sforza, Peter M.	en
dc.contributor.department	Computer Science	en
dc.date.accessioned	2017-02-14T09:00:35Z	en
dc.date.available	2017-02-14T09:00:35Z	en
dc.date.issued	2017-02-13	en
dc.description.abstract	As part of our work on the NSF-funded Integrated Digital Event Archiving and Library (IDEAL) project and the Global Event and Trend Archive Research (GETAR) project, we collected over 1.4 billion tweets using over 1,000 keywords, key phrases, mentions, or hashtags, starting from 2009. Since many tweets talk about events (with useful location information), such as natural disasters, emergencies, and accidents, it is important to geo-locate those tweets whenever possible. Due to possible location ambiguity, finding a tweet's location is often challenging. Many distinct places have the same geoname, e.g., "Greenville" matches 50 different locations in the U.S.A. Frequently, in tweets, explicit location information, like geonames mentioned, is insufficient, because tweets are often brief and incomplete. They contain only a small fraction of the full location information of an event because of the 140-character limit. Location indicative words (LIWs) may include latent location information; for example, "Water main break near White House" does not contain any geonames, but it is related to the location "1600 Pennsylvania Ave NW, Washington, DC 20500 USA" indicated by the key phrase 'White House'. To disambiguate tweet locations, we first extracted geospatial named entities (geonames) and predicted implicit state (e.g., Virginia or California) information from entities using machine learning algorithms including Support Vector Machine (SVM), Naive Bayes (NB), and Random Forest (RF). Implicit state information helps reduce ambiguity. We also studied how location information of events is expressed in tweets and how latent location indicative information can help to geo-locate tweets. We then used a machine learning (ML) approach to predict the implicit state using geonames and LIWs. We conducted experiments with tweets (e.g., about potholes), and found significant improvement in disambiguating tweet locations using an ML algorithm along with the Stanford NER.
Adding state information predicted by our classifiers increased the ability to find the state-level geo-location unambiguously by up to 80%. We also studied over 6 million tweets (three mid-size and two big collections about water main breaks, sinkholes, potholes, car crashes, and car accidents), covering 17 months. We found that up to 91.1% of tweets have at least one type of location information (geo-coordinates or geonames), or LIWs. We also demonstrated that in most cases adding LIWs helps geo-locate tweets with less ambiguity using a geo-coding API. Finally, we conducted additional experiments with the five different tweet collections, and found significant improvement in disambiguating tweet locations using an ML approach with geonames and all LIWs that are present in tweet texts as features.	en
dc.description.abstractgeneral	As part of our work on the projects “Integrated Digital Event Archiving and Library (IDEAL)” and “Global Event and Trend Archive Research (GETAR),” funded by NSF, we collected over 1.4 billion tweets using over 1,000 keywords, key phrases, mentions, or hashtags, starting from 2009. Since many tweets talk about events (with useful location information), such as natural disasters, emergencies, and accidents, it is important to geo-locate those tweets whenever possible. Due to possible location ambiguity, finding a tweet’s location is often challenging. Many distinct places have the same geoname, e.g., “Greenville” matches 50 different locations in the U.S.A. Frequently, in tweets, explicit location information, like geonames mentioned, is insufficient, because tweets are often brief and incomplete. They contain only a small fraction of the full location information of an event because of the 140-character limit. Location indicative words (LIWs) may include latent location information; for example, “Water main break near White House” does not contain any geonames, but it is related to the location “1600 Pennsylvania Ave NW, Washington, DC 20500 USA” indicated by the key phrase ‘White House’. To disambiguate tweet locations, we first extracted geonames, and then predicted implicit state (e.g., Virginia or California) information from entities using machine learning (ML) algorithms (wherein computers learn from examples what state is appropriate). Implicit state information helps reduce ambiguity. We also studied how location information of events is expressed in tweets and how latent location indicative information can help to geo-locate tweets. We then used an ML approach to predict the implicit state using geonames and LIWs. We conducted experiments with tweets (e.g., about potholes), and found significant improvement in disambiguating tweet locations using an ML algorithm along with the Stanford Named Entity Recognizer.
Adding state information predicted by our classifiers increased the ability to find the state-level geo-location unambiguously by up to 80%. We also studied over 6 million tweets (in three mid-size and two big collections, about water main breaks, sinkholes, potholes, car crashes, and car accidents), covering 17 months. We found that up to 91.1% of tweets have at least one type of location information (geo-coordinates or geonames), or LIWs. We also demonstrated that in most cases adding LIWs helps geo-locate tweets with less ambiguity using a geo-coding Web application (that converts addresses into geographic coordinates). Finally, we conducted additional experiments with the five different tweet collections, and found significant improvement in disambiguating tweet locations using an ML approach wherein the features considered are the geonames and all LIWs that are present in the tweet texts.	en
dc.description.degree	Ph. D.	en
dc.format.medium	ETD	en
dc.identifier.other	vt_gsexam:9573	en
dc.identifier.uri	http://hdl.handle.net/10919/75022	en
dc.publisher	Virginia Tech	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	Classification	en
dc.subject	Events	en
dc.subject	Geo-coding	en
dc.subject	Geo-locating	en
dc.subject	Geo-parsing	en
dc.subject	Google Geo-coding API	en
dc.subject	Hadoop cluster	en
dc.subject	Integrated Digital Event Archiving and Library (IDEAL)	en
dc.subject	Location Indicative Words (LIWs)	en
dc.subject	Machine learning	en
dc.subject	Naïve Bayes	en
dc.subject	Named Entity Recognition	en
dc.subject	Natural Language Processing	en
dc.title	Geo-Locating Tweets with Latent Location Information	en
dc.type	Dissertation	en
thesis.degree.discipline	Computer Science and Applications	en
thesis.degree.grantor	Virginia Polytechnic Institute and State University	en
thesis.degree.level	doctoral	en
thesis.degree.name	Ph. D.	en

Files

Original bundle

Name: Lee_S_D_2017.pdf
Size: 12.16 MB
Format: Adobe Portable Document Format

Collections

Doctoral Dissertations