Browsing by Author "Wang, Xiangwen"
Now showing 1 - 3 of 3
- Heavy Tails and Anomalous Diffusion in Human Online Dynamics
  Wang, Xiangwen (Virginia Tech, 2019-02-28)

  In this dissertation, I extend the analysis of human dynamics to human movements in online activities. My work starts with a discussion of the human information-foraging process, based on three large collections of empirical search click-through logs collected in different time periods. Viewing the click-throughs on search-engine result pages as a random walk, I discuss quantities such as the distributions of step length and waiting time, as well as mean-squared displacements, correlations, and entropies. Notable differences between the logs reveal an increased efficiency of the search engines, which is found to be related to the vanishing of the heavy-tailed step-length characteristics in the newer logs and to a switch from superdiffusion to normal diffusion in the diffusive processes of the random walks. In the language of foraging, the newer logs indicate that online searches overwhelmingly yield local searches, whereas in the older logs the foraging processes combine local searches with power-law-distributed relocation phases. The investigation highlights the presence of intermittent search processes in online searches, where phases of local exploration are separated by power-law-distributed relocation jumps.

  In the second part of this dissertation, I focus on an in-depth analysis of online gambling behaviors. The collected empirical gambling logs reveal heavy-tailed statistics in a variety of quantities across different online gambling games. For example, when players are allowed to choose arbitrary bet values, the bet values follow log-normal distributions, whereas when players are restricted to using items as wagers, the distributions become truncated power laws. Viewing each player's net change of income as a random walk, the mean-squared displacement and the first-passage-time distribution of these net-income random walks both exhibit anomalous diffusion (see the sketch after this entry). In particular, in an online lottery game the mean-squared displacement presents a crossover from a superdiffusive to a normal diffusive regime, which is reproduced using simulations and explained analytically. This investigation also reveals the scaling characteristics and probability reweighting in the risk attitudes of online gamblers, which may help to interpret behaviors in economic systems. This work was supported by the US National Science Foundation through grants DMR-1205309 and DMR-1606814.
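  As an illustration of the random-walk framing above, here is a minimal Python sketch (my own, not code from the dissertation) that estimates a diffusion exponent from the mean-squared displacement of an ensemble of heavy-tailed walks; numpy is the only dependency, and the Pareto step model is an assumption chosen to produce superdiffusion.

  ```python
  import numpy as np

  def mean_squared_displacement(walks, max_lag):
      """Ensemble-averaged MSD of 1-D random walks.

      walks: array of shape (n_walkers, n_steps) holding cumulative
      positions, e.g. each player's running net income over its bets.
      """
      return np.array([np.mean((walks[:, lag:] - walks[:, :-lag]) ** 2)
                       for lag in range(1, max_lag + 1)])

  # Toy check: infinite-variance (Pareto) step lengths with random signs
  # yield superdiffusion, MSD ~ t^alpha with alpha > 1, whereas
  # finite-variance steps would give normal diffusion (alpha = 1).
  rng = np.random.default_rng(42)
  steps = rng.pareto(1.5, (500, 20000)) * rng.choice([-1.0, 1.0], (500, 20000))
  walks = np.cumsum(steps, axis=1)

  lags = np.arange(1, 101)
  msd = mean_squared_displacement(walks, max_lag=100)
  alpha = np.polyfit(np.log(lags), np.log(msd), 1)[0]
  print(f"estimated diffusion exponent: {alpha:.2f}")  # noticeably > 1
  ```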
- Photo-based Vendor Re-identification on Darknet Marketplaces using Deep Neural Networks
  Wang, Xiangwen (Virginia Tech, 2018)

  Darknet markets are online services behind Tor where cybercriminals trade illegal goods and stolen datasets. In recent years, security analysts and law enforcement have started to investigate darknet markets to study cybercriminal networks and predict future incidents. However, vendors in these markets often create multiple accounts (i.e., Sybils), making it challenging to infer the relationships between cybercriminals and to identify coordinated crimes. In this thesis, we present a novel approach to linking the multiple accounts of the same darknet vendors through photo analytics. The core idea is that darknet vendors often have to take their own product photos to prove possession of the illegal goods, and these photos can reveal their distinct photography styles. To fingerprint vendors, we construct a series of deep neural networks to model the photography styles, and we apply transfer learning to the model training, which allows us to accurately fingerprint vendors with a limited number of photos (a sketch of this idea follows below). We evaluate the system using real-world datasets from 3 large darknet markets (7,641 vendors and 197,682 product photos). A ground-truth evaluation shows that the system achieves an accuracy of 97.5%, outperforming existing stylometry-based methods in both accuracy and coverage. In addition, our system identifies previously unknown Sybil accounts within the same market (23) and across different markets (715 pairs). Further case studies reveal new insights into coordinated Sybil activities such as price manipulation, buyer scams, and product stocking and reselling.
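  A minimal sketch of the transfer-learning idea, written in PyTorch: the ResNet-18 backbone, the embedding rule, and the similarity threshold below are all assumptions for illustration, not the thesis's actual architecture.

  ```python
  # Hedged sketch of photo-based vendor fingerprinting via transfer learning.
  import torch
  import torch.nn as nn
  from torchvision import models

  def build_fingerprinter(num_vendors: int) -> nn.Module:
      # Start from an ImageNet-pretrained backbone and replace the head
      # with one output per known vendor account; only the head is trained,
      # which is what lets a limited number of photos suffice.
      model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
      for param in model.parameters():
          param.requires_grad = False  # freeze the generic visual features
      model.fc = nn.Linear(model.fc.in_features, num_vendors)
      return model

  @torch.no_grad()
  def account_embedding(model: nn.Module, photos: torch.Tensor) -> torch.Tensor:
      # Average the per-photo style scores over all of an account's photos
      # (photos: a batch of shape [n_photos, 3, H, W]).
      model.eval()
      return torch.softmax(model(photos), dim=1).mean(dim=0)

  def is_sybil_pair(e1, e2, threshold: float = 0.9) -> bool:
      # Assumed linking rule: two accounts whose embeddings are highly
      # similar are flagged as candidate Sybils for review.
      return torch.cosine_similarity(e1, e2, dim=0).item() > threshold
  ```

  Freezing the backbone and training only the final layer is a standard transfer-learning compromise when each class (here, each vendor) has few labeled photos.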
- Reducing Noise for IDEAL
  Wang, Xiangwen; Chandrasekar, Prashant (2015-05-12)

  The corpora for which we are building an information retrieval system consist of tweets and web pages (extracted from URL links included in the tweets) that were selected through rudimentary string matching provided by the Twitter API. As a result, the corpora are inherently noisy and contain a great deal of irrelevant information: non-English documents, off-topic articles, and noise within documents such as stop words, whitespace characters, non-alphanumeric characters, icons, broken links, HTML/XML tags, scripting code, and CSS style sheets.

  In our attempt to build an efficient information retrieval system for events through Solr, we are devising a matching system for the corpora by adding various facets and other properties to serve as dimensions for each document. These dimensions act as additional criteria that enhance Solr's matching and thereby its retrieval mechanism. They are metadata from classification, clustering, named entities, topic modeling, and social-graph scores implemented by other teams in the class. It is of utmost importance that each of these components is precise, and the quality of their work depends directly or indirectly on the quality of the data provided to them. Noisy data would skew the results, and each team would need to perform additional cleanup before executing its core functionality. It is our role and responsibility to remove irrelevant content, or "noisy data," from the corpora.

  For both tweets and web pages, we retained and cleaned entries written in English and discarded the rest. For tweets, we first extracted user handles, URLs, and hashtags; we then cleaned the tweet text by removing non-ASCII character sequences and standardized it using case folding, stemming, and stop-word removal. For the scope of this project, we cleaned only HTML-formatted web pages and entries in plain-text format; all other entries (videos, images, etc.) were discarded. For the valid entries, we extracted the URLs within the web pages to enumerate the outgoing links. Using the Python package readability, we removed advertisement, header, and footer content, and using another Python package, beautifulsoup4, we organized the remaining content and extracted the article text (a sketch of this pipeline follows below). We completed the cleanup by standardizing the text: removing non-ASCII characters, stemming, stop-word removal, and case folding. As a result, 14 tweet collections and 9 web-page collections were cleaned and indexed into Solr for retrieval.
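  The web-page portion of this pipeline might look roughly like the sketch below, which uses the readability-lxml and beautifulsoup4 packages named above plus NLTK for stemming and stop-word removal; the exact options and ordering used in the project are assumptions.

  ```python
  # Hedged sketch of the web-page cleaning steps (not the project's code).
  import re
  from readability import Document   # pip install readability-lxml
  from bs4 import BeautifulSoup      # pip install beautifulsoup4
  from nltk.corpus import stopwords  # requires nltk.download('stopwords')
  from nltk.stem import PorterStemmer

  STOP = set(stopwords.words("english"))
  STEMMER = PorterStemmer()

  def clean_webpage(html: str):
      # 1. Strip advertisements, headers, and footers; keep the article HTML.
      article_html = Document(html).summary()
      # 2. Extract the plain text and enumerate outgoing links.
      soup = BeautifulSoup(article_html, "html.parser")
      links = [a["href"] for a in soup.find_all("a", href=True)]
      text = soup.get_text(separator=" ")
      # 3. Standardize: drop non-ASCII, case-fold, remove stop words, stem.
      text = text.encode("ascii", errors="ignore").decode()
      tokens = [STEMMER.stem(t)
                for t in re.findall(r"[a-z0-9]+", text.lower())
                if t not in STOP]
      return " ".join(tokens), links
  ```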