Browsing by Author "Miller, Chreston"
- Big Data Text Summarization for the NeverAgain Movement. Arora, Anuj; Miller, Chreston; Fan, Jixiang; Liu, Shuai; Han, Yi (Virginia Tech, 2018-12-10). When you are browsing social media websites such as Twitter and Facebook, have you ever seen hashtags like #NeverAgain and #EnoughIsEnough? Do you know what they mean? Never Again is an American student-led political movement for gun control to prevent gun violence. In the United States, gun control has long been debated. According to data from the Gun Violence Archive (http://www.shootingtracker.com/), the U.S. saw a total of 346 mass shootings in 2017. Supporters claim that the proliferation of firearms directly fuels social problems such as robbery, sexual crime, and theft, while others believe gun culture is an integral part of their freedom. For the Never Again gun control movement, we wanted to generate a human-readable summary, based on deep learning methods, so that one can study incidents of gun violence that shocked the world, such as the 2017 Las Vegas shooting, and assess the impact of gun proliferation. Our project includes three steps: pre-processing, topic modeling, and abstractive summarization using deep learning. We began with a large collection of news articles associated with the #NeverAgain movement. The raw news articles needed to be pre-processed in multiple ways. An ArchiveSpark script was used to convert the WARC and CDX files to readable, parseable JSON. However, we found that at least forty percent of the data was noise, so a series of restrictive word filters was applied to remove it. After noise removal, we identified the most frequent words to get a preliminary sense of whether we were filtering noise properly. We used the Natural Language Toolkit's (NLTK) Named Entity chunker to generate named entities, which are phrases that form important nouns (people, places, organizations, etc.) in a sentence. For topic modeling, we classified sentences into different buckets, or topics, which identified distinct themes in the collection. Latent Dirichlet Allocation (LDA), the topic-modeling algorithm we used, does not take the normalized and tokenized word corpus directly; each article in the collection had to be converted into a vector. We chose the Bag of Words (BOW) approach, a simplifying representation used in natural language processing and information retrieval in which text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. Topic modeling also requires choosing the number of topics in advance, which means one must estimate how many topics are present in a collection; there is no foolproof way to replace human judgment in weaving keywords into topics with semantic meaning. To address this, we used the coherence score, which attempts to mimic the human readability of a topic: the higher the coherence score, the more "coherent" the topics are considered (a code sketch of this step follows this entry). LDA itself is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. Compared with some other algorithms, LDA is probabilistic, which makes it better at handling topic mixtures in different documents; it also identifies topics more coherently, whereas topics from other algorithms tend to be more disjoint. Once we had our topics (three in total), we filtered the article collection by topic, yielding three distinct collections of articles to which we could apply an abstractive summarization algorithm. We chose a Pointer-Generator Network (PGN), a deep learning approach designed for abstractive summarization, created a summary for each identified topic, and performed post-processing to connect the three (related) topic summaries into one summary that flowed. The result reflected the main themes of the article collection and informed the reader of its contents in less than two pages.
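As a concrete illustration of the topic-modeling step described above, here is a minimal sketch of a BOW-plus-LDA pipeline with coherence scoring. It uses gensim (one common implementation; the report does not name the library used), and the tokenized articles and topic-count range are illustrative assumptions, not the project's data.

```python
# Sketch only: gensim is assumed; the toy corpus and topic range are invented.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

# Stand-in for the normalized, tokenized news articles.
tokenized_articles = [
    ["gun", "control", "movement", "student", "protest"],
    ["shooting", "vegas", "victims", "police", "gun"],
    ["senate", "legislation", "firearms", "vote", "control"],
]

# Dictionary plus Bag-of-Words vectorization: each article becomes a list of
# (token_id, count) pairs, discarding grammar and word order but keeping counts.
dictionary = Dictionary(tokenized_articles)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_articles]

# Sweep candidate topic counts and keep the model with the highest coherence,
# the stand-in for human judgment described in the abstract.
best_score, best_model = float("-inf"), None
for k in range(2, 5):
    lda = LdaModel(bow_corpus, num_topics=k, id2word=dictionary, passes=10)
    score = CoherenceModel(model=lda, texts=tokenized_articles,
                           dictionary=dictionary,
                           coherence="c_v").get_coherence()
    if score > best_score:
        best_score, best_model = score, lda

print(f"Best coherence: {best_score:.3f}")
for topic_id, words in best_model.show_topics(formatted=True):
    print(topic_id, words)
```

In the actual pipeline, the topics discovered this way were used to partition the article collection into per-topic subsets before abstractive summarization.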
- Comparative Study and Expansion of Metadata Standards for Historic Fashion Collections. Ng, Wen Nie; Smith-Glaviana, Dina; Miller, Chreston; McIrvin, Caleb; Westblade, Julia (2023-06-25). The objective of this poster is to enhance the metadata standards applied to historic fashion collections by expanding the controlled vocabulary and metadata elements to encompass the Costume Core and rectify its inadequacies. Several methods are employed toward this goal, including incorporating new descriptive terms to enable precise description of artifacts during the re-cataloging of a university fashion collection in Costume Core. New descriptors are also identified through word embeddings, using pre-trained natural language processing models to extract candidate terms from a conceptual latent space (a sketch of this technique follows this entry). Finally, crowdsourcing through surveys gathers insights into how metadata is used to describe dress artifacts. The presentation also previews the Model Output Confirmative Helper Application, which streamlines the review process, and highlights the metadata standards commonly used for historic fashion, sample metadata supplied by respondents, and a partial list of potential metadata to be appended to the Costume Core. As a result of the project, the expanded Costume Core describes fashion collections more comprehensively and can be widely adopted by the fashion industry, promoting consistent metadata and increasing metadata interoperability.
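The word-embedding technique mentioned above can be sketched roughly as follows: pre-trained vectors are queried for nearest neighbors of known vocabulary terms, and the neighbors become candidate descriptors for curator review. This is a hedged illustration; the poster does not specify which embedding model was used, and the seed terms here are invented.

```python
# Sketch only: the embedding model choice and seed terms are assumptions.
import gensim.downloader as api

# One plausible set of pre-trained vectors; any pre-trained NLP model with a
# word-vector interface would serve the same role.
vectors = api.load("glove-wiki-gigaword-100")

# Hypothetical terms already present in the controlled vocabulary.
seed_terms = ["bodice", "hem", "collar"]

for term in seed_terms:
    if term in vectors:
        # Nearest neighbors in the latent space become candidate descriptors,
        # to be reviewed (e.g., via the confirmation helper application)
        # before being appended to the Costume Core.
        neighbors = vectors.most_similar(term, topn=5)
        print(term, "->", [word for word, _ in neighbors])
```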
- Creating Transparent Search and Discovery Algorithms. Miller, Chreston (2019-04-12)
- Information Scraps in the Smartphone Era. Ellis, William Thomas (Virginia Tech, 2016-06-19). How people create and use information scraps, the small informal messages that people write to themselves to help them complete a task or remember something, has changed rapidly in the age of mobile computing. As recently as 2008, information scraps had continued to resist technological support. Since then, however, people have adopted mobile connected devices at a rate unimagined in the pre-smartphone era, and developers have, in turn, created a varied and growing body of smartphone software that supports many common information scrap use cases. In this thesis, we describe our research into how and why people have adopted smartphone technology to serve their information scrap needs. The results of our survey show broad adoption of smartphones for many common information scrap tasks, particularly ones involving prospective memory. In addition, the results of our diary studies show that mobile contexts or locations are highly correlated with people choosing to use smartphones to record information scraps. Our analysis of the diary study data also provides fresh understanding of the information scrap lifecycle and how mobile digital technology affects it. We find that people's smartphone information scraps tend toward automatic archival, and that their information scraps in general tend toward substantial role overlap regardless of medium. We use these findings to formulate a new information scrap lifecycle that is inclusive of mobile technology. These insights will help mobile technology creators better support information scraps, which, in turn, will allow users to enjoy the huge benefits of digital technology in their information scrap tasks.
- A Novel Approach to Modeling Contextual Privacy Preference and Practice. Radics, Peter Jozsef (Virginia Tech, 2016-09-27). We are living in a time of fundamental changes in the dynamic between privacy and surveillance. The ubiquity of information technology has changed the ways in which we interact, empowering us through new venues of communication and social intimacy. At the same time, it exposes us to the prying eyes of others, in the shape of governments, companies, or even fellow humans. This creates a challenging environment for the design of 'privacy-aware' applications, exacerbated by a disconnect between abstract knowledge of privacy and the concrete information requirements of privacy design frameworks. In this work, we present a novel approach to modeling contextual privacy preference and practice. The process guides a 'privacy analyst' through the steps of evaluating, choosing, and deploying appropriate data collection strategies; verifying and validating the collected data; and systematically transforming the dense, unstructured data into a structured domain model. We introduce the Privacy Domain Modeling Language (PDML) to address the representational needs of privacy domain models. Making use of the structure of PDML, we explore the applicability of the information-theoretic concept of entropy to determine the completeness of the resulting model (see the sketch following this entry). We evaluate the utility of the process through its application to the evaluation and re-design of a web application for managing students' directory information and education records. Through this case study, we demonstrate the potential for automating the process through the Privacy Analyst Work eNvironment (PAWN) and show the process's seamless integration with existing privacy design frameworks. Finally, we provide evidence for the value of using entropy to determine model completeness and provide an outlook on future work.
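As a rough illustration of the entropy idea mentioned in the abstract, the sketch below computes Shannon entropy over the empirical distribution of model-element types. The element stream and the completeness intuition are illustrative assumptions, not PDML's actual formulation.

```python
# Sketch only: the element types and completeness heuristic are assumptions.
import math
from collections import Counter

def shannon_entropy(observations):
    """H = -sum(p * log2(p)) over the empirical distribution of observations."""
    counts = Counter(observations)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical stream of domain-model element types extracted from collected data.
elements = ["actor", "actor", "data_item", "purpose", "actor", "data_item"]

# Intuition: when adding more collected data no longer shifts the entropy of
# the element distribution, the model has plausibly stopped growing, i.e.,
# it is approaching completeness.
print(f"H = {shannon_entropy(elements):.3f} bits")
```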
- On Utilization of Contributory Storage in Desktop Grids. Miller, Chreston; Butler, Patrick; Shah, Ankur; Butt, Ali R. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2007). The availability of desktop grids and shared computing platforms has popularized the use of contributory resources, such as desktops, as computing substrates for a variety of applications. However, addressing the exponentially growing storage demands of applications, especially in a contributory environment, remains a challenging research problem. In this report, we propose a transparent distributed storage system that harnesses the storage contributed by grid participants, arranged in a peer-to-peer network, to yield a scalable, robust, and self-organizing system. The novelty of our work lies in (i) design simplicity to facilitate actual use; (ii) support for easy integration with grid platforms; (iii) ingenious use of striping and error-coding techniques to support very large data files (illustrated in the sketch after this entry); and (iv) the use of multicast techniques for data replication. Experimental results through simulations and an actual implementation show that our system can provide reliable and efficient storage with large-file support for desktop grid applications.
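The striping and error-coding idea can be illustrated with a toy sketch: a file is split into fixed-size blocks for distribution across peers, and a parity block allows recovery of any single lost block. Simple XOR parity stands in here for whatever code the system actually uses, and the block size is an arbitrary choice for the example.

```python
# Sketch only: XOR parity and a tiny block size stand in for the real scheme.
BLOCK_SIZE = 4  # bytes; a real system would use far larger blocks

def stripe_with_parity(data: bytes):
    """Split data into fixed-size blocks plus one XOR parity block."""
    data += b"\x00" * ((-len(data)) % BLOCK_SIZE)  # pad to a whole block
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    parity = bytearray(BLOCK_SIZE)
    for block in blocks:                 # parity = byte-wise XOR of all blocks
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return blocks, bytes(parity)

def recover(blocks, parity, lost_index):
    """Reconstruct a single lost block by XOR-ing parity with the survivors."""
    out = bytearray(parity)
    for j, block in enumerate(blocks):
        if j != lost_index:
            for i, byte in enumerate(block):
                out[i] ^= byte
    return bytes(out)

blocks, parity = stripe_with_parity(b"desktop grid")
assert recover(blocks, parity, lost_index=1) == blocks[1]  # survives one loss
```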
- The Open Science of Deep Learning: Three Case Studies. Miller, Chreston; Lahne, Jacob; Hamilton, Leah (2022-03). The open science movement, which prioritizes the open availability of research data and methods for public scrutiny and replication, includes practices like providing code implementing described algorithms in openly available publications. An area of research in which open-science principles may have particularly high impact is deep learning, where researchers have developed a plethora of algorithms to solve complex and challenging problems, but where others may have difficulty replicating results and applying these algorithms to other problems. In response, some researchers have begun to open up deep-learning research by making their code and resources (e.g., datasets and/or pre-trained models) available to the current and future research community. This presentation describes three case studies in deep learning where the openly available resources differed and investigates the impact on each project and its outcome. This provides a venue for discussion of successes, lessons learned, and recommendations for future researchers facing similar situations, especially as deep learning increasingly becomes an important tool across disciplines. In the first case study, we present a workflow for text summarization, based on thousands of news articles. The outcome, generalizable to many situations, is a tool that can concisely report key facts and events from the articles. In the second case study, we describe the development of an Optical Character Recognition tool for archival research of physical typed notecards, in this case documenting an important, curated collection of thousands of items of clothing (a minimal OCR sketch follows this entry). In the last case study, we describe the workflow for applying common Natural Language Processing tools to a novel task: identifying descriptive language for whiskies from thousands of free-form text reviews. These case studies resulted in working solutions to their respective, challenging problems because the researchers embraced the concept of open science.
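For the OCR case study, the core recognition step can be sketched as below. pytesseract is one plausible openly available tool for this; the presentation does not name the specific library used, and the file name is hypothetical.

```python
# Sketch only: library choice and file name are assumptions.
from PIL import Image
import pytesseract

# Hypothetical scan of one typed notecard from the clothing collection.
image = Image.open("notecard_0001.png")

# Tesseract returns the recognized text; downstream steps would parse fields
# (e.g., accession number, garment description) out of this raw string.
text = pytesseract.image_to_string(image)
print(text)
```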
- The Open Science of Deep Learning: Three Case Studies. Miller, Chreston; Hamilton, Leah; Lahne, Jacob (2023-02-15). Objective: An area of research in which open science may have particularly high impact is deep learning (DL), where researchers have developed many algorithms to solve challenging problems, but others may have difficulty replicating results and applying these algorithms. In response, some researchers have begun to open up DL research by making their resources (e.g., code, datasets, and/or pre-trained models) available to the research community. This article describes three case studies in DL where openly available resources were used; we investigate the impact on the projects and their outcomes, and make recommendations for what to focus on when making DL resources available. Methods: Each case study represents a single research project using openly available DL resources. The process and progress of each case study were recorded, along with aspects such as the approaches taken, the documentation of the openly available resources, and the researchers' experience with those resources. The case studies are in multiple-document text summarization, optical character recognition (OCR) of thousands of text documents, and identifying unique language descriptors for sensory science. Results: Each case study was a success but had its own hurdles. Key takeaways: well-structured and clear documentation, code examples and demos, and pre-trained models were at the core of the success of these case studies. Conclusions: Openly available DL resources were core to the success of our case studies. The authors encourage DL researchers to continue to make their data, code, and pre-trained models openly available where appropriate.
- Sensory Descriptor Analysis of Whisky Lexicons through the Use of Deep Learning. Miller, Chreston; Hamilton, Leah; Lahne, Jacob (MDPI, 2021-07-14). This paper is concerned with extracting relevant terms from a text corpus on whisk(e)y. "Relevant" terms are usually contextually defined in their domain of use. Arguably, every domain has a specialized vocabulary used for describing things. For example, the field of Sensory Science, a sub-field of Food Science, investigates human responses to food products and differentiates "descriptive" terms for flavors from "ordinary", non-descriptive language. Within the field, descriptors are generated through Descriptive Analysis, a method wherein a human panel of experts tastes multiple food products and defines descriptors. This process is both time-consuming and expensive. However, one could leverage existing data to identify and build a flavor language automatically. For example, there are thousands of professional and semi-professional reviews of whisk(e)y published on the internet, providing abundant descriptors interspersed with non-descriptive language. The aim, then, is to automatically identify descriptive terms in unstructured reviews for later use in product flavor characterization. We created two systems to perform this task. The first is an interactive visual tool for tagging examples of descriptive terms from thousands of whisky reviews. This creates a training dataset that we use for transfer learning with GloVe word embeddings and a Long Short-Term Memory (LSTM) deep learning architecture (a minimal sketch of this architecture follows this entry). The result is a model that can accurately identify descriptors within a corpus of whisky review texts, with a train/test accuracy of 99% and precision, recall, and F1-scores of 0.99. We tested for overfitting by comparing the training and validation loss for divergence. Our results show that the language structure of descriptive terms can be programmatically learned.
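A minimal sketch of the described architecture follows: frozen pre-trained GloVe embeddings feed an LSTM that emits one descriptive/non-descriptive label per token. All shapes, hyperparameters, and the random stand-in for the GloVe matrix are assumptions for illustration; the paper's actual configuration may differ.

```python
# Sketch only: shapes, hyperparameters, and dummy data are assumptions.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 5000, 100, 60

# Stand-in for a GloVe matrix: row i would hold the pre-trained vector for
# token i in a real run.
glove_matrix = np.random.rand(VOCAB_SIZE, EMBED_DIM).astype("float32")

model = models.Sequential([
    # Frozen pre-trained embeddings: the transfer-learning step.
    layers.Embedding(
        VOCAB_SIZE, EMBED_DIM,
        embeddings_initializer=tf.keras.initializers.Constant(glove_matrix),
        trainable=False),
    # return_sequences=True yields one output per token, so the model can tag
    # each word in a review as descriptive (1) or not (0).
    layers.LSTM(128, return_sequences=True),
    layers.TimeDistributed(layers.Dense(1, activation="sigmoid")),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Dummy batch standing in for the tagged training set built with the
# interactive visual tool.
x = np.random.randint(0, VOCAB_SIZE, size=(8, MAX_LEN))
y = np.random.randint(0, 2, size=(8, MAX_LEN, 1)).astype("float32")
model.fit(x, y, epochs=1, verbose=0)
```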
- Structural Model Discovery in Temporal Event Data Streams. Miller, Chreston (Virginia Tech, 2013-04-23). This dissertation presents a unique approach to human behavior analysis based on expert guidance and intervention through interactive construction and modification of behavior models. Our focus is to introduce the research area of behavior analysis, the challenges faced by this field, and current approaches, and to present a new analysis approach: Interactive Relevance Search and Modeling (IRSM). More intelligent ways of conducting data analysis have been explored in recent years. Machine learning and data mining systems that utilize pattern classification and discovery in non-textual data promise to bring new generations of powerful "crawlers" for knowledge discovery, e.g., face detection and crowd surveillance. Many aspects of data can be captured by such systems, e.g., temporal information and extractable visual information (color, contrast, shape, etc.). However, these captured aspects may not uncover all salient information in the data or provide adequate models/patterns of phenomena of interest. This is a challenging problem for social scientists who are trying to identify high-level, conceptual patterns of human behavior from observational data (e.g., media streams). The presented research addresses how social scientists may derive patterns of human behavior captured in media streams. Currently, media streams are being segmented into sequences of events describing the actions captured in the streams, such as the interactions among humans. This segmentation creates a challenging data space to search, characterized by non-numerical, temporal, descriptive data, e.g., Person A walks up to Person B at time T (a small illustrative sketch of such event data follows this entry). This dissertation presents an approach that allows one to interactively search, identify, and discover temporal behavior patterns within such a data space. The research therefore addresses supporting exploration and discovery in behavior analysis through a formalized method of assisted exploration. The model evolution presented supports refining the observer's behavior models into representations of their understanding. The benefit of the new approach is shown through experiments on its identification accuracy and through work with fellow researchers to verify the approach's legitimacy in analyzing their data.
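To make the flavor of this data space concrete, here is a small, hypothetical sketch of descriptive, timestamped events and a naive search for a two-step temporal pattern. It only illustrates the kind of data IRSM operates over, not the dissertation's algorithm.

```python
# Sketch only: the event schema and pattern query are invented for illustration.
from dataclasses import dataclass

@dataclass
class Event:
    time: float   # seconds into the media stream
    actor: str
    action: str
    target: str

stream = [
    Event(1.0, "PersonA", "walks_up_to", "PersonB"),
    Event(2.5, "PersonB", "looks_at", "PersonA"),
    Event(3.0, "PersonA", "speaks_to", "PersonB"),
]

def find_pattern(stream, first, then, within):
    """Naive search: `first` followed by `then` within `within` seconds,
    performed by the same actor."""
    return [(e1, e2)
            for e1 in stream if e1.action == first
            for e2 in stream
            if e2.action == then and e2.actor == e1.actor
            and 0 < e2.time - e1.time <= within]

# "Person A walks up to Person B, then speaks to them within 5 seconds."
print(find_pattern(stream, "walks_up_to", "speaks_to", within=5.0))
```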
- Translating Sensory Perceptions: Existing and Emerging Methods of Collecting and Analyzing Flavor Data. Hamilton, Leah Marie (Virginia Tech, 2022-04-28). Food flavor is hugely important in motivating food choice and eating behavior. Unfortunately for research and communication about flavor, many languages' flavor vocabularies are notoriously variable and must be aligned, either before data collection using training or after the fact by researchers. This dissertation demonstrates one example of each approach (conventional descriptive analysis (DA) and labeled free sorting, respectively) and compares their use to emerging computational natural language processing (NLP) methods that use large volumes of existing text data. Rapid methods that align flavor vocabulary after data collection are most similar to NLP, and with the development or improvement of some strategic tools, NLP is well poised to further accelerate the analysis of existing text data or unaligned vocabularies. DA, while much more time-consuming, ensures that the researchers, tasters, and readers share a definition of any flavor words used, an advantage that all existing rapid methods lack. With a greater understanding of how this differs from everyday communication about flavor, future researchers may be able to replicate this aspect of DA in novel descriptive methods. This dissertation investigates the flavors of specialty beverages, specifically American whiskeys and cold brew coffees. American whiskeys differ from other whiskeys in their raw materials and aging practices, with the aging practices primarily setting them apart. While the most expensive American whiskeys are similar to Scotches and dominated by oaky, sultana-like flavors, only very rich consumers desire these flavors; chocolate and caramel are the most widely preferred by most consumers. Degree of roasting has more impact on cold brew coffee flavor than the origin of the beans, and the coffee consumers surveyed here preferred dark roast to light roast cold brews.
- Virginia Tech University Libraries’ Data Service Pilot with the College of Engineering. Miller, Chreston; Ogier, Andrea; Coleman, Shane; Petters, Jonathan L. (2017-06-01). This report describes the results of our data needs assessment of the College of Engineering (CoE). With the growing focus on data, the needs of researchers can be unclear if not studied at the source. This led to the creation of this study, in which the investigators interviewed faculty from multiple departments within the CoE. This focused study provided information and insight into daily research practices related to data. The results identified several categories and trends of data-related needs, along with recommendations for how to better support the needs identified going forward.