CS4624: Multimedia, Hypertext, and Information Access
This collection contains the final projects of the students in the course Computer Science 4624: Multimedia, Hypertext, and Information Access, at Virginia Tech.
This course, taught by Professor Ed Fox, is part of the Human-Computer Interaction track, the Knowledge, Information, and Data track, and the Media/Creative Computing track. The curriculum introduces the architectures, concepts, data, hardware, methods, models, software, standards, structures, technologies, and issues involved with: networked multimedia (e.g., image, audio, video) information, access and systems; hypertext and hypermedia; electronic publishing; virtual reality. Coverage includes text processing, search, retrieval, browsing, time-based performance, synchronization, quality of service, video conferencing and authoring.
Browsing CS4624: Multimedia, Hypertext, and Information Access by Content Type "Dataset"
- Autism Support Portal. Quayum, Sib; Galliher, Ryan; Nagies, Kenneth; Ritchie, Ayumi (Virginia Tech, 2018-05-08). The Autism Support Portal project involves the creation of a portal site that helps users find information they need about autism. The primary goal of the project is to help users quickly find credible information for their specific need. With the amount of information available online, it can be hard for those interested in autism to find information that is not only credible but also useful and updated to reflect current research. The site needs to be easy to use both for the users and for the future administrators of the site. The site also needs to help guide people towards reliable resources while potentially exposing users to new resources. To ensure that our project meets the needs of our potential users, the project was divided into different phases involving data collection, research, design, and implementation. To gather data for our project, we used resources such as the Virginia Tech Center for Autism Research and their connections to send out anonymous surveys to some of our potential users. We asked several questions pertaining to their interests in the site, what they needed from the site, and what resources were useful to them. This data allowed us to implement a site as specific to the user needs as possible while also giving us other resources from which to collect credible information. In addition, Dr. Scarpa provided many other resources that allowed us to address some of the needs of users, with other resources allowing this project to focus entirely on the implementation of our search engine and the guiding of our users towards effective answers, solutions, and resources. Upon entering the site, users have direct access to the search and are provided with search tips and external resources to help them. The site is set up entirely using WordPress.org. WordPress was chosen as the content management system (CMS) for the site because it is very easy to use and allows administrators to do a lot for the site without the need for extensive technical knowledge. The site needs to be very easy to modify and change after its initial setup so that those who work on it at the Virginia Tech Center for Autism Research can do so quickly. However, using solely WordPress and its plugins created a variety of new obstacles stemming from the different uses of different plugins. To save time and money, research needed to be done on several different plugins to find the ones that not only met the needs of the site but were also affordable. Even with these obstacles, using WordPress not only allows for easier creation and maintenance, but also easy modification of the site if additional features are wanted or needed. The design of the site allows users to find necessary information very quickly through alphabetically sorted lists that expose the user to terms that may have been previously unknown. One of the problems with researching autism is asking the right questions. For example, a child with a special need such as autism needs an IEP, or individualized education program, which requires a specific search for an IEP. When a user explores education information, the user also needs to be shown specifics such as IEPs. This also illustrates the need for our site to be easily modifiable, as a change in law or name would require someone to change the resource in the site. Using the data and implementation techniques discussed, the resulting portal is composed of help and resource pages as well as a refined search that links questions to reliable answers. In addition, the site is designed such that any user without prior technical experience can use the site and adjust the sites that are searched, as well as any other information within the site that needs to change.
- Database Creation and Information Extraction from ETDs for CRA-E (2013-05-18). This project was in support of the educational activities of the Computing Research Association (CRA-E). The main goal of the project was to collect data associated with electronic theses and dissertations (ETDs) to allow determination of why graduate students in computing go into computing research. The deliverables include a database of the data extracted from the ETDs analyzed and a framework for machine learning and manual approaches to this data extraction. To accomplish these objectives, ETDs from North Carolina State University (NCSU), Florida State University (FSU), Auburn University (AU), Wake Forest University (WFU), and Virginia Tech (VT) were analyzed and the results were inserted into the database. The Extensible Markup Language (XML) was chosen as the structuring format for the data extracted from ETDs, and a tag structure was created using biographical, educational, and institutional data from each ETD. Some of the tags included: author name, title of the paper, year published, undergraduate institution of the author, etc. XML was chosen because of its prevalence in the ETD field, its structural properties, and its ease of use. These tags were used to create the attributes for each entry in the database in Microsoft Access. Access was chosen mostly because of convenience and easy porting of tags into the system. However, the database could be moved into another system quite easily. Challenges that arose included missing data or insufficient information in various areas. The second deliverable took the form of instructions (p. 4 in the report) that could be given to an Amazon Mechanical Turk user on how to extract information. These instructions were created and provided in order to increase speed and decrease errors in manual data extraction. It was found that the basic structure of most ETDs is similar and is normally in this approximate order (depending on institution of origin): title page, table of contents, abstract, actual content, biography, acknowledgements, and resume (not normally present). Of these, all but the table of contents and the paper itself contain information required for the database. The instructions provide the most common locations for each tag/attribute and alternate locations (if any were found). They also instruct the Mechanical Turk user what to do in case of missing data for each attribute.
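To make the tag structure concrete, here is a minimal sketch of how one ETD record of this kind could be represented and read in XML; the tag names and values are illustrative guesses, not the project's actual schema.

```python
# Minimal sketch of parsing one hypothetical ETD metadata record; the tag names
# below are assumptions, not the project's actual tag structure.
import xml.etree.ElementTree as ET

record = """
<etd>
  <author>Jane Doe</author>
  <title>A Study of Information Retrieval</title>
  <year>2012</year>
  <undergraduate_institution>Virginia Tech</undergraduate_institution>
</etd>
"""

root = ET.fromstring(record)
# Flatten the child elements into one row, ready to insert as database attributes.
row = {child.tag: (child.text or "").strip() for child in root}
print(row)
```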
- English Wikipedia on Hadoop Cluster. Stulga, Steven (2016-05-04). To develop and test big data software, one thing that is required is a big dataset. The full English Wikipedia dataset would serve well for testing and benchmarking purposes. Loading this dataset onto a system such as an Apache Hadoop cluster, and indexing it into Apache Solr, would allow researchers and developers at Virginia Tech to benchmark configurations and big data analytics software. This project is about importing the full English Wikipedia into an Apache Hadoop cluster and indexing it with Apache Solr so that it can be searched. A prototype was designed and implemented. A small subset of the Wikipedia data was unpacked and imported into Apache Hadoop's HDFS. The entire Wikipedia dataset was also downloaded onto a Hadoop cluster at Virginia Tech. A portion of the dataset was converted from XML to Avro and imported into HDFS on the cluster. Future work would be to finish unpacking the full dataset and repeat the steps carried out with the prototype system for all of Wikipedia. Unpacking the remaining data, converting it to Avro, and importing it into HDFS can be done with minimal adjustments to the script written for this job. Run continuously, this job would take an estimated 30 hours to complete.
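The XML-to-Avro conversion step can be illustrated with a short, hedged sketch; the two-field Avro schema and the file names below are assumptions rather than the project's actual script.

```python
# Illustrative sketch (not the project's script) of streaming Wikipedia page
# records out of an XML dump and writing them as Avro; schema fields are assumed.
import xml.etree.ElementTree as ET
from fastavro import writer, parse_schema

schema = parse_schema({
    "name": "Page", "type": "record",
    "fields": [{"name": "title", "type": "string"},
               {"name": "text", "type": "string"}],
})

def pages(xml_path):
    # Wikipedia dumps namespace their elements; the {*} wildcard ignores that.
    for _, elem in ET.iterparse(xml_path):
        if elem.tag.endswith("page"):
            yield {"title": elem.findtext(".//{*}title") or "",
                   "text": elem.findtext(".//{*}text") or ""}
            elem.clear()  # free memory while streaming through the dump

with open("pages.avro", "wb") as out:
    writer(out, schema, pages("enwiki-sample.xml"))
```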
- Fusality for Stream and Field. Bologna-Jill, Stephen; Duong, Kevin; Ha, Jason Yongjoo; Zurita, Jazmine; Sume, Tinsaye; Smith, Ryan (Virginia Tech, 2017-04-28). This project provides users with a means to organize, graph, and analyze specific data recorded from Stroubles Creek. The website will be utilized by an undergraduate Biological Systems Engineering class to help with their labs that deal with the health of Stroubles Creek. Our team was tasked with improving a website that was created by a past Computer Science capstone team. The website we started with was barely functional and could not yet be used by the undergraduate Biological Systems Engineering class. The website required many modifications in both the front-end interface and the backend. Our team split up into three two-person groups based on skill and desired learning objectives. These teams include a backend team, a front-end user-interface team, and a data graphing team. The main front-end improvements include a complete overhaul of the user interface and the addition of a usable navigation bar that enables users to easily use all features of the website. Backend improvements include major changes to the tables in the MySQL database as well as PHP functions that make utilizing the database extremely easy for the data graphing team. The changes made to the database tables allowed for a more straightforward representation of the data and enabled saving graphs for a specific experiment. Most of the improvements were on the data graphing aspect of the website. Users are now able to analyze six years of data collected from Stroubles Creek. They can analyze this data by creating either line graphs or scatter plots of whatever specific creek data they want. The graphs provide users with the ability to see trends in creek health over the course of many years. Currently, the website is ready to be used by undergraduate Biological Systems Engineering classes. It provides all the functionality that our client required and does so in a clean, easy-to-use manner. Even though the website is ready for use, there are still areas that can be improved upon. These areas include more graph options, easier ways to upload new datasets, graphing large numbers of data points, and the aesthetics of the graphs.
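As a rough illustration of the kind of trend graph the site produces, here is a minimal pandas/matplotlib sketch; the CSV file name and the "date"/"dissolved_oxygen" columns are invented placeholders, since the site itself is built with PHP and MySQL.

```python
# Hypothetical sketch of plotting one creek-health parameter over time; the CSV
# layout and column names are assumptions, not the project's real schema.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("stroubles_creek.csv", parse_dates=["date"]).sort_values("date")

plt.plot(df["date"], df["dissolved_oxygen"], marker=".")
plt.xlabel("Date")
plt.ylabel("Dissolved oxygen (mg/L)")
plt.title("Stroubles Creek water quality over time")
plt.tight_layout()
plt.savefig("creek_trend.png")
```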
- Identifying Drug Related Events from Social Media. Noh, Jeongho; Kim, Sungho; You, Jisu; Yoonju, Lee; Kye, Woojin (Virginia Tech, 2017-05-10). The overall goal of the project was to establish an innovative information system which can automatically detect and extract content related to side effects of drugs from user reviews, determine whether they are talking about effectiveness or adverse drug events, extract keywords or phrases related to effectiveness or adverse drug events, and visualize the resulting information for doctors and patients. Our group was provided with crawled Twitter reviews and social network forum reviews on drugs that are used to treat diabetes. The raw data were manually labeled with four different labels for named entity recognition in order to create training, testing, and validation sets. Using the training data set, a side effect dictionary was created using PamTAT. The side effect dictionary was then refined by removing neutral words to increase accuracy. To validate the accuracy of the generated side effect dictionary, the results of side effect analysis based on the generated dictionary and two other general negative word dictionaries were compared. The generated side effect dictionary performed better in recognizing side effect entities. After validation, the generated dictionary was further tested with a set of user reviews on a drug that is used to treat stroke. Using the generated dictionary, the project was able to accurately determine whether a review mentions a side effect of a specific drug. The project successfully detected mentions of side effects in the reviews with greater than 90% accuracy. The resulting approach can be used to create a similar information system that detects and extracts content related to side effects for other drugs, given a problem-specific dictionary. The project should be further developed to incorporate automatic extraction of user reviews, analysis of data, and visualization of results.
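A minimal sketch of the dictionary-matching idea follows; the handful of side-effect terms and the sample review are invented examples, not the PamTAT-derived dictionary the team built.

```python
# Minimal sketch of dictionary-based side-effect detection; terms are examples only.
side_effect_terms = {"nausea", "dizziness", "headache", "weight gain", "fatigue"}

def find_side_effects(review: str) -> set[str]:
    """Return the dictionary terms that appear in a lowercased review."""
    text = review.lower()
    return {term for term in side_effect_terms if term in text}

review = "Started metformin last month and the nausea and fatigue are rough."
print(find_side_effects(review))  # {'nausea', 'fatigue'}
```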
- Interactive Website for Values Diagnostic Reporting and Analysis. Brizuela, Xavier; Stewart, Caroline; Mistry, Harsh; Eltepu, Sunny (Virginia Tech, 2018-05-11). The goal of this project is to help sustainability professionals and students learn about the values and biases that impact their work as facilitators. This project focuses on an interactive website which contains a values diagnostic used for analysis and reporting, through a survey constructed by Dr. Bruce Hull. The results of the survey show the survey taker where they stand on sustainability issues. The website was created to streamline the process of parsing the data and outputting a result that Dr. Hull can use as a learning tool. The project is broken up into three different parts: the website platform, the data input, and the data output. The website platform needs to be easily manageable; WordPress, a free and open-source content management system, was the best choice. The Qualtrics survey is the main source of data input. The survey will be static to ensure accurate comparisons of previous data to a user's current data. It contains two different types of questions: in the first type, the user is asked to allocate a total global budget of $100 among six choices to determine the outcomes of sustainable development efforts, while in the second type the user is given different scenarios and must choose their degree of agreement on a scale from strongly disagree to strongly agree. The data analytics will be automated using the Qualtrics API and the Pandas library in Python. The data output is a report which clearly communicates the user's values and biases, while also displaying how they compare to previous users, to assist the user in learning about their stance on sustainability.
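A hedged sketch of the reporting step is shown below, assuming the Qualtrics responses have already been exported to CSV (the export itself goes through the Qualtrics API); the six budget column names are invented for illustration.

```python
# Sketch of comparing one respondent's $100 budget allocation with the cohort
# average; column names and the export file are assumptions, not the real schema.
import pandas as pd

budget_cols = ["economy", "education", "environment",
               "health", "infrastructure", "governance"]  # hypothetical columns

responses = pd.read_csv("values_survey_export.csv")

user = responses.iloc[-1][budget_cols]          # the most recent respondent
cohort_mean = responses[budget_cols].mean()     # everyone who came before

report = pd.DataFrame({"your_allocation": user, "average_allocation": cohort_mean})
print(report.round(1))
```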
- Neural Network Doc Summarization. Cheng, Junjie (Virginia Tech, 2018-05-07). This is the Neural Network Document Summarization project for the Multimedia, Hypertext, and Information Access (CS 4624) course at Virginia Tech in the 2018 spring semester. The purpose of this project is to generate a summary of a long document through deep learning. As a result, the outcome of the project is expected to replace part of a human's work. The implementation of this project consists of four phases: data preprocessing, building models, training, and testing. In the data preprocessing phase, the data set is separated into a training set, a validation set, and a testing set, with a 3:1:1 ratio. In each data set, articles and abstracts are tokenized and then transformed into indexed documents. In the model-building phase, a sequence-to-sequence model is implemented in PyTorch to transform articles into abstracts. The sequence-to-sequence model contains an encoder and a decoder. Both are implemented as recurrent neural network models with long short-term memory (LSTM) units. Additionally, an MLP attention model is applied to the decoder to improve its performance. In the training phase, the model iteratively loads data from the training set and learns from it. In each iteration, the model generates a summary according to the input document and compares the generated summary with the real summary. The difference between them is represented by a loss value. According to the loss value, the model performs backpropagation to improve its accuracy. In the testing phase, the validation dataset and the testing dataset are used to test the accuracy of the trained model. The model generates the summary according to the input document. Then the similarity between the generated summary and the real human-produced summary is evaluated with PyRouge. Throughout the semester, all of the above tasks were completed. With the trained model, users can generate CNN/Daily Mail style highlights from an input article.
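To make the encoder-decoder idea concrete, here is a minimal, hedged PyTorch sketch (attention omitted for brevity); the vocabulary size, embedding and hidden dimensions, and the random token batches are placeholders, not the project's settings.

```python
# Minimal sketch (not the project's code) of an LSTM encoder-decoder in PyTorch.
import torch
import torch.nn as nn

VOCAB, EMB, HID = 50_000, 128, 256  # placeholder sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.lstm = nn.LSTM(EMB, HID, batch_first=True)

    def forward(self, article_ids):                 # (batch, src_len)
        return self.lstm(self.embed(article_ids))   # outputs, (h, c)

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.lstm = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, summary_ids, state):          # teacher forcing with gold summary
        outputs, state = self.lstm(self.embed(summary_ids), state)
        return self.out(outputs), state              # logits: (batch, tgt_len, VOCAB)

encoder, decoder = Encoder(), Decoder()
article = torch.randint(0, VOCAB, (2, 40))           # fake batch of article token ids
summary = torch.randint(0, VOCAB, (2, 12))            # fake batch of summary token ids
_, state = encoder(article)                            # encoder state seeds the decoder
logits, _ = decoder(summary, state)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, VOCAB), summary.reshape(-1))
loss.backward()                                        # an optimizer step would follow here
```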
- NRV Tweets and RSS feeds. Roble, Benjamin; Cheng, Justin; Sbitani, Marwan (2014-05-09). The goal of this project was to associate existing data in the Virtual Town Square database from the New River Valley area with topical metadata. We took a database of approximately 360,000 tweets and 15,000 RSS news stories collected in the last two years and associated each RSS story and tweet with topics. The open-source natural language processing library Mallet was used to perform topic modeling on the data using Latent Dirichlet Allocation, which was then used to create a Solr instance of searchable tweets and news stories. Topic modeling was not done around specific events; instead, the entire tweet dataset (and the entire RSS dataset) was used as the corpus. The tweet data was analyzed separately from the RSS stories, so the generated topics are specific to each dataset. This report details the methodology used in our work in the Methodology section and contains a detailed Developer's Guide and User's Guide so that others may continue our work. The client was satisfied with the outcome of this project: even though tweets have generally been considered too short to be run through a topic modeling process, we generated topics for each tweet that appear to be relevant and accurate.
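The LDA step can be sketched in a few lines; the example below uses Gensim rather than the Mallet toolkit the team actually used, and the two tiny token lists stand in for real tweets and news stories.

```python
# Illustrative LDA sketch with Gensim (a substitute for Mallet, which the team used).
from gensim import corpora
from gensim.models import LdaModel

docs = [["creek", "flooding", "road", "closed"],
        ["council", "budget", "vote", "road"]]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)

# Topic mixture for a new, very short document, e.g., a single tweet.
print(lda[dictionary.doc2bow(["road", "closed"])])
```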
- Paleontology Topic Trends. Wilson, James; Martin, Joseph; Cruz, Rudy; Weiler, Eric (Virginia Tech, 2018-04-03). The purpose of the project was to run modern data analysis on abstracts created by the Society of Vertebrate Paleontology. The Society of Vertebrate Paleontology has a yearly convention in which members from all over the world gather and present their studies from the appropriate year. Our client, Professor Sterling Nesbit, provided our group with a collection of abstracts dating back to 1987. Our job was to take all of the abstracts from each year and run analyses to see the trends and patterns spanning all the years in which the Society of Vertebrate Paleontology has published abstracts in collections. The method the team employed changed over the course of the project. In the beginning, the team planned on using Latent Dirichlet Allocation (LDA) to summarize the abstracts. This would find the topics prevalent in the collection and show the mix of those topics found in each of the abstracts. After further discussion with our client, the team decided to provide more straightforward analysis, based on graphing hierarchies in the abstracts. In order to properly run the graphing analysis on the abstracts, our team had to scrape the abstracts to ensure the most useful data was not overlooked in the analysis. The process of scraping the abstracts began with removing all the hypertext markup tags from the abstract text files (which were converted from PDF). Then the team eliminated any English stop words in the text files to remove words that are not commonly needed for analysis. The next step was to customize and add words to this list of stop words, based on yearly differences. For example, in some years the Society of Vertebrate Paleontology required its members to create their abstracts referencing the United States as "The United States of America," while in other years they were required to reference it as "United States." These slight changes required our team to alter our method of stop word elimination to be specific to each year. Once the scraping was done, the team created graphing scripts to produce graphs based on vertebrate paleontology hierarchies. After meeting with our client multiple times to further refine our analysis, we created the final analysis script version. These graphs helped our client visualize the patterns in findings made by the Society of Vertebrate Paleontology. The project should be further developed to automatically extract abstracts from the convention's PDF collection, and to update the stop words based on the society's yearly modifications.
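The cleaning-and-counting idea can be sketched briefly; the file names and the extra year-specific stop words below are assumptions, and simple per-year term frequencies stand in for the hierarchy-based graphs the team ultimately produced.

```python
# Small sketch: strip leftover markup, remove stop words (plus year-specific
# additions), and count term frequencies per year. Paths and stop words are assumed.
import re
from collections import Counter

BASE_STOP = {"the", "a", "an", "and", "of", "in", "to", "is", "was", "for"}
EXTRA_STOP = {"united", "states", "america"}  # normalize the naming differences noted above

def clean_tokens(text: str) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", text)          # drop leftover markup tags
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in BASE_STOP | EXTRA_STOP]

year_counts = {}
for year, path in [(1987, "svp_1987.txt"), (2017, "svp_2017.txt")]:
    with open(path) as f:
        year_counts[year] = Counter(clean_tokens(f.read()))

print(year_counts[2017].most_common(10))
```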
- Product Defect Mining. Villaflor, Elizabeth M.; Golden, Grant D.; Hall, Jack W. W.; Nguyen, Thomas; Peng, Tianchen; Zhang, Shuaicheng (Virginia Tech, 2017-05-01). This project is focused on customer reviews of various product defects. The goal of the project is to use machine learning algorithms to train on sets of these customer reviews in order to easily identify the different defect entities within an unseen review. The identification of these entities will be beneficial to customers, product manufacturers, and governments, as it will shed light on the most common defects for a certain product, as well as common defects across a class of products. Additionally, it will bring to light common resolutions for defect symptoms, including both correct and incorrect resolutions. This project also aims to make contributions to the opinion mining research community. These goals will be accomplished by breaking the project into three main parts: data collection, data labeling, and classifier training. In the data collection phase, a web crawler will be created to pull customer reviews from forum sites in order to create new datasets. For data labeling, datasets, both pre-existing and newly created, will be split into sentences, and each sentence will be assigned a defect entity based on its content. For example, if a sentence describes a product defect, the sentence will be labeled as a symptom, and so on. Finally, in the classifier training portion of the project, machine learning algorithms will be used to classify unlabeled datasets in order to learn what types of words indicate a certain defect entity. While these are the three main aspects of the project, there are other minor phases and categories of work that will be necessary. One of these sub-phases includes designing the database tables that will be used to store the labeled datasets. Throughout the semester the following was accomplished: the creation of a web crawler, the completion of five new datasets, the labeling of five datasets, and preliminary training results based on the linear SVC algorithm. Additionally, the new datasets and labeled datasets were uploaded into the client's preexisting database. The new datasets were collected from the Apple Community, Samsung, and Dell forum boards and include product defect reports for both hardware and software products. Based on the labeling results, and quick scans of the collected data, it was found that many defect reports contain contextual information that is not directly related to the description of either a product defect or its corresponding solution. Additionally, it was found that many reports do not include resolutions, or the resolution did not actually solve the defect described. The linear SVC algorithm used for classifier training was able to accurately predict the label for a sentence about 80% of the time when training and testing occurred on similar products (e.g., two different car models). However, the accuracy was only about 60% at best when used on two completely different products (e.g., cars vs. cellphones). Overall, about 75% of the anticipated work was completed this semester. The work that was completed should provide a good foundation for continued work in the future.
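A small, hedged sketch of the sentence-level classification step follows, using scikit-learn's LinearSVC (the algorithm named above); the toy sentences and labels are invented.

```python
# Sketch of defect-entity classification of sentences with TF-IDF + linear SVC.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

sentences = ["the screen flickers after the update",
             "replacing the battery fixed it",
             "i bought it last year at the mall"]
labels = ["symptom", "resolution", "other"]  # example defect-entity labels

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(sentences, labels)

print(clf.predict(["the laptop keeps overheating"]))  # expected: 'symptom'
```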
- Rdoc2vec: CS4624 Project for Spring 2017. Cooke, Austin; Clark, Jake; Rolph, Steven; Sherrard, Stephen (Virginia Tech, 2017-04-28). This submission includes deliverables for the capstone project Rdoc2vec. It was created by Jake Clark, Austin Cooke, Steven Rolph, and Stephen Sherrard for their client, Eastman Chemical Corporation. Doc2Vec is a machine learning model that creates a vector space whose elements are words from a grouping or several groupings of text. By analyzing several documents, all of the words which occur in these documents are placed into the vector space. The distance between these vectors indicates how similar they are: words which appear in similar contexts have a small distance between them in this vector space. This algorithm has been used by researchers for document analysis, primarily through the Gensim Python library. Our client, Eastman Chemical Corporation, would like to use this approach when working in a language more suited to their business model. Much of their software is statistical and written in R. Thus, our job had the following components: become familiar with Doc2vec and R, develop Rdoc2vec, and apply it to parse documents, create a vector space, and run tests. First, to become familiar with the language, we spent a few weeks with tutorials, including the Lynda library provided by Virginia Tech. After we felt we were familiar with the language, we learned about two of the dominant algorithms used, called Distributed Bag-of-Words (DBOW) and Distributed Memory (DM). After learning these two algorithms, we felt that we were prepared to begin development. Second, we developed a class structure similar to that of Gensim. Keeping this as a skeleton, we developed a parsing algorithm which would be used to train the model. The parser analyzes the documents and computes a frequency for the occurrence of each word. The parser itself takes a list of physical documents stored on the system and completes the analysis, passing the frequency of words along the pipeline. The next step was to create a neural network for training the model. We elected to use the built-in neural network library written in R called nnet. A neural network takes an initial input vector as a parameter. For our purposes, it made sense to use a "one-hot" vector, which has a single non-zero entry. This can cut down on later calculation, because only one row of the input weights contributes to the hidden layer. This input is multiplied by several weights to be put into a hidden layer, handled by the nnet library. The values in the hidden layer are multiplied again by several weights to go into the output layer. After creating functions which called the nnet library, we began work on testing. In the meantime, we decided to begin a design of our own implementation of a neural network. By creating a neural network anew, we get around the major problem with nnet, which is optimization. Since nnet is a black box that we cannot affect, we cannot be sure that it is optimized for our application. Since we use "one-hot" vectors, which are not a default application, it is likely that there is some way we can improve the speed in our library. We were not able to finish and test our neural net, so it is something left for future groups to work on. Finally, we began testing. We created a Web scraper which grabbed a number of articles from Wikipedia. We used this scraper to get a number of different documents; specifically, we scraped information on the congressional districts of several states. This gave us document sets which can be quite large when using several states, or smaller when analyzing individual states. We performed tests on these datasets, the results of which we kept with our code.
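For reference, here is a minimal sketch of the Gensim Doc2Vec workflow that Rdoc2vec mirrors, shown in Python since that is where the reference implementation lives; the two toy documents stand in for the scraped congressional-district articles.

```python
# Sketch of the Gensim Doc2Vec workflow (Python reference, not the R port itself).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=["virginia", "first", "congressional", "district"], tags=["va-1"]),
        TaggedDocument(words=["ohio", "second", "congressional", "district"], tags=["oh-2"])]

# dm=1 selects the Distributed Memory algorithm; dm=0 would select DBOW.
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40, dm=1)

# Infer a vector for an unseen document and find the most similar training document.
vec = model.infer_vector(["texas", "congressional", "district"])
print(model.dv.most_similar([vec], topn=1))
```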
- Satellite Image Finder Parking Lot & Spots. Jahnig, Patrick; Lambrides, Alex; Le, Khoa; Wolfe, Thomas (Virginia Tech, 2018-05-02). Satellite imagery in recent years has drastically increased in both quality and quantity. Today, the problem is too much data. Map features such as roads, buildings, and other points of interest are mainly extracted manually, and we just don't have enough humans to carry out this mundane task. The goal of this project is to develop a tool that automates this process. Specifically, the focus of this project is to extract parking lots using Object Based Imagery Analysis. The final deliverable is a Python tool that uses machine learning algorithms to identify and extract parking lots from high resolution satellite imagery. This project was divided into two main steps: labeling data and training an algorithm. For the first step, the project team gathered a large dataset of satellite imagery in the form of GeoTIFFs, used GDAL to convert these files into JPEG image files, and used labelImg to label the images. The labelling consisted of creating an XML layer corresponding to each GeoTIFF image, where the XML layer contained bounding boxes outlining each parking lot. With all of the training data labeled, the next step was training the algorithm. The project lead tried several different models for the learning algorithm, with the final model being based on Faster RCNN. After training, the project team tested the model and determined the accuracy was too low, so the team decided to obtain and label more images to improve it. Once the accuracy met the determined standards, a script was built that takes a GeoTIFF image as input, converts it to a JPEG image, runs the image through the model to detect any parking lots and output bounding boxes depicting those parking lots, and finally converts these bounding boxes into a single GeoJSON file. The main use case of the application is quickly finding parking lots with relative accuracy in satellite imagery. The model can also be built upon to be improved or used in related tasks, for example detecting individual parking spots. The project has managed to achieve the expected goals using labelImg and a Faster RCNN model. However, due to a limitation of labelImg, the model cannot detect parking lots that are not horizontal or vertical. The project team researched several methods to solve this problem but was not able to fully implement a suitable solution due to time and infrastructure constraints. The team has described all of its research in this final report so that those who want to improve on this project will have a good starting point. Note that there are some additional files that had to be uploaded onto Google Drive: https://drive.google.com/open?id=1istIOQqsQdw43Ty08KdoY64qBUUlI9D_ and https://drive.google.com/open?id=1_EPq0hgRfSsLOPsXiVWV4d2Dz8yxmm4h
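The GeoTIFF-to-JPEG conversion step can be sketched with GDAL's Python bindings; the file name is a placeholder, and the auto-scaling option is an assumption about how raw pixel values were mapped into the 0-255 range a JPEG expects.

```python
# Sketch of converting one GeoTIFF tile to JPEG with GDAL; file names are placeholders.
from osgeo import gdal

src = gdal.Open("tile_001.tif")
# gdal.Translate handles the format change; scaleParams=[[]] asks GDAL to
# auto-scale band values into the byte range (an assumption about the workflow).
gdal.Translate("tile_001.jpg", src, format="JPEG", scaleParams=[[]])
src = None  # close the dataset
```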
- Save the Penguins. Avant, Joey; Merryman, David (2012-05-06). The purpose of this project was to create a promotional video for Studio STEM's after-school 'Save the Penguins' program. This program was created to get middle-school-aged students interested in critical thinking and in performing experiments using the scientific method. Students would have an ice cube which represented a penguin, and they would construct a house for it out of different materials to protect it from the sun (a heat lamp). The materials would be tested to see which are the most effective at insulating the house from the heat lamp. The students would design their house based on data they obtained from experiments, and then further refine their design based on how well it fared in tests.
- Text Transformation. Thompson, Dustin; Henke, Zach; Cox, Kevin; Fenton, Kevin (2015-05-14). The purpose of this project is to assist VTTI in converting a large citation file into a CSV file for ease of access. It required us to develop an application which can parse through a text file of citations and determine how to properly put the data into CSV format. We designed the program in Java and developed a user interface using JavaFX, which is included in the latest edition of Java. We came up with two main tools: the developer tool and the parsing program itself. The developer tool is used to build a tree made up of regular expressions which would be used in parsing the citations. The top nodes of the tree would be very general regexes, and the leaf nodes of the tree would become much more specific. This program can export the regex tree as a binary file which will be used by the main parsing program. The main parsing program takes three inputs: a binary regex tree file, a citation text file, and an output location. Once run, it parses the citations based on the tree it was given. It outputs the parsed citations into a CSV file with the citations separated by field. Any citations that the program is unable to process are dumped into a failed-output text file so that they can be handled separately. We also created an additional program as an alternative solution to ours. It uses Brown University's FreeCite parsing program and then outputs parsed citations to a CSV file.
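Although the team's tool is written in Java, the regex-to-CSV idea (including the failed-output file) can be illustrated with a tiny Python sketch; the citation pattern and the sample lines are invented.

```python
# Illustration of regex-based citation parsing to CSV (not the team's Java tool).
import csv
import re

pattern = re.compile(r"^(?P<authors>.+?) \((?P<year>\d{4})\)\. (?P<title>.+?)\.")

citations = ["Smith, J. (2010). A study of driver behavior. Journal of Safety.",
             "malformed citation line"]

with open("citations.csv", "w", newline="") as out, open("failed.txt", "w") as failed:
    w = csv.DictWriter(out, fieldnames=["authors", "year", "title"])
    w.writeheader()
    for line in citations:
        m = pattern.match(line)
        if m:
            w.writerow(m.groupdict())
        else:
            failed.write(line + "\n")  # mirrors the 'failed output' file idea
```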
- Tweet Collections. Chenault, Kirk P.; Keener, Chris L.; Chang, Brandon P.; Widrig, Joseph (Virginia Tech, 2018-05-07). Over the past decade, social media use has grown exponentially. More and more people are using social networks to connect and communicate with one another, which has given rise to a new source of data for social media analysis. Since Twitter is one of the largest platforms for text-based user input, many tools have been created to analyze data from this social media network. The TweetCollections project is designed to analyze large amounts of tweet collection metadata and provide additional information that makes the tweet collections easy to categorize and study. Our clients, Liuqing Li and Ziqian Song, have provided our team with a set of tweet collections and have asked us to assign metadata to them so that future researchers are able to easily find relevant collections. This includes assigning tags and categories, as well as a description with an accompanying source. Formerly, this process had been done by hand. While this improves the accuracy of the data collected, it is too expensive and time-consuming to maintain. Our team has been tasked with speeding up the process, using scripts to find information for these fields and fill them out. The majority of the technology used in our approach is Python and its many libraries. Python has made it easy to quickly parse through our tweet collection data by treating the input as an Excel file, as well as pulling other relevant information from third-party sources like Wikipedia. The driver will create a new, updated Excel file with the additional data, categories, and tags. The GETAR team has created over 1,400 tweet collections, containing over two billion tweets. To help categorize this data, they also store metadata about these collections in a Comma Separated Value (CSV) file. This project will result in a product that takes a CSV file of the archive of tweet collection metadata as input, with the required fields (such as "Keyword" and/or "Date") filled in, and produces a separate CSV file as output with missing fields filled in. The overarching problem is that each category term is rather vague, and more data will need to be pulled out of this term. Additionally, an ontology will be produced and serve as a reference for categorizing topics listed in the fields from the input. The completed project contains three Python scripts: csv_parser.py, search_wikipedia.py, and GUI.py. Together, these create a program that takes an input CSV file and an integer range specifying which lines to run, and then returns a new CSV file with the additional metadata filled in. Also included with the deliverable is a populated Excel file with over 150 additional entries of metadata, and an error file containing recommendations for the ontology. These recommendations are generated from any results our driver determines to be of 'low relevance', returning options with a higher term frequency.
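A hedged sketch of the metadata-filling idea follows, using the third-party `wikipedia` package to look up a collection's keyword; the CSV column names ("Keyword", "Description", "Source") are assumptions, not the project's exact schema.

```python
# Sketch of filling missing description/source fields from Wikipedia; column names assumed.
import csv

import wikipedia

with open("collections.csv", newline="") as src, \
     open("collections_filled.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if not row.get("Description"):
            try:
                page = wikipedia.page(wikipedia.search(row["Keyword"])[0])
                row["Description"] = page.summary.split("\n")[0]
                row["Source"] = page.url
            except Exception:
                pass  # leave the fields blank if no good match is found
        writer.writerow(row)
```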
- Tweet URL Analysis. Li, Liyan; Lyu, Kehan; Sun, Guoxin (Virginia Tech, 2018-05-02). The goal of the GETAR project is to devise interactive, integrated, digital library/archive systems coupled with linked and expert-curated web-page/tweet collections. In this class team project, the URL analysis system we designed takes a tweet collection as input and uses Hadoop and Spark to extract short URLs. We expanded them, fetched each web page with the corresponding long URL, and applied the WayBack CDX Server API to attempt to restore the most likely snapshot. Then, we conducted a systematic URL analysis for different types of events. We analyzed nine tweet collections in four categories: Nature, Health, Man-made, and Particular Event. Each tweet collection contains the tweet content from 2013-2017 related to a specific keyword. For each collection, we analyzed several characteristics of the URLs, the top-k domains of the URLs, the URL retrieval rate, and the URL retrieval rate boosted by using the WayBack CDX Server API. We provided several visualizations of the results we analyzed from these nine tweet collections. We have refined this project so that it is easy to build on; see section 5 (Developer Manual) in the final report for details.
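The two lookups in that pipeline, expanding a shortened URL and then asking the Wayback CDX server for the closest snapshot, can be sketched as follows; the short URL is a placeholder.

```python
# Sketch of URL expansion plus a Wayback CDX snapshot lookup.
import requests

short_url = "https://t.co/example"  # placeholder short link

# 1. Expand: following redirects reveals the final long URL.
resp = requests.head(short_url, allow_redirects=True, timeout=10)
long_url = resp.url

# 2. Ask the Wayback CDX server for an archived snapshot of that long URL.
cdx = requests.get("http://web.archive.org/cdx/search/cdx",
                   params={"url": long_url, "output": "json", "limit": 1},
                   timeout=10)
rows = cdx.json()
if len(rows) > 1:                        # the first row is the header
    timestamp, original = rows[1][1], rows[1][2]
    print(f"https://web.archive.org/web/{timestamp}/{original}")
```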
- Tweet URL Extraction Crawling. Bridges, Chris; Chun, David; Tat, Carter (Virginia Tech, 2018-05-02). In the report and supplemental code, we document our work on the Tweet URL extraction project for CS4624 (Multimedia/Hypertext/Information Access) during the spring 2018 semester at Virginia Tech. The purpose of this project is to aid our client, Liuqing Li, with his research in archiving digital content, part of the Global Event and Trend Archive Research (GETAR) project supported by NSF (IIS-1619028 and 1619371). The project requires tweet collections to be processed to find the links most relevant to their respective events, which can be integrated into the digital library. The client has more than 1,400 tweet collections with over two billion tweets, and our team found a solution that uses machine learning to deliver event-related representative URLs. Our client requested that we use a fast scripting language to build middleware to connect a large tweet collection to an event-focused URL crawler. To make sure we had a representative data set during development, much of our development centered around a specific tweet collection, which focuses on the school shooting that occurred at Marshall County High School in Kentucky, USA on January 23, 2018. The event-focused crawler will take the links we provide and crawl them for the purpose of collecting and archiving them in a digital library/archive system. Our deliverables contain the following programs: extract.py, model.py, create_model.py, and conversion.py. Using the client's tweet collection as input, extract.py scans the comma separated values (CSV) files and extracts the links from tweets containing links. Because Twitter enforces a character limit on each tweet, all links are initially shortened; extract.py converts each link to a full URL and then saves them to a file. The links at this stage are separate from the client's tweet collection and are ready to be made into testing and training data. All of the crucial functionality in our program is supported by open source libraries, so our program did not require any funds to develop. Further development of our software could create a powerful solution for our client. We believe certain functions within our code could be reused and improved upon, such as the extractor, the model, and the data we used for testing and training.
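As an illustration of the extract-and-expand role that extract.py plays (not its actual code), here is a short sketch; the CSV column name "text" and the file names are assumptions.

```python
# Sketch: find t.co links in a tweet CSV, expand each by following redirects,
# and write the full URLs to a file ready for the event-focused crawler.
import csv
import re

import requests

URL_RE = re.compile(r"https?://t\.co/\w+")

def expand(short_url: str) -> str:
    """Follow redirects to recover the full URL behind a shortened link."""
    try:
        return requests.head(short_url, allow_redirects=True, timeout=10).url
    except requests.RequestException:
        return short_url

with open("marshall_tweets.csv", newline="", encoding="utf-8") as f, \
     open("urls.txt", "w") as out:
    for row in csv.DictReader(f):
        for short in URL_RE.findall(row.get("text", "")):
            out.write(expand(short) + "\n")
```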
- Twitter Equity Firm Value. Smith, Jacob; Wiskur, Christian; Guinn, Nathaniel; Agren, Erik; Rane, Rohan (Virginia Tech, 2018-05-09). We analyzed how a company's response on social media (Twitter) can affect its stock market value following a data breach. Given a list of all data breaches since 2006, we collected each company's stock value for 150 days before the data breach and 120 days after. Using a Fama-French model, we computed an abnormality value that estimated how the stock would have performed if no data breach had occurred. While doing this, we simultaneously collected tweets from the companies and customers about the data breach. We wanted to compare the stock performance to factors such as the number of replies from a company, customer tweet sentiment, and links tweeted by the company. All of this work was done by building Python scripts for each of the functionalities. When scraping the tweets, the user only needs to supply a CSV file with the company's Twitter handle and company name. The other Python scripts compute the abnormality difference from the client's Fama-French model, scrub the stock data down to the needed date range, compute tweet sentiment, and grab client profiles. Our conclusion was that companies need to make few but comprehensive announcement tweets to decrease reply tweets. This could keep the sentiment of client tweets positive. Lastly, companies need to focus on replying to customer tweets to also keep sentiment positive.
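The abnormal-return idea can be sketched briefly: fit a Fama-French three-factor model on the pre-breach window, then compare actual post-breach returns with what the fitted model predicts. The factor and return series below are random placeholders purely to make the snippet runnable.

```python
# Sketch of abnormal returns via a Fama-French three-factor regression (random data).
import numpy as np

rng = np.random.default_rng(0)
n_pre, n_post = 150, 120
true_betas = np.array([1.1, 0.2, -0.1])  # made-up loadings on market, SMB, HML

# Pre-breach window: daily factor returns and the stock's returns.
factors_pre = rng.normal(0, 0.01, size=(n_pre, 3))
stock_pre = factors_pre @ true_betas + rng.normal(0, 0.005, n_pre)

# Estimate intercept + betas on the pre-breach window by least squares.
X = np.column_stack([np.ones(n_pre), factors_pre])
betas, *_ = np.linalg.lstsq(X, stock_pre, rcond=None)

# Post-breach window: abnormal return = actual return minus the model's expectation.
factors_post = rng.normal(0, 0.01, size=(n_post, 3))
stock_post = factors_post @ true_betas + rng.normal(0, 0.005, n_post)
expected = np.column_stack([np.ones(n_post), factors_post]) @ betas
abnormal = stock_post - expected
print("cumulative abnormal return:", abnormal.sum())
```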
- Vertebrate Map Visualization. Duncan, Courtney; Garcia-Neal, Christian; Mehdi, Wasay; Urcia, Andre (Virginia Tech, 2019-05-12). Our client, Dr. Mims, and a team of researchers collected trait data on lesser-known vertebrate species in the northwestern United States. The goal of this research was to find links from traits to climate change vulnerability. She then published her data in a report that was made available through VertNet. Since the research comes from publicly available museum records, it is only fitting to create a publicly accessible website to not only provide access to the research but also engage the public on this important issue. The goal of our project was to make a multi-page website with quick links, resources, and research, all attached to their respective vertebrates/species. We also made sortable lists of the species based on their trait data. Also included with our website is a manual on how to extend or maintain the website for future use and extensibility when we are no longer working on it. Another focus of the website is an informative visualization/infographic map that allows users to investigate the data on the species and their populations in different regions. Different parts of the map should be linked from each species' individual page for easy association of information. Further goals include advancing the infographic/visualization map to accept input that clarifies or maintains interest in the relevant data, with easy-to-understand controls for detailing or generalizing parts of the map to meet criteria for different areas of interest or research. Included are the files of trait data given to us by Dr. Mims and our final presentation. This trait data is for the species represented by our website.
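A hedged sketch of a species-occurrence map of the kind described, using the folium library; the species name and coordinates are invented examples, not data from Dr. Mims's report.

```python
# Sketch of an interactive occurrence map with folium; all data here is invented.
import folium

m = folium.Map(location=[44.0, -117.0], zoom_start=5)  # roughly the northwestern US

occurrences = [("Columbia spotted frog", 44.5, -116.8),
               ("Columbia spotted frog", 45.2, -118.1)]

for species, lat, lon in occurrences:
    folium.CircleMarker(location=[lat, lon], radius=5, popup=species).add_to(m)

m.save("species_map.html")  # each species page could link to a view like this
```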
- Visual Displays of School Shooting Data. Woodson, Tianna; Simmons, Gabriel; Park, Peter; Doan, Tomy; Keys, Evan (Virginia Tech, 2018-05-02). In order to understand and track emerging trends in school violence, there is no better resource than our current population. Sixty-eight million Americans have a Twitter account, and with the help of the GETAR (Global Event and Trend Archive Research) project, we were able to create datasets of tweets related to 10 school shooting events. We also retrieved the URLs of news headlines relating to the same shootings. Our job was to use both datasets to develop visualizations that may depict emerging trends. Based on the data that we had available, we came up with ideas such as word clouds, maps, and timelines. The goal was to choose appropriate representations that would provide insight into the changing conversation America was having about gun violence. We were successful in creating these visuals and then shifted our focus to cleaning our data.
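One of the visualizations mentioned, a word cloud, can be sketched with the third-party `wordcloud` package; the sample tweet texts are invented.

```python
# Sketch of building a word cloud from tweet text; the sample tweets are made up.
import matplotlib.pyplot as plt
from wordcloud import STOPWORDS, WordCloud

tweets = ["Thoughts with the students and teachers today",
          "Another school shooting, when will lawmakers act",
          "Students organizing a walkout for gun reform"]

cloud = WordCloud(width=800, height=400, stopwords=STOPWORDS,
                  background_color="white").generate(" ".join(tweets))

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.savefig("school_shooting_wordcloud.png")
```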