Browsing by Author "Bendelac, Alon"
Now showing 1 - 2 of 2
Results Per Page
Sort Options
- Collection Management Tobacco Settlement Documents (CMT) CS5604 Fall 2019Muhundan, Sushmethaa; Bendelac, Alon; Zhao, Yan; Svetovidov, Andrei; Biswas, Debasmita; Marin Thomas, Ashin (Virginia Tech, 2019-12-11)Consumption of tobacco causes health issues, both mental and physical. Despite this widely known fact, tobacco companies had sustained their huge presence in the market over the past century owing to a variety of successful marketing strategies. This report documents the work of the Collection Management Tobacco Settlement Documents (CMT) team, the data ingestion team for the tobacco documents. We deal with an archive of tobacco documents that were produced during litigation between the United States and seven major tobacco industry organizations. Our aim is to process these documents and assist Dr. David M. Townsend, an assistant professor at Virginia Polytechnic Institute and State University (Virginia Tech) Pamplin College of Business, in his research towards understanding the marketing strategies of the tobacco companies. The team is part of a larger initiative: to build a state-of-the-art information retrieval and analysis system. We handle over 14 million tobacco settlement documents as part of this project. Our tasks include extracting the data as well as metadata from these documents. We cater to the needs of the ElasticSearch (ELS) team and the Text Analytics and Machine Learning (TML) team. We provide tobacco settlement data in suitable formats to enable them to process and feed the data into the information retrieval system. We have successfully processed both the metadata and the document texts into a usable format. For metadata, this involved collaborating with the above-mentioned teams to come up with a suitable format. We retrieved the metadata from a MySQL database and converted it into a JSON for Elasticsearch ingestion. For the data, this involved lemmatization, tokenization, and text cleaning. We have supplied the entire dataset to the ELS and TML teams. Data, as well as metadata of these documents, were cleaned and provided. Python scripts were used to query the database and output the results in the required format. We also closely interacted with Dr. Townsend to understand his research needs in order to guide the Front-end and Kibana (FEK) team in terms of insights about features that can be used for visualizations. This way, the information retrieval system we build would add more value to our client.
- DR_BEV: Developer Recommendation Based on Executed VocabularyBendelac, Alon (Virginia Tech, 2020-05-28)Bug-fixing, or fixing known errors in computer software, makes up a large portion of software development expenses. Once a bug is discovered, it must be assigned to an appropriate developer who has the necessary expertise to fix the bug. This bug-assignment task has traditionally been done manually. However, this manual task is time-consuming, error-prone, and tedious. Therefore, automatic bug assignment techniques have been developed to facilitate this task. Most of the existing techniques are report-based. That is, they work on bugs that are textually described in bug reports. However, only a subset of bugs that are observed as a faulty program execution are also described textually. Certain bugs, such as security vulnerability bugs, are only represented with a faulty program execution, and are not described textually. In other words, these bugs are represented by a code coverage, which indicates which lines of source code have been executed in the faulty program execution. Promptly fixing these software security vulnerability bugs is necessary in order to manage security threats. Accordingly, execution-based bug assignment techniques, which model a bug with a faulty program execution, are an important tool in fixing software security bugs. In this thesis, we compare WhoseFault, an existing execution-based bug assignment technique, to report-based techniques. Additionally, we propose DR_BEV (Developer Recommendation Based on Executed Vocabulary), a novel execution-based technique that models developer expertise based on the vocabulary of each developer's source code contributions, and we demonstrate that this technique out-performs the current state-of-the-art execution-based technique. Our observations indicate that report-based techniques perform better than execution-based techniques, but not by a wide margin. Therefore, while a report-based technique should be used if a report exists for a bug, our results should provide confidence in the scenarios in which only execution-based techniques are applicable.