Browsing by Author "Dhar, Siddharth"
- Big Data Text Summarization - Hurricane Irma
  Chava, Raja Venkata Satya Phanindra; Dhar, Siddharth; Gaur, Yamini; Rambhakta, Pranavi; Shetty, Sourabh (Virginia Tech, 2018-12-13)
  With the increased rate of content generation on the Internet, there is a pressing need for tools that automate the process of extracting meaningful data. Big data analytics deals with researching patterns or implicit correlations within a large collection of data. There are several sources of data, such as news websites, social media platforms (for example, Facebook and Twitter), sensors, and other IoT (Internet of Things) devices. Social media platforms like Twitter prove to be important sources of data since the level of activity increases significantly during major events such as hurricanes, floods, and events of global importance. To generate summaries, we first converted the WARC file we were given into JSON format, which was easier to work with. We then cleaned the text by removing boilerplate and redundant information. After that, we removed stopwords and collected the most important words occurring in the documents. This ensured that the resulting summary would contain the key information from our corpus and would still be able to answer all the questions. One challenge at this point was deciding how to correlate words in order to extract the most relevant words from a document; we tried several techniques, such as TF-IDF, to resolve this. Correlation of different words with each other is an important factor in generating a cohesive summary because, while a word may not be in the list of most commonly occurring words in the corpus, it could still be relevant and give significant information about the event. Because Hurricane Irma occurred around the same time as Hurricane Harvey, a large number of documents were not about Hurricane Irma; all such documents were eliminated as non-relevant. Classifying documents as relevant or non-relevant ensured that our deep learning summaries were not generated from data that was not crucial to building the final summary. Initially, we attempted to use Mahout classifiers, but the results were not satisfactory. Instead, we used a much simpler word-filtering approach for classification, which eliminated a significant number of documents by classifying them as non-relevant. We used the Pointer-Generator technique, which implements a Recurrent Neural Network (RNN), to build the deep learning abstractive summary. We combined data from multiple relevant documents into a single document and thus generated multiple summaries, each corresponding to a set of documents. We wrote a Python script to post-process the generated summary, converting every alphabetic character that follows a period and a space to uppercase. This was important because the whole dataset is converted to lowercase for lemmatization, stopword removal, and POS tagging. The script also converts the first alphabetic character of every POS-tagged proper noun to uppercase. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is used to evaluate the generated summary against the Gold Standard summary. The abstractive summary returns good evaluation results when compared with the Gold Standard on the ROUGE_sent evaluation. The ROUGE_para and cov_entity evaluation results were not up to the mark, but we feel that was mainly due to the writing style of the Gold Standard, as our abstractive summary was able to provide most of the information related to Hurricane Irma.
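The post-processing step described in this abstract (uppercasing the first letter after a period and a space, and restoring capitalization of POS-tagged proper nouns) can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual script; the use of NLTK for tokenization and POS tagging, and the function names, are assumptions.

```python
import re
import nltk  # assumes the NLTK tokenizer and POS-tagger models are installed

def capitalize_after_periods(text):
    """Uppercase the first alphabetic character after a period and a space."""
    text = re.sub(r'(\. +)([a-z])', lambda m: m.group(1) + m.group(2).upper(), text)
    # Also capitalize the very first character of the summary.
    return text[:1].upper() + text[1:]

def capitalize_proper_nouns(text):
    """Uppercase tokens that the POS tagger marks as proper nouns (NNP/NNPS)."""
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    restored = [tok.capitalize() if tag in ('NNP', 'NNPS') else tok
                for tok, tag in tagged]
    # Re-joining with spaces is a rough approximation of the original spacing.
    return ' '.join(restored)

summary = "hurricane irma made landfall in florida. residents were evacuated."
print(capitalize_proper_nouns(capitalize_after_periods(summary)))
```

The regex handles sentence-initial capitalization, while the POS tags drive the proper-noun restoration; a production script would also need to preserve the original token spacing.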
- Only pay for what you need: Detecting and removing unnecessary TEE-based code
  Liu, Yin; Dhar, Siddharth; Tilevich, Eli (Elsevier, 2022-06-01)
  A Trusted Execution Environment (TEE) provides an isolated hardware environment for sensitive code and data to protect a system's integrity and confidentiality. As we discovered, programmers tend to overuse TEE protection. When they place non-sensitive code in TEE, the trusted computing base (TCB) grows unnecessarily, leading to long execution latencies and large attack surfaces. To address this problem, we first study a representative sample of open-source projects to uncover how TEE is utilized in real-world software. To facilitate the process of removing non-sensitive code from TEE, we introduce TEE Insourcing, a new type of software refactoring that identifies and removes unnecessary program parts from TEE. We implemented TEE Insourcing as the TEE-DRUP framework, which operates in three phases: (1) a variable sensitivity analysis designates each variable as sensitive or non-sensitive; (2) a TEE-aware taint analysis identifies non-sensitive TEE-based functions; (3) a fully declarative program transformation automatically moves these functions out of TEE. Our evaluation demonstrates that our approach is correct, effective, and usable. By deploying TEE-DRUP to discover and remove unnecessary TEE code, programmers can both reduce the TCB's size and improve system performance.
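The abstract describes TEE-DRUP's first two phases (variable sensitivity analysis and TEE-aware taint analysis) only at a high level. As a rough, hypothetical illustration of the underlying idea rather than TEE-DRUP's actual implementation or API, the Python sketch below propagates sensitivity from variables to the TEE-based functions that use them and flags the remaining functions as candidates to move out of the TEE; all data structures and names are invented for this example.

```python
# Toy illustration (not TEE-DRUP itself): propagate sensitivity from variables to
# the functions that use them, then flag TEE-based functions that never touch
# sensitive data as candidates to move out of the TEE.

# Hypothetical inputs: variables marked sensitive, the variables each TEE-based
# function reads or writes, and which functions pass data to which callees.
sensitive_vars = {"private_key", "user_pin"}
uses = {
    "seal_key":    {"private_key"},
    "verify_pin":  {"user_pin"},
    "format_log":  {"log_buffer"},
    "render_menu": {"menu_items"},
}
calls = {"verify_pin": {"format_log"}}  # verify_pin passes data into format_log

def tainted_functions(sensitive_vars, uses, calls):
    """Return the set of functions that (transitively) handle sensitive data."""
    tainted = {f for f, vs in uses.items() if vs & sensitive_vars}
    changed = True
    while changed:  # propagate taint along call edges until a fixed point
        changed = False
        for caller, callees in calls.items():
            if caller in tainted:
                for callee in callees - tainted:
                    tainted.add(callee)
                    changed = True
    return tainted

keep_in_tee = tainted_functions(sensitive_vars, uses, calls)
move_out = set(uses) - keep_in_tee
print("keep in TEE:", sorted(keep_in_tee))   # functions that touch sensitive data
print("move out of TEE:", sorted(move_out))  # non-sensitive candidates for insourcing
```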
- Optimizing TEE Protection by Automatically Augmenting Requirements Specifications
  Dhar, Siddharth (Virginia Tech, 2020-06-03)
  An increasing number of software systems must safeguard their confidential data and code, referred to as critical program information (CPI). Such safeguarding is commonly accomplished by isolating CPI in a trusted execution environment (TEE), with the isolated CPI becoming a trusted computing base (TCB). TEE protection incurs heavy performance costs, as TEE-based functionality is expensive to both invoke and execute. Despite these costs, projects that use TEEs tend to have unnecessarily large TCBs. Based on our analysis, developers often put code and data into TEE for convenience rather than for protection, thus not only compromising performance but also reducing the effectiveness of TEE protection. For TEEs to provide maximum benefit in protecting CPI, their usage must be systematically incorporated into the entire software engineering process, starting from Requirements Engineering. To address this problem, we present a novel approach that incorporates TEEs in the Requirements Engineering phase by using natural language processing (NLP) to classify the software requirements that are security critical and should be isolated in TEE. Our approach takes as input a requirements specification and outputs a list of annotated software requirements. The annotations recommend to the developer which corresponding features comprise CPI that should be protected in a TEE. Our evaluation results indicate that our approach identifies CPI with a high degree of accuracy, helping incorporate the safeguarding of CPI into Requirements Engineering.
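The abstract does not specify which NLP model classifies requirements as security critical. As a hedged baseline sketch of the general idea, not the thesis's actual approach, the example below trains a TF-IDF plus logistic regression classifier with scikit-learn on a handful of invented requirements and labels.

```python
# Minimal sketch: classify requirements as security-critical (CPI) or not.
# The model choice (TF-IDF + logistic regression) and the sample requirements
# below are assumptions made for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

requirements = [
    "The system shall encrypt stored payment credentials.",
    "The system shall display the weekly weather forecast.",
    "The system shall verify user PINs before unlocking the device.",
    "The system shall let users change the interface language.",
]
labels = [1, 0, 1, 0]  # 1 = security critical (candidate for TEE), 0 = not

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(requirements, labels)

new_req = "The system shall sign firmware updates with a device key."
print("isolate in TEE?", bool(clf.predict([new_req])[0]))
```

In practice such a classifier would be trained on a labeled corpus of real requirements specifications and its predictions attached to the requirements as annotations, as the abstract describes.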
- Toward an Intelligent Crawling Scheduler for Archiving News Websites Using Reinforcement Learning
  Wang, Xinyue; Ahuja, Naman; Llorens, Nathaniel; Bansal, Ritesh; Dhar, Siddharth (Virginia Tech, 2019-12-03)
  Web crawling is one of the fundamental activities for many kinds of web technology organizations and companies, such as the Internet Archive and Google. While companies like Google often focus on content delivery for users, web archiving organizations such as the Internet Archive pay more attention to the accurate preservation of the web. Crawling accuracy and efficiency are major concerns in this task. An ideal crawling module should be able to keep up with the changes in the target web site with minimal crawling frequency, maximizing routine crawling efficiency. In this project, we investigate using information from web archives' history to help the crawling process, within the scope of news websites. We aim to build a smart crawling module that can accurately predict web content change, at both the web page and web site structure level, using modern machine learning algorithms and deep learning architectures. By the end of the project, we had collected and processed raw web archive collections from Archive.org and from our own frequent crawling jobs; developed methods to extract identical copies of web page content and web site structure from the web archive data; implemented baseline models for predicting web page content change and web site structure change with supervised machine learning algorithms; and implemented two different reinforcement learning models for generating a web page crawling plan: a continuous prediction model and a sparse prediction model. Our results show that the reinforcement learning model has the potential to work as an intelligent web crawling scheduler.
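The abstract mentions continuous and sparse reinforcement learning prediction models without further detail. As a toy sketch of the general idea of an RL-based crawl scheduler, not the project's actual models, the example below uses tabular Q-learning to decide each day whether to recrawl a page, rewarding crawls that capture a change and penalizing wasted crawls; the change probability and all hyperparameters are illustrative.

```python
import random

# Toy RL crawl scheduler (illustrative only): state = days since the last crawl
# (capped), actions = wait (0) or crawl (1). A crawl that captures a change earns
# +1; a crawl of an unchanged page costs -0.2. The page changes on any given day
# with a fixed, made-up probability.
CHANGE_PROB, MAX_STATE, ALPHA, GAMMA, EPS = 0.3, 7, 0.1, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(MAX_STATE + 1) for a in (0, 1)}

state, pending_change = 0, False
for step in range(20000):
    # Epsilon-greedy action selection over the two actions.
    if random.random() < EPS:
        action = random.choice((0, 1))
    else:
        action = max((0, 1), key=lambda a: Q[(state, a)])
    pending_change = pending_change or (random.random() < CHANGE_PROB)
    if action == 1:  # crawl now
        reward = 1.0 if pending_change else -0.2
        next_state, pending_change = 0, False
    else:            # wait another day
        reward = 0.0
        next_state = min(state + 1, MAX_STATE)
    # Standard Q-learning update toward the bootstrapped target.
    best_next = max(Q[(next_state, a)] for a in (0, 1))
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
    state = next_state

# Learned policy per days-since-last-crawl state (0 = wait, 1 = crawl).
print([max((0, 1), key=lambda a: Q[(s, a)]) for s in range(MAX_STATE + 1)])
```

The project's models additionally learn from archived change history rather than a fixed change probability, but the reward structure above captures the crawl-efficiency trade-off the abstract describes.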