Browsing by Author "Mehrotra, Maanav"
Now showing 1 - 2 of 2
- Generating Canonical Sentences from Question-Answer Pairs of Deposition Transcripts
  Mehrotra, Maanav (Virginia Tech, 2020-09-15)
  In the legal domain, documents of various types are created in connection with a particular case, such as testimony of people, transcripts, depositions, memos, and emails. Deposition transcripts are one such type of legal document; they consist of conversations between the different parties in the legal proceedings, recorded by a court reporter. Court reporting has been traced back to 63 B.C. It has transformed from the initial scripts of "Cuneiform", "Running Script", and "Grass Script" to Certified Access Real-time Translation (CART). Since the boom of digitization, there has been a shift to storing these transcripts in the PDF/A format. Deposition transcripts take the form of question-answer (QA) pairs and can be too lengthy for lay readers to work through, which motivates an automatic text-summarization method for them. Present-day summarization systems do not support text in this form, so the documents must first be parsed to extract the QA pairs along with any relevant supporting information. These QA pairs can then be converted into complete canonical sentences, i.e., into declarative form, from which insights can be extracted and which can be used in downstream tasks. This work investigates that conversion, including the use of deep-learning techniques for such transformations.
- Hybrid Summarization of Dakota Access Pipeline Protests (NoDAPL)
  Chen, Xiaoyu; Wang, Haitao; Mehrotra, Maanav; Chhikara, Naman; Sun, Di (Virginia Tech, 2018-12-14)
  The Dakota Access Pipeline Protests (known by the hashtag #NoDAPL) are grassroots movements that began in April 2016 in reaction to the approved construction of Energy Transfer Partners’ Dakota Access Pipeline in the northern United States. The NoDAPL movements have produced many Facebook messages, tweets, blogs, and news articles, which reflect different aspects of the NoDAPL events. The volume of related information keeps growing rapidly, which makes it difficult to understand the events efficiently. It is therefore valuable to generate short summaries automatically, or at least semi-automatically, from the large volume of data available online. Motivated by this need, the objective of this project is to propose a novel automatic summarization approach that efficiently and effectively summarizes the topics hidden in large online text collections. Although automatic summarization has been investigated for more than 60 years since the publication of Luhn’s seminal 1958 paper, several challenges remain in summarizing large online text sets, such as a large proportion of noisy texts, highly redundant information, and multiple latent topics. We therefore propose an automatic framework that requires minimal human effort to summarize large online text sets (~11,000 documents on NoDAPL) according to latent topics, with irrelevant information removed. This framework provides a hybrid model that combines the advantages of latent Dirichlet allocation (LDA) based extractive methods and deep-learning based abstractive methods. Unlike semi-automatic summarization approaches such as template-based summarization, the proposed method does not require practitioners to have a deep understanding of the events in order to create a template, nor to fill in the template by using regular expressions.
  During the procedure, the only human effort needed is to manually label a few (say, 100) documents as relevant or irrelevant. We evaluate the quality of the generated summary with both extrinsic and intrinsic measurements. In the extrinsic subjective evaluation, we design a set of guideline questions and conduct a task-based measurement. Results show that 91.3% of sentences are within the scope of the guideline, and 69.6% of the outlined questions can be answered by reading the generated summary. The intrinsic ROUGE measurements show that our entity coverage totals 2.6% and that the ROUGE-L and ROUGE-SU4 scores are 0.148 and 0.065, respectively. Overall, the proposed hybrid model achieves decent performance on summarizing NoDAPL events. Future work includes testing the approach on more textual datasets covering other topics of interest, and investigating a combined topic-modeling and supervised-classification approach to minimize human effort in automatic summarization. We would also like to investigate a deep learning-based recommender system for better sentence re-ranking.