Summarizing Legal Depositions
Files
TR Number
Date
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
Documents like legal depositions are used by lawyers and paralegals to ascertain the facts pertaining to a case. These documents capture the conversation between a lawyer and a deponent, which is in the form of questions and answers. Applying current automatic summarization methods to these documents results in low-quality summaries. Though extensive research has been performed in the area of summarization, not all methods succeed in all domains. Accordingly, this research focuses on developing methods to generate high-quality summaries of depositions. As part of our work related to legal deposition summarization, we propose a solution in the form of a pipeline of components, each addressing a sub-problem; we argue that a pipeline based framework can be tuned to summarize documents from any domain. First, we developed methods to parse the depositions, accounting for different document formats. We were able to successfully parse both a proprietary and a public dataset with our methods. We next developed methods to anonymize the personal information present in the deposition documents; we achieve 95% accuracy on the anonymization using a random sampling based evaluation. Third, we developed an ontology to define dialog acts for the questions and answers present in legal depositions. Fourth, we developed classifiers based on this ontology and achieved F1-scores of 0.84 and 0.87 on the public and proprietary datasets, respectively. Fifth, we developed methods to transform a question-answer pair to a canonical/simple form. In particular, based on the dialog acts for the question and answer combination, we developed transformation methods using each of traditional NLP, and deep learning, techniques. We were able to achieve good scores on the ROUGE and semantic similarity metrics for most of the dialog act combinations. Sixth, we developed methods based on deep learning, heuristics, and machine translation to correct the transformed declarative sentences. The sentence correction improved the readability of the transformed sentences. Seventh, we developed a methodology to break a deposition into its topical aspects. An ontology for aspects was defined for legal depositions, and classifiers were developed that achieved an F1-score of 0.89. Eighth, we developed methods to segment the deposition into parts that have the same thematic context. The segments helped in augmenting candidate summary sentences with surrounding context, that leads to a more readable summary. Ninth, we developed a pipeline to integrate all of the methods, to generate summaries from the depositions. We were able to outperform the baseline and state of the art summarization methods in a majority of the cases based on the F1, Recall, and ROUGE-2 scores. The performance gains were statistically significant for all of the scores. The summaries generated by our system can be arranged based on the same thematic context or aspect and hence should be much easier to read and follow, compared to the baseline methods. As part of our future work, we will improve upon these methods. We will refine our methods to identify the important parts using additional documents related to a deposition. In addition, we will work to improve the compression ratio of the generated summaries by reducing the number of unimportant sentences. We will expand the training dataset to learn and tune the coverage of the aspects for various deponent types using empirical methods. Our system has demonstrated effectiveness in transforming a QA pair into a declarative sentence. Having such a capability could enable us to generate a narrative summary from the depositions, a first for legal depositions. We will also expand our dataset for evaluation to ensure that our methods are indeed generalizable, and that they work well when experts subjectively evaluate the quality of the deposition summaries.