AI Aided Annotation


Human annotation of long documents is an important task in training and evaluating NLP systems. The process generally starts with annotators reading the document in its entirety. Once an annotator has a sufficient grasp of the document, they begin to annotate it: they look for questions the text can answer and write down each question with its answer. In our client’s case, the long documents are electronic theses and dissertations (ETDs), which are often at least 100-150 pages long, making annotation time-consuming and expensive. ETDs are annotated chapter by chapter because content can vary significantly between chapters. The resulting annotations are then used to help evaluate downstream tasks such as summarization, topic modeling, and question answering.

The system aids annotators in creating a Knowledge Base that is rich with topics/keywords and question-answer pairs for each chapter of an ETD. The core of the system is an algorithm known as Maximal Marginal Relevance (MMR). Using MMR with an adjustable lambda value, keywords, and a few other elements, we can select sentences based on their similarity or diversity relative to a collection of sentences. This speeds up the annotation of ETDs by automating the identification of the most relevant sentences. Annotators therefore do not have to sift through an ETD one sentence at a time; instead, they can assemble a comprehensive summary as quickly as the MMR algorithm runs. As a result, annotators can save many hours per ETD, yielding more human-generated annotations in less time.
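To make the role of the lambda value concrete, the following is a minimal sketch of greedy MMR sentence selection, not the project's actual implementation. It assumes bag-of-words cosine similarity and treats the annotator's keywords as a single query string; the names bag_of_words, cosine, and mmr_select, and the default values of k and lam, are hypothetical choices made for illustration. A lambda near 1 favors sentences similar to the keywords, while a lower lambda penalizes sentences that repeat what has already been selected.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; an assumed stand-in for the system's real sentence representation."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def mmr_select(sentences, query, k=3, lam=0.7):
    """Greedily pick up to k sentences, trading relevance to the query against redundancy."""
    query_vec = bag_of_words(query)
    vecs = [bag_of_words(s) for s in sentences]
    selected = []                       # indices already chosen, in selection order
    candidates = list(range(len(sentences)))

    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(vecs[i], query_vec)
            # Similarity to the closest already-selected sentence (0 if none yet).
            redundancy = max((cosine(vecs[i], vecs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)

    return [sentences[i] for i in selected]
```

For example, calling mmr_select(chapter_sentences, "neural networks", k=5, lam=0.3) on a hypothetical list chapter_sentences would return five sentences weighted toward diversity, while lam=1.0 would rank purely by similarity to the keywords.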

The final deliverables are the project itself, a final slideshow presenting our work throughout the semester, a final report, and a video demonstrating how to use our platform. All of these are available on VTechWorks alongside this report. Additionally, the project is being developed on GitHub, making it free and available to the public to fork and modify as they see fit.



Maximal Marginal Relevance, Annotation, Website, AI Aided Annotation, Chapter Annotation