Team 4: Language Models, Classification and Summarization

dc.contributor.author: Naleshwarkar, Kanad
dc.contributor.author: Bhatambarekar, Gayatri
dc.contributor.author: Desai, Zeel
dc.contributor.author: Kumaran, Aishwarya
dc.contributor.author: Haque, Shadab
dc.contributor.author: Srinivasan Manikandan, Adithya Harish
dc.date.accessioned: 2024-04-25T02:40:14Z
dc.date.available: 2024-04-25T02:40:14Z
dc.date.issued: 2023-12-17
dc.description: Team4ClassificationSummarizationLanguageModelsReport.pdf: final report by Team 4; Team4ClassificationSummarizationLanguageModelsPresentation.pptx: slides for Team 4's final presentation; Team4ClassificationSummarizationLanguageModelsPresentation.pdf: presentation slides in PDF format; Team4ClassificationSummarizationLanguageModelsReportSource.zip: LaTeX source for the report
dc.description.abstract: The CS5604 class at Virginia Tech, under the direction of Dr. Edward A. Fox, has been tasked with developing an information retrieval and analysis system that can handle a collection of at least 500,000 Electronic Theses and Dissertations (ETDs). The system should function as a search engine with a variety of capabilities, including browsing, searching, giving suggestions, and rating search results. The class was split into six teams, each assigned a specific task. This report provides an overview of Team 4's contribution, which focuses on classification, summarization, and language models. Our primary tasks were to evaluate various models for classification and summarization. Over the course of the project, we evaluated the models developed by the previous team working on this task and explored strategies to improve them. For classification, we fine-tuned the SciBERT model to assign standardized subject category labels in accordance with ProQuest. We also evaluated a large language model, LLaMA 2, for the classification task; after comparing its performance with the fine-tuned SciBERT model, we observed that LLaMA 2 was not efficient enough for the large-scale system the class was building. For summarization, we evaluated summaries generated by transformer, non-transformer, and LLM-based models: TextRank, LexRank, LSA, BigBirdPegasus, and LLaMA 2 7B. We observed that although TextRank and BigBirdPegasus produced comparable results, the summaries generated by TextRank were more comprehensive. This experimentation gave us valuable insight into the complexities of processing a large set of documents and performing tasks such as classification and summarization. It also allowed us to explore deploying these models in a production environment to evaluate their performance at scale.
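
To make the two pipelines described in the abstract concrete, the following is a minimal sketch of fine-tuning SciBERT for subject-category classification. It assumes the Hugging Face transformers and datasets libraries; the CSV file names, the label count, and the hyperparameters are hypothetical placeholders, not values from the Team 4 report.

```python
# Hedged sketch: fine-tuning SciBERT for ETD subject-category classification.
# Assumes Hugging Face transformers + datasets; file names, label count, and
# hyperparameters are hypothetical, not taken from the Team 4 report.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "allenai/scibert_scivocab_uncased"
NUM_LABELS = 28  # hypothetical count of ProQuest subject categories

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_LABELS)

# Hypothetical CSVs with a "text" column (ETD abstract) and an integer "label" column.
dataset = load_dataset("csv", data_files={"train": "etd_train.csv",
                                          "validation": "etd_val.csv"})

def tokenize(batch):
    # Truncate abstracts to SciBERT's 512-token input limit.
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="scibert-etd-classifier",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["validation"],
                  tokenizer=tokenizer)  # default collator pads batches dynamically
trainer.train()
```

An equally hedged sketch of extractive summarization with TextRank, one of the five summarizers compared in the report, using the sumy package (the input text and sentence count are illustrative):

```python
# Hedged sketch: extractive TextRank summarization with the sumy package.
# Requires NLTK's "punkt" tokenizer data for the English tokenizer.
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.summarizers.text_rank import TextRankSummarizer

etd_chapter_text = "..."  # placeholder for the document text to summarize

parser = PlaintextParser.from_string(etd_chapter_text, Tokenizer("english"))
summarizer = TextRankSummarizer()

# Keep the five highest-ranked sentences as the extractive summary.
for sentence in summarizer(parser.document, 5):
    print(sentence)
```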
dc.identifier.uri: https://hdl.handle.net/10919/118663
dc.language.iso: en_US
dc.publisher: Virginia Tech
dc.rights: Attribution-NonCommercial 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by-nc/4.0/
dc.subject: Classification
dc.subject: Summarization
dc.subject: Language Models
dc.subject: LLMs
dc.subject: LLaMA
dc.subject: Subject Categories
dc.title: Team 4: Language Models, Classification and Summarization
dc.type: Presentation
dc.type: Report

Files

Original bundle (4 files):
- Team4ClassificationSummarizationLanguageModelsPresentation.pdf (1.53 MB, Adobe Portable Document Format)
- Team4ClassificationSummarizationLanguageModelsPresentation.pptx (887.61 KB, Microsoft PowerPoint XML)
- Team4ClassificationSummarizationLanguageModelsReport.pdf (1.56 MB, Adobe Portable Document Format)
- Team4ClassificationSummarizationLanguageModelsReportSource.zip (2.84 MB)

License bundle (1 file):
- license.txt (1.5 KB): item-specific license agreed to upon submission