Paleontology Topic Trends

Abstract

The purpose of this project was to apply modern data analysis to abstracts published by the Society of Vertebrate Paleontology. The society holds a yearly convention at which members from all over the world gather to present that year's research. Our client, Professor Sterling Nesbit, provided our group with a collection of abstracts dating back to 1987. Our job was to analyze the abstracts from each year and identify trends and patterns across all the years in which the society has published its abstract collections.
The team's method changed over the course of the project. Initially, the team planned to use Latent Dirichlet Allocation (LDA) to summarize the abstracts. LDA would find the topics prevalent in the collection and show the mix of those topics in each abstract. After further discussion with our client, the team decided to provide a more straightforward analysis based on graphing the frequency of terms from vertebrate paleontology hierarchies in the abstracts.
To run the graphing analysis properly, our team first had to scrape and clean the abstracts so that the most useful data was not overlooked. This process began with removing all HTML tags from the abstract text files (which had been converted from PDF). The team then removed common English stop words, which carry little meaning for analysis. The next step was to extend this stop word list with custom entries to account for yearly differences. For example, in some years the Society of Vertebrate Paleontology required its members to refer to the United States as “The United States of America,” while in other years they were required to write “United States.” These small variations required our team to make the stop word elimination specific to each year.
Once the abstracts were scraped and cleaned, the team wrote graphing scripts to produce graphs based on vertebrate paleontology hierarchies. After meeting with our client several times to refine the analysis, we produced the final version of the analysis script. The resulting graphs helped our client visualize patterns in the findings reported by members of the Society of Vertebrate Paleontology. Future work should automate the extraction of abstracts from the convention's PDF collections and provide a way to update the stop word list as the society's requirements change from year to year.
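
The cleaning and counting steps described above can be illustrated with a minimal Python sketch. This is not the project's actual script: the stop word set, the per-year phrase table (including the years 1995 and 2005), the function names, and the example terms are all hypothetical, chosen only to show the idea of year-specific stop word removal followed by a per-year count of how many abstracts mention each term.

```python
import re
from collections import Counter

# Tiny illustrative stop list (assumption); a real run would use a full English list.
BASE_STOP_WORDS = {"the", "of", "a", "an", "and", "in", "from", "is", "are", "to"}

# Hypothetical per-year boilerplate phrases, e.g. required country naming.
YEAR_STOP_PHRASES = {
    1995: ["the united states of america"],
    2005: ["united states"],
}

TAG_RE = re.compile(r"<[^>]+>")   # matches markup tags such as <p> or </b>
WORD_RE = re.compile(r"[a-z]+")   # matches lowercase alphabetic tokens


def clean_abstract(text, year):
    """Strip tags, drop that year's boilerplate phrases, then drop stop words."""
    text = TAG_RE.sub(" ", text).lower()
    for phrase in YEAR_STOP_PHRASES.get(year, []):
        text = text.replace(phrase, " ")
    return [w for w in WORD_RE.findall(text) if w not in BASE_STOP_WORDS]


def count_terms_by_year(abstracts, terms):
    """Count, for each year, how many abstracts mention each term.

    abstracts: iterable of (year, raw_text) pairs.
    terms: lowercase single-word terms taken from a hierarchy of interest.
    """
    counts = {}
    for year, text in abstracts:
        tokens = set(clean_abstract(text, year))
        year_counts = counts.setdefault(year, Counter())
        for term in terms:
            if term in tokens:
                year_counts[term] += 1
    return counts


if __name__ == "__main__":
    sample = [
        (1995, "<b>Theropod</b> tracks from The United States of America"),
        (1995, "A new theropod and a sauropod"),
        (2005, "Sauropod growth in the United States"),
    ]
    for year, year_counts in sorted(count_terms_by_year(sample, ["theropod", "sauropod"]).items()):
        print(year, dict(year_counts))
```

The per-year counts returned by `count_terms_by_year` could then be passed to any plotting library to draw trend lines over time. Matching on single lowercase words is a simplification; multi-word taxon names would need phrase matching instead.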

Description
Keywords
Paleontology, Word Clouds, Topic Analysis, Python, Data Analysis, Graphical Analysis
Citation