
dc.contributor.author: Wilson, James
dc.contributor.author: Martin, Joseph
dc.contributor.author: Cruz, Rudy
dc.contributor.author: Weiler, Eric
dc.date.accessioned: 2018-05-11T01:10:30Z
dc.date.available: 2018-05-11T01:10:30Z
dc.date.issued: 2018-04-03
dc.identifier.uri: http://hdl.handle.net/10919/83210
dc.description.abstract: The purpose of the project was to run modern data analysis on abstracts published by the Society of Vertebrate Paleontology. The Society holds a yearly convention at which members from all over the world gather and present their studies from that year. Our client, Professor Sterling Nesbit, provided our group with a collection of abstracts dating back to 1987. Our job was to take the abstracts from each year and analyze them for trends and patterns across all the years in which the Society has published abstract collections. The team's method changed over the course of the project. Initially, the team planned to use Latent Dirichlet Allocation (LDA) to summarize the abstracts. This would find the topics prevalent in the collection and show the mix of those topics in each abstract. After further discussion with our client, the team decided to provide a more straightforward analysis based on graphing hierarchies in the abstracts. To run the graphing analysis properly, our team first had to scrape the abstracts so that the most useful data was not overlooked. The scraping process began with removing all hypertext markup tags from the abstract text files (which were converted from PDF). The team then eliminated English stop words from the text files, removing words that are generally not needed for analysis. The next step was to customize this list of stop words and add to it, based on yearly differences. For example, in some years the Society required its members to refer to the United States as “The United States of America” in their abstracts, while in other years they were required to use “United States.” These slight changes required our team to make stop word elimination specific to each year. Once the scraping was done, the team created graphing scripts to produce graphs based on vertebrate paleontology hierarchies. After meeting with our client multiple times to further refine our analysis, we created the final version of the analysis script. These graphs helped our client visualize the patterns in findings made by the Society of Vertebrate Paleontology. The project should be developed further to extract abstracts automatically from the convention’s PDF collection and to update the stop words according to the Society’s yearly modifications. [en_US]
dc.language.iso: en_US [en_US]
dc.publisher: Virginia Tech [en_US]
dc.rights: CC0 1.0 Universal [*]
dc.rights.uri: http://creativecommons.org/publicdomain/zero/1.0/ [*]
dc.subject: Paleontology [en_US]
dc.subject: Word Clouds [en_US]
dc.subject: Topic Analysis [en_US]
dc.subject: Python [en_US]
dc.subject: Data Analysis [en_US]
dc.subject: Graphical Analysis [en_US]
dc.title: Paleontology Topic Trends [en_US]
dc.type: Dataset [en_US]
dc.type: Presentation [en_US]
dc.type: Report [en_US]
dc.type: Software [en_US]
dc.description.notes: Files and descriptions:
1. PaleontologyTopicTrendsReport.pdf: our main report.
2. PaleontologyTopicTrendsPresentation.pptx: the final presentation of our report.
3. PaleontologyTopicTrendsFigures.zip: a zip file containing the figures generated in our project.
4. PaleontologyTopicTrendsOther.zip: a zip file containing our script files, the skeleton directory for the abstracts, and some examples of our code and its usage.
5. PaleontologyTopicTrendsReport.docx: our main report in .docx form.
6. PaleontologyTopicTrendsPresenation.pdf: our final presentation in .pdf form.
[en_US]
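
The cleaning steps described in the abstract (tag removal, stop word elimination, year-specific stop words) might be sketched in Python roughly as follows. This is a minimal illustration, not the team's actual script: the stop word list, the per-year phrase table, the years in it, and the function names `clean_abstract` and `term_counts_by_year` are all assumptions made for the example.

```python
import re
from collections import Counter

# Small illustrative stop word list (assumption); the project's real list
# is not given in this record.
BASE_STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "was", "from"}

# Hypothetical per-year additions, e.g. country phrasing that the abstract
# says varied from year to year. The years here are made up.
YEAR_STOP_PHRASES = {
    1995: ["united states of america"],
    2005: ["united states"],
}

TAG_RE = re.compile(r"<[^>]+>")   # matches markup tags such as <p> or </b>
WORD_RE = re.compile(r"[a-z]+")   # matches runs of lowercase letters


def clean_abstract(text, year):
    """Strip markup tags, year-specific phrases, and stop words; return tokens."""
    text = TAG_RE.sub(" ", text).lower()
    for phrase in YEAR_STOP_PHRASES.get(year, []):
        text = text.replace(phrase, " ")
    return [w for w in WORD_RE.findall(text) if w not in BASE_STOP_WORDS]


def term_counts_by_year(abstracts):
    """Take (year, raw_text) pairs and return a dict mapping year to a Counter of terms."""
    counts = {}
    for year, text in abstracts:
        counts.setdefault(year, Counter()).update(clean_abstract(text, year))
    return counts
```

Per-year counts produced this way could then be passed to a plotting script to chart how often a given term appears over time.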



License: CC0 1.0 Universal