US State Tourism

Abstract

Each state in the United States runs its own tourism website to attract visitors. These sites typically highlight the state's major attractions and include travel tips, facts about the location, blog posts, ratings from past travelers, and other information that may draw potential tourists. The websites are funded by occupancy taxes, a state tax paid whenever an individual stays in a hotel or visits an attraction in that state. In effect, the websites are paid for by past tourists and exist to attract new ones.

Funding for future state tourism promotion depends on how many previous tourists visited the state and paid the occupancy tax, so researchers need to determine which elements of a tourism website are most effective in attracting visitors. This can be done by examining past versions of tourism websites and looking for patterns that indicate what worked well and what didn't; those patterns can then inform decisions about future site content. Our client, Dr. Florian Zach of the Howard Feiertag Department of Hospitality and Tourism Management, plans to use our team's historical analysis to further his research on trends in state tourism website content. Different iterations of each state tourism website are stored as snapshots on the Internet Archive and can be accessed to see how a site changed over time. Our team was given Parquet files of these snapshots for California, Colorado, and Virginia, dating back to 1998. The goal of the project was to assist Dr. Zach by using these Parquet files to perform data extraction and visualization of tourism website patterns, an approach that can be expanded to other states' tourism websites in the future.

We used a combination of Python's Pandas library, Jupyter Notebook, and BeautifulSoup to examine the Parquet files and extract relevant pieces of data. The extracted data was sorted into categories, each with its own designated folder: raw text; images; background colors and background images; internal and external links; and meta tags. With the data organized this way, we can identify specific patterns, such as which background color was used most often. With the data extraction and visualization portions of the project complete, we hope to pass the work on to future teams so they can expand it to the remaining states.
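As a concrete illustration of this workflow, the sketch below loads one state's snapshot file with Pandas (which reads Parquet via pyarrow), parses each archived page with BeautifulSoup, tallies body background colors, and charts the most common ones with Matplotlib. The file name and the "content" column assumed to hold the raw HTML are hypothetical; the actual schema of the Internet Archive Parquet files may differ.

    from collections import Counter

    import matplotlib.pyplot as plt
    import pandas as pd
    from bs4 import BeautifulSoup

    # Hypothetical schema: each row is one archived page snapshot, with the
    # raw HTML stored in a column named "content" (actual names may differ).
    df = pd.read_parquet("virginia_snapshots.parquet")  # reads via pyarrow

    background_colors = Counter()
    for html in df["content"].dropna():
        soup = BeautifulSoup(html, "html.parser")
        body = soup.find("body")
        if body is None:
            continue
        # Older sites often set the page background via the bgcolor attribute.
        color = body.get("bgcolor")
        if color:
            background_colors[color.lower()] += 1

    # Visualize the most common background colors across all snapshots.
    colors, counts = zip(*background_colors.most_common(10))
    plt.bar(colors, counts)
    plt.xlabel("Background color")
    plt.ylabel("Snapshot count")
    plt.title("Most common body background colors")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()

The same loop structure extends to the other extraction categories (images, internal and external links, meta tags) by swapping the BeautifulSoup query and writing each category's results to its designated folder.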

Description
Keywords
Python, Data Analytics, Visualization, BeautifulSoup, pyarrow, Jupyter Notebook, Matplotlib, Tourism, Web scraping
Citation