Virginia Tech

The Internal Revenue Service (IRS) provides a large volume of data related to tax-exempt organizations by publishing IRS Form 990 tax filings in Extensible Markup Language (XML) format, hosted partly on its website and partly on Amazon Web Services (AWS). These data sources contain filing data beginning in tax year 2012 and ending in tax year 2020, the most recent year filed and uploaded. This defines the project’s study window as 2012-2020. The primary goal of this project is to create a database of Form 990 filings to support research related to tourism offices and various other tax-exempt organizations. The primary challenge of this project is to process filings from all years within the study window and upload them to the database in a unified manner. The development of this database utilizes tools such as Jupyter Notebooks, SQLite, and various Python libraries for scraping, preprocessing, and analysis. Due to the number of different return types and the massive amount of data contained in the forms, understanding the forms in their standard format is incredibly challenging. Additionally, most documentation about 990 forms is oriented to accountants or tax experts who are well versed in financial jargon. This issue extends to the XML data files themselves, as many of the XML tags are heavily abbreviated, and cross-referencing each of them with its corresponding location on Form 990 is a tedious and nearly impossible task. The solution to these problems lies in both archiving the data and making it accessible for use.

This project can be divided into four phases: scraping, preprocessing, uploading, and analysis. The project begins with scraping 990 filings from the two sources highlighted above. The next phase, preprocessing, involves creating a common schema and converting the XML files into Comma Separated Values (CSV) and JavaScript Object Notation (JSON) formats. This is the most difficult and lengthy phase of the project, as it involves understanding the 990 filings to the greatest possible extent through both automated and manual processes. Next is the uploading phase, where the database is built and populated with the preprocessed data. Finally, in the analysis phase, queries are made against the database to extract interesting financial trends. This final phase allows the team to maximize its familiarity with the database and supports the development of the extensive documentation and users’ guide that are included in the Users’ Manual section. The result of this project comes in two forms: the aforementioned database, and a set of CSV data pertaining to the 990 filings of all tourism offices present in the XML data. The database is structured to maximize the breadth and depth of analysis available to the project’s client and other stakeholders. These other stakeholders include fellow researchers of tourism offices and any other business researchers who may be concerned with the financial data of non-profits and tax-exempt organizations. The database contains tables that allow users to access specific data across the Form 990 filings of one organization or of multiple organizations. These tables are complemented by overview data tables, which allow users to locate specific organizations based on the type of business they carry out (such as tourism offices), rather than limiting users to querying based on Form 990 filing data. Finally, per the client's request, all tourism office data is separately output into a set of CSV files.
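The preprocessing, uploading, and analysis phases described above can be illustrated with a minimal sketch, using only the Python standard library. It is not the project's actual code: the sample XML, the tag names, the `FIELD_TAGS` mapping, and the `filings` table are assumptions chosen for illustration, and real filings are far larger and use tag names that differ between schema versions.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Illustrative sample only; not taken from a real filing.
SAMPLE_XML = """<Return xmlns="http://www.irs.gov/efile">
  <ReturnHeader>
    <TaxYr>2019</TaxYr>
    <ReturnTypeCd>990</ReturnTypeCd>
    <Filer><EIN>123456789</EIN></Filer>
  </ReturnHeader>
  <ReturnData>
    <IRS990><CYTotalRevenueAmt>250000</CYTotalRevenueAmt></IRS990>
  </ReturnData>
</Return>"""

# Hypothetical common schema: each column lists candidate tag names,
# since the same field may be tagged differently across years (assumed names).
FIELD_TAGS = {
    "ein": ["EIN"],
    "tax_year": ["TaxYr", "TaxYear"],
    "return_type": ["ReturnTypeCd", "ReturnType"],
    "total_revenue": ["CYTotalRevenueAmt", "TotalRevenueCurrentYear"],
}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def flatten_filing(xml_text):
    """Map one XML filing onto the common schema as a flat dict."""
    root = ET.fromstring(xml_text)
    first_text = {}
    for elem in root.iter():
        name = local_name(elem.tag)
        text = (elem.text or "").strip()
        # Keep the first non-empty occurrence of each tag.
        if text and name not in first_text:
            first_text[name] = text
    return {
        column: next((first_text[t] for t in tags if t in first_text), None)
        for column, tags in FIELD_TAGS.items()
    }


def load_rows(conn, rows):
    """Create the (hypothetical) filings table and insert flattened rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS filings ("
        "ein TEXT, tax_year INTEGER, return_type TEXT, total_revenue INTEGER)"
    )
    conn.executemany(
        "INSERT INTO filings VALUES (:ein, :tax_year, :return_type, :total_revenue)",
        rows,
    )
    conn.commit()


# Preprocess one filing, upload it, and run a simple analysis query.
conn = sqlite3.connect(":memory:")
row = flatten_filing(SAMPLE_XML)
load_rows(conn, [row])
revenue_by_year = conn.execute(
    "SELECT tax_year, SUM(total_revenue) FROM filings GROUP BY tax_year"
).fetchall()
```

Listing several candidate tags per column is one simple way to reconcile schema differences across the study window; a real mapping would need to be built by inspecting the filings for each year.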

990, 990-EZ, Tourism, Database