Database Creation and Information Extraction from ETDs for CRA-E
Abstract
This project supported the educational activities of the Computing Research Association (CRA-E). Its main goal was to collect data from electronic theses and dissertations (ETDs) to help determine why graduate students in computing go into computing research. The deliverables are a database of the data extracted from the analyzed ETDs and a framework for machine learning and manual approaches to this data extraction. To meet these objectives, ETDs from North Carolina State University (NCSU), Florida State University (FSU), Auburn University (AU), Wake Forest University (WFU), and Virginia Tech (VT) were analyzed and the results were inserted into the database.

The Extensible Markup Language (XML) was chosen as the structuring format for the extracted data because of its prevalence in the ETD field, its structural properties, and its ease of use. A tag structure was created from the biographical, educational, and institutional data in each ETD; tags include author name, title of the paper, year published, and undergraduate institution of the author, among others. These tags were used to create the attributes for each entry in a Microsoft Access database. Access was chosen mostly for convenience and because the tags could be ported into it easily; the database could be moved to another system with little effort. Challenges included missing data or insufficient information in various areas.

The second deliverable took the form of instructions (p. 4 of the report) that could be given to an Amazon Mechanical Turk user on how to extract information. These instructions were created to increase the speed and reduce the errors of manual data extraction.
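As a rough illustration of how one XML tag per attribute can map onto one database column per attribute, the following minimal sketch builds a record and stores it. It is not the project's implementation: the tag names, the `etd` element and table names, and the placeholder values are all assumptions made for this example, and SQLite stands in for Microsoft Access so the code runs without extra software.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical tag names based on the attributes named in the abstract;
# the report's actual tag structure may differ.
FIELDS = ["author_name", "title", "year", "undergraduate_institution"]


def build_record(values):
    """Build an <etd> element with one child per field.

    Fields absent from `values` become empty elements, so every record
    has the same shape even when data is missing.
    """
    etd = ET.Element("etd")
    for field in FIELDS:
        child = ET.SubElement(etd, field)
        child.text = values.get(field)
    return etd


def create_table(conn):
    """Create a table with one text column per tag."""
    columns = ", ".join(f"{field} TEXT" for field in FIELDS)
    conn.execute(f"CREATE TABLE IF NOT EXISTS etd ({columns})")


def insert_record(conn, etd):
    """Insert one XML record as a row; empty elements are stored as NULL."""
    row = [etd.findtext(field) or None for field in FIELDS]
    placeholders = ", ".join("?" for _ in FIELDS)
    conn.execute(f"INSERT INTO etd VALUES ({placeholders})", row)


if __name__ == "__main__":
    # Placeholder values, not data from the project.
    record = build_record(
        {"author_name": "Jane Doe", "title": "An Example Thesis", "year": "2012"}
    )
    print(ET.tostring(record, encoding="unicode"))

    conn = sqlite3.connect(":memory:")
    create_table(conn)
    insert_record(conn, record)
    print(conn.execute("SELECT * FROM etd").fetchall())
```

Because the column list is derived from the same `FIELDS` list as the XML children, moving the data to a different database system mainly means changing the connection, which is in the spirit of the portability noted above.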
It was found that the basic structure of most ETDs is similar and normally follows this approximate order (depending on the institution of origin): title page, table of contents, abstract, actual content, biography, acknowledgements, and résumé (not normally present). Of these, all but the table of contents and the paper itself contain information required for the database. The instructions give the most common location for each tag/attribute and alternate locations (if any were found). They also tell the Mechanical Turk user what to do when data for an attribute is missing.
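The "most common location, then alternates, then a missing-data rule" logic of the instructions can be sketched as a small lookup. The mapping below is purely hypothetical (the actual locations are in the report's instructions, not in this abstract), and `read_value` is an assumed stand-in for whatever does the reading, whether a human worker or an automated extractor.

```python
# Hypothetical mapping from tag to the sections to check, most common first.
# The real locations are listed in the report's instructions.
LOCATIONS = {
    "author_name": ["title page", "abstract"],
    "title": ["title page", "abstract"],
    "year": ["title page"],
    "undergraduate_institution": ["biography", "resume", "acknowledgements"],
}

# Assumed marker recorded when no section yields a value.
MISSING = "MISSING"


def sections_to_check(tag, present_sections):
    """Return the sections to check for `tag`, in priority order,
    skipping sections this ETD does not have."""
    return [s for s in LOCATIONS[tag] if s in present_sections]


def extract(tag, present_sections, read_value):
    """Try each candidate section in order and return the first value found.

    `read_value(section, tag)` is a caller-supplied function that returns
    the value found in that section, or None if nothing is there.
    """
    for section in sections_to_check(tag, present_sections):
        value = read_value(section, tag)
        if value:
            return value
    return MISSING


if __name__ == "__main__":
    # Placeholder content, not data from the project.
    present = {"title page", "abstract", "biography"}
    found = {("biography", "undergraduate_institution"): "Example University"}

    def lookup(section, tag):
        return found.get((section, tag))

    print(extract("undergraduate_institution", present, lookup))
    print(extract("year", present, lookup))
```

Keeping the section order in data rather than in code means the per-institution differences in ETD layout could be handled by swapping the mapping, without changing the extraction loop.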