Database Creation and Information Extraction from ETDs for CRA-E

Files

The schema and data from all ETDs in XML form. Open with Notepad++ or similar. (69.64 KB)

The database (Microsoft Access 2010/2013), which holds the content from all of the ETDs (in table etds-revised) along with the other supporting tables explained in the paper. (2.5 MB)

Final presentation for the project (Microsoft PowerPoint 2010/2013). (328.67 KB)

PDF version of the final presentation. (768.46 KB)

PDF version of the final paper. (820.44 KB)

Date

2013-05-18

Abstract

This project supported the educational activities of the Computing Research Association (CRA-E). Its main goal was to collect data from electronic theses and dissertations (ETDs) to help determine why graduate students in computing go into computing research. The deliverables are a database of the data extracted from the ETDs analyzed and a framework for machine learning and manual approaches to this data extraction. To meet these objectives, ETDs from North Carolina State University (NCSU), Florida State University (FSU), Auburn University (AU), Wake Forest University (WFU), and Virginia Tech (VT) were analyzed, and the results were inserted into the database. The Extensible Markup Language (XML) was chosen as the structuring format for the extracted data because of its prevalence in the ETD field, its structural properties, and its ease of use. A tag structure was created from the biographical, educational, and institutional data in each ETD; the tags include author name, title of the paper, year published, and undergraduate institution of the author, among others. These tags became the attributes of each entry in the Microsoft Access database. Access was chosen mostly for convenience and because the tags could be ported into it easily, but the database could be moved to another system without much difficulty. The main challenges were missing data and insufficient information in various areas. The second deliverable is a set of instructions (pg. 4 in the report) that can be given to an Amazon Mechanical Turk user to explain how to extract the information. These instructions are intended to increase the speed and reduce the errors of manual data extraction.
Most ETDs were found to share a similar basic structure, usually in approximately this order (depending on the institution of origin): title page, table of contents, abstract, actual content, biography, acknowledgements, and resume (not normally present). All of these sections except the table of contents and the paper itself contain information required for the database. The instructions give the most common location for each tag/attribute, along with alternate locations where any were found. They also tell the Mechanical Turk user what to do when the data for an attribute is missing.
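As a rough illustration of the workflow described above (XML tags per ETD becoming attributes of a database entry, with missing data handled explicitly), the following minimal sketch parses a small XML sample and loads it into a table. It makes several assumptions that do not come from the project files: the tag names (etd, author, title, year, undergraduate_institution) and the sample records are hypothetical, and Python's built-in sqlite3 module stands in for Microsoft Access. Only the table name etds-revised comes from the file list.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample data. The tag names are illustrative assumptions,
# not the schema shipped with the project.
SAMPLE_XML = """
<etds>
  <etd>
    <author>Jane Doe</author>
    <title>An Example Dissertation</title>
    <year>2012</year>
    <undergraduate_institution>Example State University</undergraduate_institution>
  </etd>
  <etd>
    <author>John Roe</author>
    <title>Another Example Thesis</title>
    <year>2011</year>
  </etd>
</etds>
"""

# One database attribute per XML tag (assumed names).
FIELDS = ("author", "title", "year", "undergraduate_institution")


def parse_etds(xml_text):
    """Return one dict per <etd> element; absent or empty tags become None."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for etd in root.findall("etd"):
        record = {}
        for field in FIELDS:
            text = etd.findtext(field)
            record[field] = text.strip() if text and text.strip() else None
        records.append(record)
    return records


def load_into_db(records):
    """Create an in-memory SQLite table and insert the records.

    SQLite is a stand-in here; the project itself used Microsoft Access.
    """
    conn = sqlite3.connect(":memory:")
    columns = ", ".join(f"{name} TEXT" for name in FIELDS)
    conn.execute(f'CREATE TABLE "etds-revised" ({columns})')
    placeholders = ", ".join(f":{name}" for name in FIELDS)
    conn.executemany(
        f'INSERT INTO "etds-revised" VALUES ({placeholders})', records
    )
    conn.commit()
    return conn


records = parse_etds(SAMPLE_XML)
conn = load_into_db(records)

# List the entries whose undergraduate institution could not be found.
missing = conn.execute(
    'SELECT author FROM "etds-revised" '
    "WHERE undergraduate_institution IS NULL"
).fetchall()
print(missing)
```

Storing missing values as NULL rather than as an empty string keeps "not found in the ETD" distinguishable from real data, which mirrors the report's emphasis on telling extractors what to do when an attribute is absent.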

Description

Viewing the database (*.accdb) requires Microsoft Access 2010 or 2013. The XML file (*.xml) can be viewed in Notepad or an equivalent text editor. The Microsoft Word document (*.docx) can be opened in Microsoft Word 2013, and the *.pptx file requires Microsoft PowerPoint 2013. All *.pdf files can be opened with Adobe Reader X or later.

Keywords

Amazon Mechanical Turk, CRA, Microsoft Access, Machine learning, XML, ETD, Thesis, Dissertation, Database, Computing, Research, Joseph Luke, Lamont Banks

Persistent link

http://hdl.handle.net/10919/22060

Collections

CS4624: Multimedia, Hypertext, and Information Access
