Database Creation and Information Extraction from ETDs for CRA-E

dc.date.accessioned 2013-05-18T13:49:56Z en
dc.date.available 2013-05-18T13:49:56Z en
dc.date.issued 2013-05-18 en
dc.description Microsoft Access 2010 or 2013 is needed to view the database (*.accdb). An XML file is also provided (*.xml); this can be viewed in Notepad or an equivalent program. The Microsoft Word document (*.docx) can be opened in Microsoft Word 2013. All *.pdf files can be opened with Adobe Reader X and higher. Microsoft PowerPoint 2013 is needed for the *.pptx file. en
dc.description.abstract This project supported the educational activities of the Computing Research Association (CRA-E). Its main goal was to collect data from electronic theses and dissertations (ETDs) in order to help determine why graduate students in computing go into computing research. The deliverables are a database of the data extracted from the analyzed ETDs and a framework for machine learning and manual approaches to this data extraction. To meet these objectives, ETDs from North Carolina State University (NCSU), Florida State University (FSU), Auburn University (AU), Wake Forest University (WFU), and Virginia Tech (VT) were analyzed and the results were inserted into the database. Extensible Markup Language (XML) was chosen as the structuring format for the extracted data, and a tag structure was created from the biographical, educational, and institutional data in each ETD. Tags include author name, title of the paper, year published, and undergraduate institution of the author, among others. XML was chosen because of its prevalence in the ETD field, its structural properties, and its ease of use. These tags were used to create the attributes for each entry in the Microsoft Access database. Access was chosen mostly for convenience and because the tags were easy to port into it; however, the database could be moved to another system quite easily. Challenges included missing data and insufficient information in various areas. The second deliverable took the form of instructions (pg. 4 in the report) that could be given to an Amazon Mechanical Turk user on how to extract the information. These instructions were created to increase speed and reduce errors in manual data extraction.
Most ETDs were found to share a similar basic structure, normally in this approximate order (depending on the institution of origin): title page, table of contents, abstract, actual content, biography, acknowledgements, and resume (not normally present). All of these sections except the table of contents and the paper itself contain information required for the database. The instructions give the most common location for each tag/attribute, along with any alternate locations that were found. They also tell the Mechanical Turk user what to do when data for an attribute is missing. en
dc.identifier.uri http://hdl.handle.net/10919/22060 en
dc.language.iso en en
dc.rights In Copyright en
dc.rights.uri http://rightsstatements.org/vocab/InC/1.0/ en
dc.subject Amazon Mechanical Turk en
dc.subject CRA en
dc.subject Microsoft Access en
dc.subject Machine learning en
dc.subject XML en
dc.subject ETD en
dc.subject Thesis en
dc.subject Dissertation en
dc.subject Database en
dc.subject Computing en
dc.subject Research en
dc.subject Joseph Luke en
dc.subject Lamont Banks en
dc.title Database Creation and Information Extraction from ETDs for CRA-E en
dc.type Dataset en
dc.type Presentation en
dc.type Technical report en
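
The abstract describes an XML tag structure (author name, title, year published, undergraduate institution, and so on) and notes that missing data was a recurring challenge. The following is a minimal Python sketch of how such records might be read, assuming one element per ETD. The tag names, the `UNKNOWN` placeholder, and the sample records are hypothetical illustrations; the actual schema is defined in etd.xml and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names, loosely based on the tags listed in the abstract.
# The real schema lives in etd.xml and may use different names.
FIELDS = ["author", "title", "year", "undergraduate_institution"]

# Made-up sample data for illustration only; not taken from the dataset.
SAMPLE = """
<etds>
  <etd>
    <author>Jane Doe</author>
    <title>An Example Dissertation</title>
    <year>2012</year>
    <undergraduate_institution>Example State University</undergraduate_institution>
  </etd>
  <etd>
    <author>John Roe</author>
    <title>Another Example Thesis</title>
    <year>2011</year>
  </etd>
</etds>
"""


def parse_etds(xml_text, missing="UNKNOWN"):
    """Return one dict per <etd> element, filling absent or empty tags with `missing`."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for etd in root.findall("etd"):
        record = {}
        for field in FIELDS:
            text = etd.findtext(field)  # None when the tag is absent
            record[field] = text.strip() if text and text.strip() else missing
        records.append(record)
    return records


records = parse_etds(SAMPLE)
for record in records:
    print(record)
```

Using an explicit placeholder for missing values keeps every record the same shape, which makes it easier to see which attributes a manual extractor could not find.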

Files

Original bundle
Now showing 1 - 5 of 6
Name: etd.xml
Size: 69.64 KB
Format: Extensible Markup Language
Description: The schema and data from all ETDs in XML form. Open with Notepad++ or similar.
Name: cs4624_database.accdb
Size: 2.5 MB
Format: Unknown data format
Description: The database (Microsoft Access 2010/2013) that holds content from all of the ETDs (in table etds-revised) and all other necessary tables, explained in the paper.
Name: cs4624_craetds_banks_luke_final.pptx
Size: 328.67 KB
Format: Microsoft Powerpoint XML
Description: Final presentation for the project (Microsoft PowerPoint 2010/2013).
Name: cs4624_craetds_banks_luke_final.pdf
Size: 768.46 KB
Format: Adobe Portable Document Format
Description: PDF version of the final presentation.
Name: CS4624_project_paper.pdf
Size: 820.44 KB
Format: Adobe Portable Document Format
Description: PDF version of the final paper.
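
The abstract states that the Access database could be moved to another system quite easily. As a rough illustration of that idea, the sketch below loads ETD records into SQLite using Python's standard library. The table name `etds`, the column names, and the sample rows are hypothetical; they are not the layout of the etds-revised table in cs4624_database.accdb.

```python
import sqlite3

# Hypothetical column names; the real Access table may define different attributes.
COLUMNS = ["author", "title", "year", "undergraduate_institution"]


def create_table(conn):
    """Create a simple all-text table with one column per attribute."""
    cols = ", ".join(f"{name} TEXT" for name in COLUMNS)
    conn.execute(f"CREATE TABLE IF NOT EXISTS etds ({cols})")


def insert_records(conn, records):
    """Insert dict records; attributes absent from a dict are stored as NULL."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    rows = [tuple(record.get(name) for name in COLUMNS) for record in records]
    conn.executemany(f"INSERT INTO etds VALUES ({placeholders})", rows)
    conn.commit()


# Made-up sample rows for illustration only; not taken from the dataset.
sample_records = [
    {"author": "Jane Doe", "title": "An Example Dissertation", "year": "2012",
     "undergraduate_institution": "Example State University"},
    {"author": "John Roe", "title": "Another Example Thesis", "year": "2011"},
]

conn = sqlite3.connect(":memory:")
create_table(conn)
insert_records(conn, sample_records)

# Find entries whose undergraduate institution could not be extracted.
missing = conn.execute(
    "SELECT author FROM etds WHERE undergraduate_institution IS NULL"
).fetchall()
print(missing)
```

Storing missing attributes as NULL, rather than as an empty string, lets a single query list the records that still need manual review.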
License bundle
Now showing 1 - 1 of 1
Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission
Description: