
Xpantrac Connection with IDEAL

dc.contributor.author: Neidig, Sloane [en]
dc.contributor.author: Johnson, Samantha [en]
dc.contributor.author: Cabrera, David [en]
dc.contributor.author: Hoffman, Erika [en]
dc.date.accessioned: 2014-05-09T21:22:43Z [en]
dc.date.available: 2014-05-09T21:22:43Z [en]
dc.date.issued: 2014-05-09 [en]
dc.description: The files provided include our midterm presentation (PowerPoint, PDF), final presentation (PowerPoint, PDF), final report (Word, PDF), and a zipped file containing all of the code used to run the various versions of Xpantrac. We would also like to acknowledge our client, Seungwon Yang, and NSF IIS - 1319578: Integrated Digital Event Archiving and Library (IDEAL) for supporting and aiding in the completion of this project. [en]
dc.description.abstract: Title: Integrating Xpantrac into the IDEAL software suite, and applying it to identify topics for IDEAL webpages. Identifying topics is useful because it lets us easily understand what a document is about. If we organize documents into a database, we can then search through those documents using their identified topics. Previously, our client, Seungwon Yang, developed Xpantrac, an algorithm for identifying topics in a given webpage. The algorithm is based on, and named after, the Expansion-Extraction approach. In the expansion step, the text of a document is used as input to Xpantrac and is expanded into relevant information using a search engine. In the extraction step, the topics in each document are identified, or extracted. In his prototype, Yang used a standard data set, a collection of one thousand New York Times articles, as a search database. As our CS4624 capstone project, our group was asked to modify Yang’s algorithm to search through IDEAL documents in Apache Solr. To accomplish this, we set up and became familiar with a Solr instance. Next, we replaced the prototype’s database with the Yahoo Search API to understand how it would work with a live search engine. Then we indexed a set of IDEAL documents into Solr and replaced the Yahoo Search API with Solr. However, the number of documents we had indexed was far too small. In the end, we used Yang’s Wikipedia collection in Solr instead. This collection has approximately 4.2 million documents and counting. We were unable to connect Xpantrac to the IDEAL collection in Solr. This issue is discussed in detail later, along with a future solution. Therefore, our deliverable is Xpantrac for Yang’s Wikipedia collection in Solr, along with an evaluation of the extracted topics. [en]
dc.description.sponsorship: Seungwon Yang (syang20@gmu.edu) [en]
dc.description.sponsorship: NSF IIS - 1319578: Integrated Digital Event Archiving and Library (IDEAL) [en]
dc.identifier.uri: http://hdl.handle.net/10919/47941 [en]
dc.language.iso: en_US [en]
dc.rights: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States [en]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/3.0/us/ [en]
dc.subject: xpantrac [en]
dc.subject: expansion [en]
dc.subject: extraction [en]
dc.subject: Solr [en]
dc.subject: IDEAL [en]
dc.subject: topics [en]
dc.title: Xpantrac Connection with IDEAL [en]
dc.type: Presentation [en]
dc.type: Technical report [en]
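
The two steps named in the abstract, expansion and then extraction, can be illustrated with a minimal sketch. This is not Yang's Xpantrac code and not the code in project.zip. It assumes a small in-memory list of strings in place of the Solr collection, ranks documents by shared terms as a stand-in for a search engine, and picks topics by plain term frequency. All function names, the stopword list, and the sample corpus are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stopword list, kept tiny for illustration.
STOPWORDS = {"a", "an", "and", "after", "in", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and return its alphabetic words, minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def search(query_terms, corpus, rows=2):
    """Stand-in for a search engine: rank documents by shared-term count."""
    scored = []
    for doc in corpus:
        overlap = len(set(query_terms) & set(tokenize(doc)))
        if overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:rows]]


def expand(text, corpus, rows=2):
    """Expansion step: use the input text as a query and collect related documents."""
    return search(tokenize(text), corpus, rows)


def extract_topics(text, expanded_docs, top_n=3):
    """Extraction step (assumed here to be term frequency over input plus expansion)."""
    counts = Counter(tokenize(text))
    for doc in expanded_docs:
        counts.update(tokenize(doc))
    return [term for term, _ in counts.most_common(top_n)]


if __name__ == "__main__":
    corpus = [
        "Hurricane Sandy flooding damaged coastal homes",
        "Flooding after the hurricane closed roads",
        "The bakery sells fresh bread",
    ]
    page_text = "hurricane flooding report"
    related = expand(page_text, corpus)
    print(extract_topics(page_text, related, top_n=2))
```

In the project described above, the search function would instead query a live backend (the Yahoo Search API, then Solr); the sketch only shows how the output of the expansion step feeds the extraction step.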

Files

Original bundle
Now showing 1 - 5 of 7

Name: project.zip
Size: 3.1 MB
Format: Unknown data format
Description: A zipped file containing all of the necessary code to run the various versions of Xpantrac (Yahoo Search API, Solr)

Name: Xpantrac Final Presentation.pdf
Size: 563.36 KB
Format: Adobe Portable Document Format
Description: PDF version of the Xpantrac final presentation

Name: Xpantrac Final Presentation.pptx
Size: 988.61 KB
Format: Microsoft Powerpoint XML
Description: Original (PowerPoint) version of the Xpantrac final presentation

Name: Xpantrac Midterm Presentation.pdf
Size: 550.96 KB
Format: Adobe Portable Document Format
Description: PDF version of the Xpantrac midterm presentation

Name: Xpantrac Midterm Presentation.pptx
Size: 1.11 MB
Format: Microsoft Powerpoint XML
Description: Original (PowerPoint) version of the Xpantrac midterm presentation

License bundle
Now showing 1 - 1 of 1

Name: license.txt
Size: 1.5 KB
Format: Item-specific license agreed upon to submission