
Xpantrac Connection with IDEAL

Abstract

Title: Integrating Xpantrac into the IDEAL software suite, and applying it to identify topics for IDEAL webpages

Identifying topics is useful because it allows us to easily understand what a document is about. If we organize documents into a database, we can then search through those documents using their identified topics.

Previously, our client, Seungwon Yang, developed Xpantrac, an algorithm for identifying topics in a given webpage. The algorithm is based on, and named after, the Expansion-Extraction approach. In the first part (expansion), the text of a document is given to Xpantrac as input and is expanded into relevant information using a search engine. In the second part (extraction), the topics in each document are identified, or extracted. In his prototype, Yang used a standard data set, a collection of one thousand New York Times articles, as the search database.
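The two parts described above can be illustrated with a small sketch. This is not Yang's implementation: the in-memory corpus, the term-overlap ranking, the stopword list, and the document-frequency scoring are all assumptions chosen to keep the example self-contained, and every function name here is hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "on"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def expand(text, corpus, limit=3):
    """Expansion: use the input text as a query against a search collection.

    A real system would call a search engine; here we rank the documents of
    an in-memory list by how many query terms they share with the input.
    """
    query_terms = set(tokenize(text))
    scored = []
    for doc in corpus:
        overlap = len(query_terms & set(tokenize(doc)))
        if overlap > 0:
            scored.append((overlap, doc))
    # Stable sort: documents with equal overlap keep their corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:limit]]


def extract_topics(expanded_docs, top_n=3):
    """Extraction: pick the terms that occur in the most expanded documents."""
    counts = Counter()
    for doc in expanded_docs:
        counts.update(set(tokenize(doc)))  # count each term once per document
    # Break ties alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]
```

Calling `extract_topics(expand(text, corpus))` chains the two parts. Swapping the body of `expand` for a call to a different search backend leaves the extraction step unchanged, which is the property the project relied on when changing search databases.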

For our CS4624 capstone project, our group was asked to modify Yang’s algorithm to search through IDEAL documents in Apache Solr. To accomplish this, we first set up a Solr instance and became familiar with it. Next, we replaced the prototype’s database with the Yahoo Search API to understand how Xpantrac would work with a live search engine. Then we indexed a set of IDEAL documents into Solr and replaced the Yahoo Search API with Solr. However, the number of documents we had indexed was far too small. In the end, we used Yang’s Wikipedia collection in Solr instead. This collection contains approximately 4.2 million documents and is still growing.
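As a rough illustration of what replacing a web search API with Solr involves, the sketch below builds a query URL for Solr's standard `select` request handler. The host, the core name `wikipedia`, and the helper name are assumptions made for this example, not details of the project's actual configuration.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, core, terms, rows=10):
    """Build a URL that asks a Solr core for documents matching the terms.

    base_url and core are placeholders: substitute the address and core
    name of your own Solr instance.
    """
    params = {
        "q": " ".join(terms),  # query string built from the input terms
        "rows": rows,          # maximum number of documents to return
        "wt": "json",          # ask Solr for a JSON response
    }
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Example with an assumed local instance and a hypothetical core name.
url = build_solr_query_url("http://localhost:8983", "wikipedia",
                           ["hurricane", "flooding"], rows=5)
```

The resulting URL could then be fetched with any HTTP client, and the returned documents passed on to the extraction step.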

We were unable to connect Xpantrac to the IDEAL collection in Solr. This issue is discussed in detail later (along with a future solution). Therefore, our deliverable is Xpantrac for Yang’s Wikipedia collection in Solr along with an evaluation of the extracted topics.

Description

The files provided include our midterm presentation (PowerPoint, PDF), final presentation (PowerPoint, PDF), final report (Word, PDF), and a zipped file containing all of the code used to run the various versions of Xpantrac. We would also like to acknowledge our client, Seungwon Yang, and NSF IIS-1319578: Integrated Digital Event Archiving and Library (IDEAL) for supporting and aiding in the completion of this project.

Keywords

xpantrac, expansion, extraction, Solr, IDEAL, topics

Citation