Xpantrac Connection with IDEAL
A zipped file containing all of the code needed to run the various versions of Xpantrac (Yahoo Search API, Solr) (3.103 MB)
Title: Integrating Xpantrac into the IDEAL software suite, and applying it to identify topics for IDEAL webpages

Abstract: Identifying topics is useful because it lets us quickly understand what a document is about. If we organize documents into a database, we can then search those documents by their identified topics. Previously, our client, Seungwon Yang, developed Xpantrac, an algorithm for identifying topics in a given webpage. The algorithm is based on, and named after, the Expansion-Extraction approach. In the expansion phase, the text of a document is used as input to Xpantrac and expanded into relevant information using a search engine. In the extraction phase, the topics of the document are identified, or extracted. In his prototype, Yang used a standard data set, a collection of one thousand New York Times articles, as the search database. As our CS4624 capstone project, our group was asked to modify Yang's algorithm to search IDEAL documents in Apache Solr. To accomplish this, we first set up and became familiar with a Solr instance. Next, we replaced the prototype's database with the Yahoo Search API to understand how Xpantrac would work with a live search engine. We then indexed a set of IDEAL documents into Solr and replaced the Yahoo Search API with Solr. However, the number of documents we had indexed was far too small, so in the end we used Yang's Wikipedia collection in Solr instead; this collection contains approximately 4.2 million documents and counting. We were unable to connect Xpantrac to the IDEAL collection in Solr; this issue is discussed in detail later, along with a future solution. Our deliverable is therefore Xpantrac running against Yang's Wikipedia collection in Solr, together with an evaluation of the extracted topics.
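The two-phase Expansion-Extraction approach described above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the project's actual code: the search backend (the Yahoo Search API or a Solr collection) is abstracted as a pluggable `search_fn`, and the tokenizer, stopword list, and frequency-based extraction are simplifications of whatever Xpantrac actually does.

```python
import re
from collections import Counter

# Hypothetical stopword list; the real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}

def extract_topics(texts, num_topics=5):
    """Extraction phase (sketch): rank terms by frequency across the
    expanded corpus and return the most common ones as topics."""
    counts = Counter()
    for text in texts:
        for term in re.findall(r"[a-z]+", text.lower()):
            if term not in STOPWORDS and len(term) > 2:
                counts[term] += 1
    return [term for term, _ in counts.most_common(num_topics)]

def identify_topics(document_text, search_fn, num_topics=5, rows=10):
    """Expansion phase (sketch): query a search backend with the document
    text to obtain related documents, then extract topics from them."""
    expanded_texts = search_fn(document_text, rows)
    return extract_topics(expanded_texts, num_topics)
```

In the project, `search_fn` would issue a query to a live backend (first the Yahoo Search API, later a Solr `select` request against the Wikipedia collection) and return the text of the top hits; swapping backends only changes this one function.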