SeerSuite at Virginia Tech

Abstract

Problem Statement:
A digital library has computer-managed collections stored in digital formats. Although digital libraries have been researched and developed since 1991, there has been little research on multilingual digital libraries. This project investigates the anticipated needs for digital library infrastructure that supports multilingual information, and how best to proceed for both Arabic and English digital content. It also recommends and assembles the necessary tools, SeerSuite and its dependencies, to establish a digital library for the crawled data, with results displayed through a web interface.

Project Procedures:
· First, we needed a working Linux machine ready for a SeerSuite installation. We started with Ubuntu, moved to Red Hat, and after many problems with Java ultimately switched to CentOS 6.3. Red Hat is recommended by the SeerSuite developers, but we did not have a valid license, so Java would not work.
· After installing the operating system, we configured its dependencies. We installed Java, Perl, and MySQL and set the variables that give the system access to these resources.
· We then installed the SeerSuite dependencies: Apache Tomcat, Apache Solr, Apache Ant, and Apache Axis2. All of these are required before installing the SeerSuite package.
· Once all the dependencies were working, after many weeks of fighting with Solr, we began installing SeerSuite. The installation went fairly well, and at that point we had the web interface, CiteSeerX, running.
· The next step was to import pre-parsed data. We tried for many days to get the data imported and searchable, but ultimately could not make it searchable.

Results:
· We were unable to do much research on multilingual support, since we could not get any data imported or searchable.
· We learned that installing SeerSuite is quite difficult. It would have been valuable to have Steve Carman's help, but his help was limited and less frequent than we would have liked. Because our emails went unanswered while we ran into issue after issue, the installation was much more painful than it should have been.
· Our machine was compromised toward the end of the project, and we had to start from scratch. We worked day and night for many days and restored the machine to a working state, which shows that we have learned how to install these programs. This document lays out the plan for repeating our work.

Conclusion:
With more time, and a responsive contact knowledgeable about SeerSuite, we would have been able to complete this installation. We put a great deal of effort into it and documented every issue we encountered along with its solution. With this new documentation, we are confident that future teams will be able to complete our work more easily.
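As an illustration of the prerequisite steps in the project procedures, the following is a minimal sketch of a check a future team might run before attempting the SeerSuite installation. It is not part of SeerSuite. The command names and the environment variable names (JAVA_HOME, CATALINA_HOME, ANT_HOME, AXIS2_HOME) are assumptions about a typical setup and should be adjusted to the actual machine.

```python
import os
import shutil

# Assumed command names and environment variables for the prerequisites
# named in the report; these are illustrative, not taken from SeerSuite.
REQUIRED_COMMANDS = ["java", "perl", "mysql", "ant"]
REQUIRED_ENV_VARS = ["JAVA_HOME", "CATALINA_HOME", "ANT_HOME", "AXIS2_HOME"]


def find_missing(commands=REQUIRED_COMMANDS, env_vars=REQUIRED_ENV_VARS,
                 which=shutil.which, environ=None):
    """Return the commands not found on PATH and the env vars that are unset."""
    environ = os.environ if environ is None else environ
    missing_commands = [c for c in commands if which(c) is None]
    missing_env_vars = [v for v in env_vars if not environ.get(v)]
    return {"commands": missing_commands, "env_vars": missing_env_vars}


if __name__ == "__main__":
    missing = find_missing()
    for name in missing["commands"]:
        print(f"command not found on PATH: {name}")
    for name in missing["env_vars"]:
        print(f"environment variable not set: {name}")
    if not missing["commands"] and not missing["env_vars"]:
        print("all assumed prerequisites look present")
```

Passing `which` and `environ` as parameters keeps the check easy to exercise without touching the real system.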

Description
SeerSuite is a tool for indexing and searching documents, developed by Penn State. We set out to use it for multilingual research, for example indexing documents in Arabic in Qatar and documents in English here, and then searching all of them with either English or Arabic queries. We installed it on a CentOS machine at Virginia Tech and planned to port it to a machine in Qatar once our installation was finalized. Both machines would then work the same way and support indexing and searching of multilingual documents and queries.
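To make the intended search scenario concrete, here is a minimal sketch of how an English or Arabic query could be turned into a request URL for Solr's standard select handler. The parameters `q`, `rows`, and `wt` are standard Solr query parameters. The base URL, including the host, port, and the core name `citeseerx`, is a hypothetical placeholder and not taken from our installation.

```python
from urllib.parse import urlencode

# Hypothetical Solr location; host, port, and core name are assumptions.
ASSUMED_SOLR_BASE = "http://localhost:8080/solr/citeseerx"


def build_select_url(query, base_url=ASSUMED_SOLR_BASE, rows=10):
    """Build a Solr select URL; non-ASCII text is percent-encoded as UTF-8."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


if __name__ == "__main__":
    # The same function handles an English and an Arabic query.
    print(build_select_url("digital library"))
    print(build_select_url("مكتبة رقمية"))
```

Because `urlencode` encodes strings as UTF-8 by default, the same function serves queries in both languages. Whether the index returns useful results for Arabic text depends on how the Solr schema analyzes each field, which this sketch does not address.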
Keywords
Crawling, CiteSeerX, Search, Multilingual, Arabic, English, Solr, Axis2, MySQL, Index, Data, Documents, SeerSuite
Citation