Browsing by Author "Galad, Andrej"
- ArchiveSpark - MS Independent Study Final Submission. Galad, Andrej (Virginia Tech, 2016-12-13). This project expands upon the work at the Internet Archive of researcher Vinay Goel and of Jefferson Bailey (co-PI on two NSF-funded collaborative projects with Virginia Tech: IDEAL, GETAR) on the ArchiveSpark project, a framework for efficient Web archive access, extraction, and derivation. The main goal of the project is to quantitatively and qualitatively evaluate ArchiveSpark against mainstream Web archive processing solutions and extend it as necessary with regard to the processing of testing collections. This also relates to an IMLS-funded project. This report describes the efforts and contributions made as part of this project. The primary focus of these efforts lies in the comprehensive evaluation of ArchiveSpark against existing archive-processing solutions (pure Apache Spark with pre-installed Warcbase tools and HBase) in a variety of environments and setups, in order to compare the performance improvements that ArchiveSpark provides and to understand the shortcomings and tradeoffs of its use under varying scenarios.
- Are Repositories Impeding Big Data Reuse? Xie, Zhiwu; Galad, Andrej; Chen, Yinlin; Fox, Edward A. (Virginia Tech, 2016-06-14). In this intentionally provocative presentation, we question the scalability of popular digital repositories and whether they are suitable for big data reuse. Are the layers of APIs these repositories have painted over file system primitives necessary? How essential is it for the repository to insist on being the sole manager of the content, and to arrange files in ways that prevent access other than through its own APIs? We explore these questions from the perspective of big data reuse, and describe controlled reuse experiments against Fedora 4 to evaluate the cost of these practices.
- CS6604 Spring 2017 Global Events Team Project. Li, Liuqing; Harb, Islam; Galad, Andrej (Virginia Tech, 2017-05-03). This submission describes the work the Global Events team completed in Spring 2017. It includes the final report and presentation, as well as key relevant materials (source code). Building on the reports and modules created by former teams, the Global Events team established a pipeline for processing Web ARChives supporting the IDEAL and GETAR projects, both funded by NSF. With the Internet Archive’s help, the Global Events team enhanced the Event Focused Crawler to retrieve more relevant webpages (i.e., about school shooting events) in WARC format. ArchiveSpark, an Apache Spark framework that facilitates access to Web archives, was deployed on a stand-alone server, and multiple techniques, such as parsing, Stanford NER, regular expressions, and statistical methods, were leveraged to process and analyze the data and to describe those events. For data visualization, an integrated user interface using Gradle was designed and implemented for trend results, which can be easily used by both CS and non-CS researchers and students. Moreover, new, well-written manuals should make it easier for users and developers to become familiar with ArchiveSpark, Spark, and Scala.
- Solr Project with IDEAL, in CS5604 (Information Storage and Retrieval). Xia, Long; Jiang, Tingting; Galad, Andrej; Maharshi, Shivam (2016-05-04). This submission describes the work of the Solr team as part of the IDEAL project, with the main goal of designing and developing a distributed search infrastructure. It includes the project reports and final presentations, as well as the solutions (configuration files & Java code) developed. The main responsibility of our team was to configure Near Real Time (NRT) Indexing and implement Custom Ranking for tweet and web page collections. The idea behind NRT Indexing is to perform incremental updates from an HBase table into the Solr index, thereby saving time and compute resources. The main motivation behind the Custom Ranking solution is to improve system precision and recall by transforming user queries with the use of the metadata provided by the other teams. The implementation leverages three techniques: Query Expansion, Pseudo Relevance Feedback, and Query Boosting. Throughout the semester we collaborated closely with several other teams, both in gathering requirements and in obtaining the input data.