Browsing by Author "Chandrasekar, Prashant"
- Continuously Extensible Information Systems: Extending the 5S Framework by Integrating UX and Workflows
  Chandrasekar, Prashant (Virginia Tech, 2021-06-11)
  In Virginia Tech's Digital Library Research Laboratory, we support subject-matter experts (SMEs) in their pursuit of research goals. Their goals include everything from data collection to analysis to reporting. Their research commonly involves an analysis of an extensive collection of data such as tweets or web pages. Without support -- such as by our lab, developers, or data analysts/scientists -- they would undertake the data analysis themselves, using available analytical tools, frameworks, and languages. Then, to extract and produce the information needed to achieve their goals, the researchers/users would need to know what sequences of functions or algorithms to run with such tools, after considering all of their extensive functionality. Our research addresses these problems directly by designing a system that lowers these information barriers. Our approach is broken down into three parts. In the first two parts, we introduce a system that supports discovery of both information and supporting services. In the first part, we describe the methodology that incorporates User eXperience (UX) research into the process of workflow design. Through the methodology, we capture (a) the different user roles and goals, (b) how we break down the user goals into tasks and sub-tasks, and (c) the functions and services required to solve each (sub-)task. In the second part, we identify and describe key components of the infrastructure implementation. This implementation captures the various goal/task/service associations in a manner that supports information inquiry of two types: (1) given an information goal as a query, what is the workflow to derive this information? and (2) given a data resource, what information can we derive using this data resource as input? We demonstrate both parts of the approach, describing how we teach and apply the methodology, with three case studies. In the third part of this research, we rely on formalisms used in describing digital libraries to explain the components that make up the information system. The formal description serves as a guide to support the development of information systems that generate workflows to support SME information needs. We also specifically describe an information system meant to support information goals that relate to Twitter data.
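A minimal sketch of the two inquiry types described in this abstract, assuming a simple in-memory registry of goals, workflows, and services; every goal, service, and resource name below is hypothetical and not drawn from the dissertation's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Service:
    name: str
    inputs: set       # data resources the service consumes
    outputs: set      # information products the service produces

@dataclass
class Goal:
    description: str
    workflow: list = field(default_factory=list)   # ordered list of Services

# Hypothetical goal/task/service associations.
REGISTRY = [
    Goal("Identify trending hashtags in a tweet collection",
         workflow=[Service("clean_tweets", {"tweets"}, {"clean_tweets"}),
                   Service("count_hashtags", {"clean_tweets"}, {"hashtag_counts"})]),
    Goal("Summarize the main topics of a web page collection",
         workflow=[Service("clean_pages", {"web_pages"}, {"clean_pages"}),
                   Service("topic_model", {"clean_pages"}, {"topics"})]),
]

def workflow_for(goal_query):
    """Inquiry type 1: given an information goal, return the workflow that derives it."""
    for goal in REGISTRY:
        if goal_query.lower() in goal.description.lower():
            return [service.name for service in goal.workflow]
    return None

def derivable_from(resource):
    """Inquiry type 2: given a data resource, what information can be derived from it?"""
    products = set()
    for goal in REGISTRY:
        if any(resource in service.inputs for service in goal.workflow):
            products.update(goal.workflow[-1].outputs)
    return products

print(workflow_for("trending hashtags"))   # ['clean_tweets', 'count_hashtags']
print(derivable_from("tweets"))            # {'hashtag_counts'}
```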
- Integrated Digital Library System for Long Documents and their Elements
  Chekuri, Satvik; Chandrasekar, Prashant; Banerjee, Bipasha; Park, Sung Hee; Masrourisaadat, Nila; Ahuja, Aman; Ingram, William A.; Fox, Edward A. (ACM, 2023)
  We describe a next-generation integrated Digital Library (DL) system that addresses the numerous goals associated with long documents such as Electronic Theses and Dissertations (ETDs). Our extensible workflow-centric design supports a variety of users/personas (e.g., researchers, curators, and experimenters) who can benefit from improved access to ETDs and the content buried therein. Our approach leverages natural language processing, deep learning, information retrieval, and software engineering methods. The services cover ingesting, storing, curating, analyzing, detecting, extracting, classifying, summarizing, topic modeling, browsing, searching, retrieving, recommending, visualizing/reporting, and interacting with ETDs and derivative text/image-based elements/objects. Workflows connect the services and their APIs, along with UI-based access. We believe our approach can guide others to combine tailored user support, research, and education by way of extensible DLs.
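A minimal sketch of the workflow-centric idea, assuming services can be modeled as plain Python callables chained per persona; the service names and behaviors below are illustrative only, not the system's actual APIs:

```python
def ingest(etd_id):
    # e.g., fetch the ETD's full text and metadata from the repository
    return {"id": etd_id, "text": "Chapter 1. Introduction. Chapter 2. Methods."}

def segment(doc):
    # e.g., split the long document into chapter-level elements
    doc["chapters"] = [c.strip() for c in doc["text"].split(".") if c.strip()]
    return doc

def summarize(doc):
    # e.g., produce a short summary from the extracted elements
    doc["summary"] = "; ".join(doc["chapters"][:2])
    return doc

# A workflow is an ordered composition of services; different personas
# (researchers, curators, experimenters) would select different compositions.
RESEARCHER_WORKFLOW = [ingest, segment, summarize]

def run(workflow, etd_id):
    result = etd_id
    for service in workflow:
        result = service(result)
    return result

print(run(RESEARCHER_WORKFLOW, "etd-12345")["summary"])   # Chapter 1; Introduction
```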
- Reducing Noise for IDEAL
  Wang, Xiangwen; Chandrasekar, Prashant (2015-05-12)
  The corpora for which we are building an information retrieval system consist of tweets and web pages (extracted from URL links that might be included in the tweets) that have been selected based on rudimentary string matching provided by the Twitter API. As a result, the corpora are inherently noisy and contain a lot of irrelevant information. This includes non-English documents, off-topic articles, and noise within documents such as stop words, whitespace characters, non-alphanumeric characters, icons, broken links, HTML/XML tags, scripting code, CSS style sheets, etc. In our attempt to build an efficient information retrieval system for events, through Solr, we are devising a matching system for the corpora by adding various facets and other properties to serve as dimensions for each document. These dimensions function as additional criteria that enhance the matching, and thereby the retrieval mechanism, of Solr. They are metadata from classification, clustering, named entities, topic modeling, and social graph scores implemented by other teams in the class. It is of utmost importance that each of these initiatives is precise, to ensure the enhancement of the matching and retrieval system. The quality of their work depends directly or indirectly on the quality of the data provided to them. Noisy data will skew the results, and each team would need to perform additional tasks to get rid of it prior to executing their core functionalities. It is our role and responsibility to remove irrelevant content, or “noisy data”, from the corpora. For both tweets and web pages, we cleaned entries written in English and discarded the rest. For tweets, we first extracted user handle information, URLs, and hashtags. We cleaned up the tweet text by removing non-ASCII character sequences and standardized the text using case folding, stemming, and stop word removal. For the scope of this project, we considered cleaning only HTML-formatted web pages and entries written in plain-text file format. All other entries (or documents), such as videos, images, etc., were discarded. For the “valid” entries, we extracted the URLs within the web pages to enumerate the outgoing links. Using the Python package readability, we removed advertisement, header, and footer content. We organized the remaining content and extracted the article text using another Python package, beautifulsoup4. We completed the cleanup by standardizing the text through removal of non-ASCII characters, stemming, stop word removal, and case folding. As a result, 14 tweet collections and 9 web page collections were cleaned and indexed into Solr for retrieval.
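A condensed sketch of the cleaning steps described above, using the packages the report names (readability, beautifulsoup4) together with NLTK for stemming and stop word removal; the exact regular expressions and ordering are assumptions, not the project's original code:

```python
import re
from bs4 import BeautifulSoup
from readability import Document          # pip install readability-lxml
from nltk.corpus import stopwords         # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer

STEMMER = PorterStemmer()
STOPWORDS = set(stopwords.words("english"))

def clean_tweet(text):
    # Extract entities first, then strip them and any non-ASCII sequences.
    handles = re.findall(r"@\w+", text)
    urls = re.findall(r"https?://\S+", text)
    hashtags = re.findall(r"#\w+", text)
    body = re.sub(r"@\w+|https?://\S+|#\w+", " ", text)
    body = re.sub(r"[^\x00-\x7F]+", " ", body)
    # Case folding, stop word removal, and stemming.
    tokens = [STEMMER.stem(t) for t in body.lower().split() if t not in STOPWORDS]
    return {"text": " ".join(tokens), "handles": handles,
            "urls": urls, "hashtags": hashtags}

def clean_webpage(html):
    # readability drops advertisement, header, and footer content;
    # BeautifulSoup extracts the article text and outgoing links.
    article_html = Document(html).summary()
    soup = BeautifulSoup(article_html, "html.parser")
    outlinks = [a["href"] for a in soup.find_all("a", href=True)]
    body = re.sub(r"[^\x00-\x7F]+", " ", soup.get_text(" "))
    tokens = [STEMMER.stem(t) for t in body.lower().split() if t not in STOPWORDS]
    return {"text": " ".join(tokens), "outlinks": outlinks}
```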
- Social Communities Knowledge Discovery: Approaches applied to clinical study
  Chandrasekar, Prashant (Virginia Tech, 2017-05)
  In recent efforts by the Social Interactome team to validate hypotheses of the study, we have worked to make sense of the data collected during two 16-week experiments and three Amazon Mechanical Turk deployments. The complexity of the data has made it challenging to discover insights/patterns. The goal of the semester was to explore newer methods to analyze the data. Through such discovery, we can test/validate hypotheses about the data, which would provide a direction for our contextual inquiry to predict attributes and behavior of participants in the study. The report and slides highlight two possible approaches that employ statistical relational learning for structure learning and network classification. Related files include data and software used during this study; results are given from the analyses undertaken.
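As a rough illustration of network classification (a toy scheme, not the statistical relational learning tooling used in the study), an unobserved participant attribute can be predicted by repeatedly propagating the majority label of each participant's neighbors; all names and labels below are hypothetical:

```python
from collections import Counter

# Hypothetical friendship graph and partially observed attribute.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("a", "c")]
labels = {"a": "active", "b": "active", "e": "inactive"}   # known attributes

# Build an undirected adjacency map.
neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

predicted = dict(labels)
for _ in range(10):                        # a few propagation rounds
    for node in neighbors:
        if node in labels:                 # keep observed labels fixed
            continue
        votes = Counter(predicted[n] for n in neighbors[node] if n in predicted)
        if votes:
            predicted[node] = votes.most_common(1)[0][0]

print(predicted)   # e.g., {'a': 'active', 'b': 'active', 'e': 'inactive', 'c': 'active', ...}
```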
- Using Transactional Web Archives To Handle Server Errors
  Xie, Zhiwu; Chandrasekar, Prashant; Fox, Edward A. (2015-06)
  We describe a web archiving application that handles server errors using the most recently archived representation of the requested web resource. The application is developed as an Apache module. It leverages the transactional web archiving tool SiteStory, which archives all previously accessed representations of web resources originating from a website. This application helps to improve the website’s quality of service by temporarily masking server errors from the end user and gaining precious time for the system administrator to debug and recover from server failures. By providing pertinent support to website operations, we aim to reduce the resistance to transactional web archiving, which in turn may lead to a better coverage of web history.
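A minimal sketch of the fallback idea, written here as Python WSGI middleware rather than the actual Apache/SiteStory module; the archive endpoint is an assumed placeholder:

```python
import urllib.request

ARCHIVE_TIMEGATE = "http://archive.example.org/timegate/"   # assumed archive endpoint

class ArchiveFallbackMiddleware:
    """Serve the most recently archived copy when the origin app returns a 5xx error."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        captured = {}

        def capture(status, headers, exc_info=None):
            captured["status"] = status
            captured["headers"] = headers

        # Buffer the origin response so its status can be inspected first.
        body = b"".join(self.app(environ, capture))
        if captured["status"].startswith("5"):
            # Origin failed: try to fetch the latest archived representation instead.
            archived_url = ARCHIVE_TIMEGATE + environ.get("PATH_INFO", "/")
            try:
                with urllib.request.urlopen(archived_url, timeout=5) as resp:
                    content_type = resp.headers.get("Content-Type", "text/html")
                    archived_body = resp.read()
                start_response("200 OK", [("Content-Type", content_type)])
                return [archived_body]
            except Exception:
                pass   # archive unavailable; fall through to the original error
        start_response(captured["status"], captured["headers"])
        return [body]
```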
- A UWS Case for 200-Style Memento Negotiations
  Xie, Zhiwu; Chandrasekar, Prashant; Fox, Edward A. (IEEE Technical Committee on Digital Libraries, 2015-10)
  Uninterruptible web service (UWS) is a web archiving application that handles server errors using the most recently archived representation of the requested web resource. The application is developed as an Apache module. It leverages the transactional web archiving tool SiteStory, which archives all previously accessed representations of web resources originating from a website. This application helps to improve the website’s quality of service by temporarily masking server errors from the end user and gaining precious time for the system administrator to debug and recover from server failures. By providing value-added support to website operations, we aim to reduce the resistance to transactional web archiving, which in turn may lead to a better coverage of web history.
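A minimal sketch of 200-style Memento datetime negotiation from the client side, assuming a hypothetical TimeGate endpoint; under 200-style negotiation the TimeGate returns the selected memento directly (with a Memento-Datetime header) instead of redirecting, 302-style, to a separate memento URI:

```python
import urllib.request

TIMEGATE = "http://archive.example.org/timegate/"    # hypothetical TimeGate endpoint
ORIGINAL = "http://www.example.org/page.html"        # resource to look up

# Datetime negotiation: ask for the memento closest to the given datetime.
req = urllib.request.Request(
    TIMEGATE + ORIGINAL,
    headers={"Accept-Datetime": "Thu, 11 Jun 2015 00:00:00 GMT"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)                                # 200 under 200-style negotiation
    print(resp.headers.get("Memento-Datetime"))       # datetime of the returned capture
    print(resp.headers.get("Link"))                   # rel="original", "timemap", etc.
    archived_body = resp.read()
```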