Browsing by Author "Gruss, Richard"
Now showing 1 - 3 of 3
- Injury prevention for older adults: A dataset of safety concern narratives from online reviews of mobility-related products
  Restrepo, Felipe; Mali, Namrata; Sands, Laura P.; Abrahams, Alan; Goldberg, David M.; White, Janay; Prieto, Laura; Ractham, Peter; Gruss, Richard; Zaman, Nohel; Ehsani, Johnathon P. (Elsevier, 2022-06)
  Older adults are among the fastest-growing demographic groups in the United States, increasing by over a third in the past decade. Consequently, the older adult consumer product market has quickly become a multi-billion-dollar industry in which millions of products are sold every year. However, this rapidly growing market raises the potential for an increasing number of product safety concerns and consumer product-related injuries among older adults. Recent manufacturer and consumer injury prevention efforts have begun to turn towards online reviews, as these provide valuable information from which actionable, timely intelligence can be derived and used to detect safety concerns and prevent injury. The presented dataset contains 1,966 curated online product reviews from consumers, equally distributed between safety concerns and non-concerns, pertaining to product categories typically intended for older adults. Identified safety concerns were manually sub-coded across thirteen dimensions designed to capture relevant aspects of the consumer's experience with the purchased product, facilitate the safety concern identification and sub-classification process, and serve as a gold-standard, balanced dataset for text classifier learning. (c) 2022 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)
- OutbreakSum: Automatic Summarization of Texts Relating to Disease Outbreaks
  Gruss, Richard; Morgado, Daniel; Craun, Nate; Shea-Blymyer, Colin (2014-12)
  The goal of the fall 2014 Disease Outbreak Project (OutbreakSum) was to develop software for automatically analyzing and summarizing large collections of texts pertaining to disease outbreaks. Although our code was tested on collections about specific diseases--a small one about Encephalitis and a large one about Ebola--most of our tools would work on texts about any infectious disease, where the key information relates to locations, dates, number of cases, symptoms, prognosis, and government and healthcare organization interventions. In the course of the project, we developed a code base that performs several key Natural Language Processing (NLP) functions. Tools that could be useful for other Natural Language Generation (NLG) projects include:
  1. A framework for developing MapReduce programs in Python that allows for local running and debugging;
  2. Tools for document collection cleanup, such as small-file removal, duplicate-file removal (based on content hashes), sentence and paragraph tokenization, non-relevant file removal, and encoding translation;
  3. Utilities that simplify and speed up Named Entity Recognition with Stanford NER by using the Java API directly;
  4. Utilities that leverage the full extent of the Stanford CoreNLP library, including tools for parsing and coreference resolution;
  5. Utilities that simplify using the OpenNLP Java library for text processing: by configuring and running a single Java class, you can use OpenNLP to perform part-of-speech tagging and named entity recognition on an entire collection in minutes.
  The tools available in OutbreakSum fall into four major modules: 1. Collection Processing; 2. Local Language Processing; 3. MapReduce with Apache Hadoop; 4. Summarization.
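  The duplicate-file removal step described above hinges on comparing content hashes rather than filenames. As a minimal illustrative sketch (not the project's actual implementation; the flat directory layout and choice of SHA-256 are assumptions), the idea can be expressed in a few lines of Python:

  ```python
  import hashlib
  from pathlib import Path

  def remove_duplicate_files(collection_dir: str) -> int:
      """Delete files whose exact content has already been seen.

      Hypothetical sketch of content-hash deduplication: hash each
      file's bytes, and unlink any file whose digest repeats.
      Returns the number of files removed.
      """
      seen = set()
      removed = 0
      for path in sorted(Path(collection_dir).iterdir()):
          if not path.is_file():
              continue
          digest = hashlib.sha256(path.read_bytes()).hexdigest()
          if digest in seen:
              path.unlink()  # duplicate content: drop this copy
              removed += 1
          else:
              seen.add(digest)
      return removed
  ```

  Hashing the bytes (rather than comparing names or sizes) catches the common crawl artifact of the same document saved under different filenames.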
- Solr Team Project Report
  Gruss, Richard; Choudhury, Ananya; Komawar, Nikhil (2015-05-13)
  The Integrated Digital Event Archive and Library (IDEAL) is a Digital Library project that aims to collect, index, archive, and provide access to digital content related to important events, including man-made or natural disasters. It extracts event data mostly from social media sites such as Twitter and crawls related web pages. However, the volume of information on the web about any event is enormous and highly noisy, making it extremely difficult to retrieve specific information. The objective of this course was to build a state-of-the-art information retrieval system in support of the IDEAL project. The class was divided into eight teams, each assigned a part of the project that, when successfully implemented, would enhance the IDEAL project's functionality. The final product, the culmination of these eight teams' efforts, is a fast and efficient search engine for events occurring around the world. This report describes the work completed by the Solr team as a contribution towards searching and retrieving the tweets and web pages archived by IDEAL. If we visualize the class project as a tree structure, then Solr is the root of the tree, which builds on all other teams' efforts. Hence, we actively interacted with all other teams to agree on a generic schema for the documents and their corresponding metadata to be indexed by Solr. Because Solr interacts with HDFS via HBase, where the data is stored, we also defined an HBase schema and configured the Lily Indexer to set up fast communication between HBase and Solr. We batch-indexed 8.5 million of the 84 million tweets before encountering memory limitations on the single-node Solr installation. Focusing our efforts therefore on building a search experience around the small collections, we created a 3.4-million-tweet collection and a 12,000-webpage collection. Our custom search, which leverages the differential field weights in Solr's edismax Query Parser and two custom Query Components, achieved precision levels in excess of 90%.
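  Differential field weighting in Solr's edismax parser is expressed through the `qf` parameter, where each field carries a boost. The following sketch shows how such a weighted query URL might be assembled; the field names (`title_t`, `text_t`), boost values, and Solr URL are illustrative assumptions, not the team's actual schema or weights:

  ```python
  from urllib.parse import urlencode

  def build_edismax_query(base_url: str, user_query: str) -> str:
      """Build a Solr /select URL using the edismax parser with
      per-field boosts. Field names and boosts are hypothetical."""
      params = {
          "q": user_query,
          "defType": "edismax",            # use the Extended DisMax parser
          "qf": "title_t^3.0 text_t^1.0",  # score title matches 3x body text
          "rows": "10",
          "wt": "json",
      }
      return f"{base_url}/select?{urlencode(params)}"
  ```

  A call like `build_edismax_query("http://localhost:8983/solr/tweets", "ebola outbreak")` produces a URL whose `qf` parameter tells Solr to weight title-field matches three times as heavily as body-text matches when ranking results.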