Destination Area: Data and Decisions (D&D)

The DA Data and Decisions advances the human condition and society with better decisions through data. D&D integrates all DAs and SGAs with data analytics and decision sciences. Work in this area embraces equity in the human condition by seeking the equitable distribution and availability of physical safety and well-being, psychological well-being, respect for human dignity, and access to crucial material and social resources throughout the world’s diverse communities. D&D also addresses policymaking and policy analysis, collaborating at the intersection of scientific evidence, governance, and analyses to translate scholarship into practice.

Recent Submissions

  • Exploiting big data for customer and retailer benefits: A study of emerging mobile checkout scenarios
    Aloysius, John A.; Hoehle, Hartmut; Venkatesh, Viswanath (Emerald, 2016-01-01)
    Purpose – Mobile checkout in the retail store has the promise to be a rich source of big data. It is also a means to increase the rate at which big data flows into an organization as well as the potential to integrate product recommendations and promotions in real time. However, despite efforts by retailers to implement this retail innovation, adoption by customers has been slow. The paper aims to discuss these issues. Design/methodology/approach – Based on interviews and focus groups with leading retailers, technology providers, and service providers, the authors identified several emerging in-store mobile scenarios; and based on customer focus groups, the authors identified potential drivers and inhibitors of use. Findings – A first departure from the traditional customer checkout process flow is that a mobile checkout involves two processes: scanning and payment, and that checkout scenarios with respect to each of these processes varied across two dimensions: first, location – whether they were fixed by location or mobile; and second, autonomy – whether they were assisted by store employees or unassisted. The authors found no evidence that individuals found mobile scanning to be either enjoyable or to have utilitarian benefit. The authors also did not find greater privacy concerns with mobile payments scenarios. The authors did, however, in the post hoc analysis find that mobile unassisted scanning was preferred to mobile assisted scanning. The authors also found that mobile unassisted scanning with fixed unassisted checkout was a preferred service mode, while there was evidence that mobile assisted scanning with mobile assisted payment was the least preferred checkout mode. Finally, the authors found that individual differences including computer self-efficacy, personal innovativeness, and technology anxiety were strong predictors of adoption of mobile scanning and payment scenarios. 
    Originality/value – The work helps the authors understand the emerging mobile checkout scenarios in the retail environment and customer reactions to these scenarios.
  • Prediction of condition-specific regulatory genes using machine learning
    Song, Qi; Lee, Jiyoung; Akter, Shamima; Rogers, Matthew; Grene, Ruth; Li, Song (Oxford University Press, 2020-06-19)
    Recent advances in genomic technologies have generated data on large-scale protein–DNA interactions and open chromatin regions for many eukaryotic species. How to identify condition-specific functions of transcription factors using these data has become a major challenge in genomic research. To solve this problem, we have developed a method called ConSReg, which provides a novel approach to integrate regulatory genomic data into predictive machine learning models of key regulatory genes. Using Arabidopsis as a model system, we tested our approach to identify regulatory genes in data sets from single cell gene expression and from abiotic stress treatments. Our results showed that ConSReg accurately predicted transcription factors that regulate differentially expressed genes with an average auROC of 0.84, which is 23.5–25% better than enrichment-based approaches. To further validate the performance of ConSReg, we analyzed an independent data set related to plant nitrogen responses. ConSReg provided better rankings of the correct transcription factors in 61.7% of cases, which is three times better than other plant tools. We applied ConSReg to Arabidopsis single cell RNA-seq data, successfully identifying candidate regulatory genes that control cell wall formation. Our methods provide a new approach to define candidate regulatory genes using integrated genomic data in plants.
  • Decision-adjusted driver risk predictive models using kinematics information
    Mao, Huiying; Guo, Feng; Deng, Xinwei; Doerzaph, Zachary R. (Elsevier, 2021-06)
    Accurate prediction of driving risk is challenging due to the rarity of crashes and individual driver heterogeneity. One promising direction of tackling this challenge is to take advantage of telematics data, increasingly available from connected vehicle technology, to obtain dense risk predictors. In this work, we propose a decision-adjusted framework to develop optimal driver risk prediction models using telematics-based driving behavior information. We apply the proposed framework to identify the optimal threshold values for elevated longitudinal acceleration (ACC), deceleration (DEC), lateral acceleration (LAT), and other model parameters for predicting driver risk. The Second Strategic Highway Research Program (SHRP 2) naturalistic driving data were used with the decision rule of identifying the top 1% to 20% of the riskiest drivers. The results show that the decision-adjusted model improves prediction precision by 6.3% to 26.1% compared to a baseline model using non-telematics predictors. The proposed model is superior to models based on a receiver operating characteristic curve criterion, with 5.3% and 31.8% improvement in prediction precision. The results confirm that the optimal thresholds for ACC, DEC and LAT are sensitive to the decision rules, especially when predicting a small percentage of high-risk drivers. This study demonstrates the value of kinematic driving behavior in crash risk prediction and the necessity for a systematic approach for extracting prediction features. The proposed method can benefit broad applications, including fleet safety management, use-based insurance, driver behavior intervention, as well as connected-vehicle safety technology development.
  • Sensory Descriptor Analysis of Whisky Lexicons through the Use of Deep Learning
    Miller, Chreston; Hamilton, Leah; Lahne, Jacob (MDPI, 2021-07-14)
    This paper is concerned with extracting relevant terms from a text corpus on whisk(e)y. “Relevant” terms are usually contextually defined in their domain of use. Arguably, every domain has a specialized vocabulary used for describing things. For example, the field of Sensory Science, a sub-field of Food Science, investigates human responses to food products and differentiates “descriptive” terms for flavors from “ordinary”, non-descriptive language. Within the field, descriptors are generated through Descriptive Analysis, a method wherein a human panel of experts tastes multiple food products and defines descriptors. This process is both time-consuming and expensive. However, one could leverage existing data to identify and build a flavor language automatically. For example, there are thousands of professional and semi-professional reviews of whisk(e)y published on the internet, providing abundant descriptors interspersed with non-descriptive language. The aim, then, is to be able to automatically identify descriptive terms in unstructured reviews for later use in product flavor characterization. We created two systems to perform this task. The first is an interactive visual tool that can be used to tag examples of descriptive terms from thousands of whisky reviews. This creates a training dataset that we use to perform transfer learning using GloVe word embeddings and a Long Short-Term Memory deep learning model architecture. The result is a model that can accurately identify descriptors within a corpus of whisky review texts with a train/test accuracy of 99% and precision, recall, and F1-scores of 0.99. We tested for overfitting by comparing the training and validation loss for divergence. Our results show that the language structure for descriptive terms can be programmatically learned.
  • Observing a Global Pandemic from Space: Evaluating Participatory Geographic Information Systems (PGIS) during the SARS-CoV-2 Pandemic
    DuChesne, Danielle (Virginia Tech, 2021-04-30)
    When the novel SARS-CoV-2 virus emerged in December 2019, GIS technologies and web-based GIS dashboards were rapidly employed to share information regarding disease spread and impact on society. Because GIS-based tools can provide spatial complexity, interactivity, and interconnectedness, their popularity for helping solve multifaceted problems has grown. These efforts from citizens and scientists alike to engage in Participatory GIS (PGIS) were essential for timely and effective epidemic monitoring and response. However, the original intent of PGIS to involve the public in geographical mapping to uncover context-sensitive place-based information (Brown & Kyttä, 2014) has also created discrepancies, such as ignoring the sociopolitical context of data and disregarding common geovisualization best practices. This poster evaluates the challenges of PGIS in analyzing data as it was used during the current global pandemic by exploring COVIDPoops19, a PGIS dashboard tracking wastewater testing, and describes potential solutions from interdisciplinary frameworks that allow for better decision making, planning, and community action.
  • An Unsupervised Probabilistic Method for Large Scale Flood Mapping: Exploring Archive of Sentinel-1A/B Satellites over India
    Sherpa, Sonam Futi (Virginia Tech, 2021-04-30)
    Synthetic aperture radar (SAR) imaging provides an all-weather sensing technique that is suitable for near-real-time mapping of disasters such as floods. In this article, I use SAR data acquired by Sentinel-1A/B satellites to investigate a flood event that affected the Indian state of Kerala in August 2018. I apply a Bayesian approach to generate probabilistic flood maps, which contain for each pixel its probability of being flooded rather than binary flood information. I find that the extent of the flooded area begins to increase throughout Kerala after August 8, with the highest values on August 9 and August 21. I observe no apparent correlation between the spatial distributions of the flooded areas and the rainfall amounts at the district level of the study area. Instead, larger flooded areas are in the districts of Alappuzha and Kottayam, located in the downstream floodplain of the Idukki dam, which released a significant volume of water on August 16. The lack of apparent correlation is likely due to two reasons: first, there is often some delay between the rainfall event and the flooding, especially for rather large catchments where flood waves need some time to reach floodplains from higher elevations. Second, rainfall is more abundant at overhead catchments (hills and mountains), whereas flooding occurs further downstream in the floodplains. Further comparison of our SAR-based flood maps with optical data and flood maps produced by moderate resolution imaging spectroradiometer highlights the advantages of our data and approach for rapid response purposes and future flood forecasting.
  • Sensing Earth and environment dynamics by telecommunication fiber-optic sensors: an urban experiment in Pennsylvania, USA
    Zhu, Tieyuan; Shen, Junzhu; Martin, Eileen R. (Copernicus Publications, 2021-01-28)
    Continuous seismic monitoring of the Earth’s near surface (top 100 m), especially with improved resolution and extent of data both in space and time, would yield more accurate insights about the effect of extreme-weather events (e.g., flooding or drought) and climate change on the Earth’s surface and subsurface systems. However, continuous long-term seismic monitoring, especially in urban areas, remains challenging. We describe the Fiber Optic foR Environmental SEnsEing (FORESEE) project in Pennsylvania, USA, the first continuous-monitoring distributed acoustic sensing (DAS) fiber array in the eastern USA. This array is made up of nearly 5 km of pre-existing dark telecommunication fiber underneath the Pennsylvania State University campus. A major thrust of this experiment is the study of urban geohazard and hydrological systems through near-surface seismic monitoring. Here we detail the FORESEE experiment deployment and instrument calibration, and describe multiple observations of seismic sources in the first year. We calibrate the array by comparison to earthquake data from a nearby seismometer and to active-source geophone data. We observed a wide variety of seismic signatures in our DAS recordings: natural events (earthquakes and thunderstorms) and anthropogenic events (mining blasts, vehicles, music concerts and walking steps). Preliminary analysis of these signals suggests DAS has the capability to sense broadband vibrations and discriminate between seismic signatures of different quakes and anthropogenic sources. With the success of collecting 1 year of continuous DAS recordings, we conclude that DAS along with telecommunication fiber will potentially serve the purpose of continuous near-surface seismic monitoring in populated areas.
  • A Specialized Data Crawler for Cross-Laminated Timber Information Resources
    Thomas, Ed; Espinoza, Omar A.; Bora, Rahul; Buehlmann, Urs (2020)
    The Internet is composed of more than 6.2 billion Web pages and grows larger every day. As the number of links and specialty subject areas grows, it becomes ever more difficult to find pertinent information. For some subject areas, special-purpose data crawlers continually search the Internet for specific information; examples include real estate, air travel, auto sales, and others. The use of such special-purpose data crawlers (i.e., targeted crawlers and knowledge databases) also allows the collection and analysis of agricultural and forestry data. Such single-purpose crawlers can search for hundreds of key words and use machine learning to determine if what is found is relevant. In this article, we examine the design and data return of such a specialty knowledge database and crawler system developed to find information related to cross-laminated timber (CLT). Our search engine uses intelligent software to locate and update pertinent references related to CLT as well as to categorize information with respect to common application and interest areas. At the time of this publication, the CLT knowledge database has cataloged nearly 3,000 publications regarding various aspects of CLT.
  • Community-Driven Metadata Standards for Agricultural Microbiome Research
    Dundore-Arias, Jose Pablo; Eloe-Fadrosh, Emiley A.; Schriml, Lynn M.; Beattie, Gwyn A.; Brennan, Fiona P.; Busby, Posy E.; Calderon, Rosalie B.; Castle, Sarah C.; Emerson, Joanne B.; Everhart, Sydney E.; Eversole, Kellye; Frost, Kenneth E.; Herr, Joshua R.; Huerta, Alejandra I.; Iyer-Pascuzzi, Anjali S.; Kalil, Audrey K.; Leach, Jan E.; Leonard, J.; Maul, Jude E.; Prithiviraj, Bharath; Potrykus, Marta; Redekar, Neelam R.; Rojas, J. Alejandro; Silverstein, Kevin A. T.; Tomso, Daniel J.; Tringe, Susannah G.; Vinatzer, Boris A.; Kinkel, Linda L. (2020-02-20)
    Accelerating the pace of microbiome science to enhance crop productivity and agroecosystem health will require transdisciplinary studies, comparisons among datasets, and synthetic analyses of research from diverse crop management contexts. However, despite the widespread availability of crop-associated microbiome data, variation in field sampling and laboratory processing methodologies, as well as metadata collection and reporting, significantly constrains the potential for integrative and comparative analyses. Here we discuss the need for agriculture-specific metadata standards for microbiome research, and propose a list of "required" and "desirable" metadata categories and ontologies essential to be included in a future minimum information metadata standards checklist for describing agricultural microbiome studies. We begin by briefly reviewing existing metadata standards relevant to agricultural microbiome research, and describe ongoing efforts to enhance the potential for integration of data across research studies. Our goal is not to delineate a fixed list of metadata requirements. Instead, we hope to advance the field by providing a starting point for discussion, and inspire researchers to adopt standardized procedures for collecting and reporting consistent and well-annotated metadata for agricultural microbiome research.
  • Finding What Is Inaccessible: Antimicrobial Resistance Language Use among the One Health Domains
    Wind, Lauren L.; Briganti, Jonathan; Brown, Anne M.; Neher, Timothy P.; Davis, Meghan F.; Durso, Lisa M.; Spicer, Tanner; Lansing, Stephanie (MDPI, 2021-04-03)
    The success of a One Health approach to combating antimicrobial resistance (AMR) requires effective data sharing across the three One Health domains (human, animal, and environment). To investigate if there are differences in language use across the One Health domains, we examined the peer-reviewed literature using a combination of text data mining and natural language processing techniques on 20,000 open-access articles related to AMR and One Health. Evaluating AMR key term frequency from the European PubMed Collection published between 1990 and 2019 showed distinct AMR language usage within each domain and incongruent language usage across domains, with significant differences in key term usage frequencies when articles were grouped by the One Health sub-specialties (2-way ANOVA; p < 0.001). Over the 29-year period, “antibiotic resistance” and “AR” were used 18 times more than “antimicrobial resistance” and “AMR”. The discord of language use across One Health potentially weakens the effectiveness of interdisciplinary research by creating accessibility issues for researchers using search engines. This research was the first to quantify this disparate language use within One Health, which inhibits collaboration and crosstalk between domains. We suggest the following for authors publishing AMR-related research within the One Health context: (1) increase title/abstract searchability by including both antimicrobial and antibiotic resistance related search terms; (2) include “One Health” in the title/abstract; and (3) prioritize open-access publication.
  • AgroSeek: a system for computational analysis of environmental metagenomic data and associated metadata
    Liang, Xiao; Akers, Kyle; Keenum, Ishi M.; Wind, Lauren L.; Gupta, Suraj; Chen, Chaoqi; Aldaihani, Reem; Pruden, Amy; Zhang, Liqing; Knowlton, Katharine F.; Xia, Kang; Heath, Lenwood S. (2021-03-10)
    Background: Metagenomics is gaining attention as a powerful tool for identifying how agricultural management practices influence human and animal health, especially in terms of potential to contribute to the spread of antibiotic resistance. However, the ability to compare the distribution and prevalence of antibiotic resistance genes (ARGs) across multiple studies and environments is currently impossible without a complete re-analysis of published datasets. This challenge must be addressed for metagenomics to realize its potential for helping guide effective policy and practice measures relevant to agricultural ecosystems, for example, identifying critical control points for mitigating the spread of antibiotic resistance. Results: Here we introduce AgroSeek, a centralized web-based system that provides computational tools for analysis and comparison of metagenomic data sets tailored specifically to researchers and other users in the agricultural sector interested in tracking and mitigating the spread of ARGs. AgroSeek draws from rich, user-provided metagenomic data and metadata to facilitate analysis, comparison, and prediction in a user-friendly fashion. Further, AgroSeek draws from publicly-contributed data sets to provide a point of comparison and context for data analysis. To incorporate metadata into our analysis and comparison procedures, we provide flexible metadata templates, including user-customized metadata attributes to facilitate data sharing, while maintaining the metadata in a comparable fashion for the broader user community and to support large-scale comparative and predictive analysis. Conclusion: AgroSeek provides an easy-to-use tool for environmental metagenomic analysis and comparison, based on both gene annotations and associated metadata, with this initial demonstration focusing on control of antibiotic resistance in agricultural ecosystems.
    AgroSeek creates a space for metagenomic data sharing and collaboration to assist policy makers, stakeholders, and the public in decision-making. AgroSeek is publicly-available at https://agroseek.cs.vt.edu/.
  • A Classifier to Detect Informational vs. Non-Informational Heart Attack Tweets
    Karajeh, Ola; Darweesh, Dirar; Darwish, Omar; Abu-El-Rub, Noor; Alsinglawi, Belal; Alsaedi, Nasser (MDPI, 2021-01-16)
    Social media sites are considered one of the most important sources of data in many fields, such as health, education, and politics. While surveys provide explicit answers to specific questions, posts in social media have the same answers implicitly occurring in the text. This research aims to develop a method for extracting implicit answers from large tweet collections, and to demonstrate this method for an important concern: the problem of heart attacks. The approach is to collect tweets containing “heart attack” and then select from those the ones with useful information. Informational tweets are those which express real heart attack issues, e.g., “Yesterday morning, my grandfather had a heart attack while he was walking around the garden.” On the other hand, there are non-informational tweets such as “Dropped my iPhone for the first time and almost had a heart attack.” The starting point was to manually classify around 7000 tweets as either informational (11%) or non-informational (89%), thus yielding a labeled dataset to use in devising a machine learning classifier that can be applied to our large collection of over 20 million tweets. Tweets were cleaned and converted to a vector representation suitable to be fed into different machine-learning algorithms: deep neural networks, support vector machine (SVM), J48 decision tree, and naïve Bayes. Our experimentation aimed to find the best algorithm to use to build a high-quality classifier. This involved splitting the labeled dataset, with 2/3 used to train the classifier and 1/3 used for evaluation, in addition to cross-validation methods. The deep neural network (DNN) classifier obtained the highest accuracy (95.2%). In addition, it obtained the highest F1-scores, 73.6% and 97.4% for the informational and non-informational classes, respectively.
  • Applying GIS and Text Mining Methods to Twitter Data to Explore the Spatiotemporal Patterns of Topics of Interest in Kuwait
    G. Almatar, Muhammad; Alazmi, Huda S.; Li, Liuqing; Fox, Edward A. (MDPI, 2020-11-25)
    Researchers have developed various approaches for exploring the spatial information, temporal patterns, and Twitter content in topics of interest in order to generate a better understanding of human behavior; however, few investigations have integrated these three dimensions simultaneously. This study analyzes the content of tweets to conduct a spatiotemporal exploration of the main topics of interest in Kuwait, providing a deeper understanding of the topics people think about, when they think about them, and where they tweet about them. To this end, we collect, process, and analyze tweets from nearly 120 areas in Kuwait over a 10-month period. The study’s results indicate that religion, emotions, education, and public policy are the most popular topics of interest in Kuwait. Regarding the spatiotemporal analysis, people post more tweets regarding religion on Fridays, a holy day for Muslims in Kuwait. Moreover, people are more likely to tweet about policy and education on weekdays rather than weekends. In contrast, people tweet about emotional expressions more often on weekends. From a spatial perspective, spatial clustering in topics occurs across the days of the week. The findings are applicable to further topic analysis and similar research in other countries.
  • Using artificial intelligence for improving stroke diagnosis in emergency departments: a practical framework
    Abedi, Vida; Khan, Ayesha; Chaudhary, Durgesh; Misra, Debdipto; Avula, Venkatesh; Mathrawala, Dhruv; Kraus, Chadd; Marshall, Kyle A.; Chaudhary, Nayan; Li, Xiao; Schirmer, Clemens M.; Scalzo, Fabien; Li, Jiang; Zand, Ramin (2020-08)
    Stroke is the fifth leading cause of death in the United States and a major cause of severe disability worldwide. Yet, recognizing the signs of stroke in an acute setting is still challenging and leads to loss of opportunity to intervene, given the narrow therapeutic window. A decision support system using artificial intelligence (AI) and clinical data from electronic health records combined with patients' presenting symptoms can be designed to support emergency department providers in stroke diagnosis and subsequently reduce the treatment delay. In this article, we present a practical framework to develop a decision support system using AI by reflecting on the various stages, which could eventually improve patient care and outcome. We also discuss the technical, operational, and ethical challenges of the process.
  • Combining expert and crowd-sourced training data to map urban form and functions for the continental US
    Demuzere, Matthias; Hankey, Steven C.; Mills, Gerald; Zhang, Wenwen; Lu, Tianjun; Bechtel, Benjamin (2020-08-11)
    Although continental urban areas are relatively small, they are major drivers of environmental change at local, regional and global scales. Moreover, they are especially vulnerable to these changes owing to the concentration of population and their exposure to a range of hydro-meteorological hazards, emphasizing the need for spatially detailed information on urbanized landscapes. These data need to be consistent in content and scale and provide a holistic description of urban layouts to address different user needs. Here, we map the continental United States into Local Climate Zone (LCZ) types at a 100 m spatial resolution using expert and crowd-sourced information. There are 10 urban LCZ types, each associated with a set of relevant variables such that the map represents a valuable database of urban properties. These data are benchmarked against continental-wide existing and novel geographic databases on urban form. We anticipate the dataset provided here will be useful for researchers and practitioners to assess how the configuration, size, and shape of cities impact the important human and environmental outcomes.
  • Mobile phone use is associated with higher smallholder agricultural productivity in Tanzania, East Africa
    Quandt, Amy; Salerno, Jonathan D.; Neff, Jason C.; Baird, Timothy D.; Herrick, Jeffrey E.; McCabe, J. Terrence; Xu, Emilie; Hartter, Joel (PLOS, 2020-08-06)
    Mobile phone use is increasing in Sub-Saharan Africa, spurring a growing focus on mobile phones as tools to increase agricultural yields and incomes on smallholder farms. However, the research to date on this topic is mixed, with studies finding both positive and neutral associations between phones and yields. In this paper we examine perceptions about the impacts of mobile phones on agricultural productivity, and the relationships between mobile phone use and agricultural yield. We do so by fitting multilevel statistical models to data from farmer-phone owners (n = 179) in 4 rural communities in Tanzania, controlling for site and demographic factors. Results show a positive association between mobile phone use for agricultural activities and reported maize yields. Further, many farmers report that mobile phone use increases agricultural profits (67% of respondents) and decreases the costs (50%) and time investments (47%) of farming. Our findings suggest that there are opportunities to target policy interventions at increasing phone use for agricultural activities in ways that facilitate access to timely, actionable information to support farmer decision making.
  • DeepMicro: deep representation learning for disease prediction based on microbiome data
    Oh, Min; Zhang, Liqing (Nature Research, 2020)
    Human microbiota plays a key role in human health, and growing evidence supports the potential use of the microbiome as a predictor of various diseases. However, the high dimensionality of microbiome data, often on the order of hundreds of thousands, combined with low sample sizes, poses a great challenge for machine learning-based prediction algorithms. This imbalance makes the data highly sparse, which hinders learning a better prediction model. Also, there has been little work on deep learning applications to microbiome data with a rigorous evaluation scheme. To address these challenges, we propose DeepMicro, a deep representation learning framework allowing for an effective representation of microbiome profiles. DeepMicro successfully transforms high-dimensional microbiome data into a robust low-dimensional representation using various autoencoders and applies machine learning classification algorithms on the learned representation. In disease prediction, DeepMicro outperforms the current best approaches based on the strain-level marker profile in five different datasets. In addition, by significantly reducing the dimensionality of the marker profile, DeepMicro accelerates the model training and hyperparameter optimization procedure with 8X–30X speedup over the basic approach. DeepMicro is freely available at https://github.com/minoh0201/DeepMicro.
  • Enhancing big data in the social sciences with crowdsourcing: Data augmentation practices, techniques, and opportunities
    Porter, Nathaniel D.; Verdery, Ashton M.; Gaddis, S. Michael (2020-06-10)
    Proponents of big data claim it will fuel a social research revolution, but skeptics challenge its reliability and decontextualization. The largest subset of big data is not designed for social research. Data augmentation (systematic assessment of measurement against known quantities and expansion of extant data with new information) is an important tool to maximize such data's validity and research value. Using trained research assistants or specialized algorithms are common approaches to augmentation but may not scale to big data or appease skeptics. We consider a third alternative: data augmentation with online crowdsourcing. Three empirical cases illustrate strengths and limitations of crowdsourcing, using Amazon Mechanical Turk to verify automated coding, link online databases, and gather data on online resources. Using these, we develop best practice guidelines and a reporting template to enhance reproducibility. Carefully designed, correctly applied, and rigorously documented crowdsourcing helps address concerns about big data's usefulness for social research.
  • The PATRIC Bioinformatics Resource Center: expanding data and analysis capabilities
    Davis, James J.; Wattam, Alice R.; Aziz, Ramy K.; Brettin, Thomas; Butler, Ralph; Butler, Rory M.; Chlenski, Philippe; Conrad, Neal; Dickerman, Allan W.; Dietrich, Emily M.; Gabbard, Joseph L.; Gerdes, Svetlana; Guard, Andrew; Kenyon, Ronald W.; Machi, Dustin; Mao, Chunhong; Murphy-Olson, Daniel E.; Nguyen, Marcus; Nordberg, Eric K.; Olsen, Gary J.; Olson, Robert D.; Overbeek, Jamie C.; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D.; Shukla, Maulik; Thomas, Chris; VanOeffelen, Margo; Vonstein, Veronika; Warren, Andrew S.; Xia, Fangfang; Xie, Dawen; Yoo, Hyunseung; Stevens, Rick L. (2020-01-08)
    The PathoSystems Resource Integration Center (PATRIC) is the bacterial Bioinformatics Resource Center funded by the National Institute of Allergy and Infectious Diseases (https://www.patricbrc.org). PATRIC supports bioinformatic analyses of all bacteria with a special emphasis on pathogens, offering a rich comparative analysis environment that provides users with access to over 250 000 uniformly annotated and publicly available genomes with curated metadata. PATRIC offers web-based visualization and comparative analysis tools, a private workspace in which users can analyze their own data in the context of the public collections, services that streamline complex bioinformatic workflows and command-line tools for bulk data analysis. Over the past several years, as genomic and other omics-related experiments have become more cost-effective and widespread, we have observed considerable growth in the usage of and demand for easy-to-use, publicly available bioinformatic tools and services. Here we report the recent updates to the PATRIC resource, including new web-based comparative analysis tools, eight new services and the release of a command-line interface to access, query and analyze data.
  • Association of Blood Biomarkers With Acute Sport-Related Concussion in Collegiate Athletes: Findings From the NCAA and Department of Defense CARE Consortium
    McCrea, Michael A.; Broglio, Steven P.; McAllister, Thomas W.; Gill, Jessica M.; Giza, Christopher C.; Huber, Daniel L.; Harezlak, Jaroslaw; Cameron, Kenneth L.; Houston, Megan N.; McGinty, Gerald T.; Jackson, Jonathan C.; Guskiewicz, Kevin M.; Mihalik, Jason P.; Brooks, M. Alison; Duma, Stefan M.; Rowson, Steven; Nelson, Lindsay D.; Pasquina, Paul; Meier, Timothy B.; Foroud, Tatiana; Katz, Barry P.; Saykin, Andrew J.; Campbell, Darren E.; Svoboda, Steven J.; Goldman, Joshua T.; DiFiori, John P. (2020-01-24)
    Question: Is sport-related concussion associated with levels of traumatic brain injury biomarkers in collegiate athletes? Findings: In this case-control study of 504 collegiate athletes with concussion, contact sport control athletes, and non-contact sport athletes, the athletes with concussion had significant elevations in multiple traumatic brain injury biomarkers compared with preseason baseline and with 2 groups of control athletes without concussion during the acute postinjury period. Meaning: These results suggest that blood biomarkers can be used as research tools to inform the underlying pathophysiological mechanism of concussion and provide additional support for future studies to optimize and validate biomarkers for potential clinical use in sport-related concussion. This case-control study examines the association between sport-related concussion and levels of traumatic brain injury biomarkers in collegiate athletes. Importance: There is potential scientific and clinical value in validation of objective biomarkers for sport-related concussion (SRC). Objective: To investigate the association of acute-phase blood biomarker levels with SRC in collegiate athletes. Design, Setting, and Participants: This multicenter, prospective, case-control study was conducted by the National Collegiate Athletic Association (NCAA) and the US Department of Defense Concussion Assessment, Research, and Education (CARE) Consortium from February 20, 2015, to May 31, 2018, at 6 CARE Advanced Research Core sites. A total of 504 collegiate athletes with concussion, contact sport control athletes, and non-contact sport control athletes completed clinical testing and blood collection at preseason baseline, the acute postinjury period, 24 to 48 hours after injury, the point of reporting being asymptomatic, and 7 days after return to play. Data analysis was conducted from March 1 to November 30, 2019.
    Main Outcomes and Measures: Glial fibrillary acidic protein (GFAP), ubiquitin C-terminal hydrolase-L1 (UCH-L1), neurofilament light chain, and tau were quantified using the Quanterix Simoa multiplex assay. Clinical outcome measures included the Sport Concussion Assessment Tool-Third Edition (SCAT-3) symptom evaluation, Standardized Assessment of Concussion, Balance Error Scoring System, and Brief Symptom Inventory 18. Results: A total of 264 athletes with concussion (mean [SD] age, 19.08 [1.24] years; 211 [79.9%] male), 138 contact sport controls (mean [SD] age, 19.03 [1.27] years; 107 [77.5%] male), and 102 non-contact sport controls (mean [SD] age, 19.39 [1.25] years; 82 [80.4%] male) were included in the study. Athletes with concussion had significant elevation in GFAP (mean difference, 0.430 pg/mL; 95% CI, 0.339-0.521 pg/mL; P < .001), UCH-L1 (mean difference, 0.449 pg/mL; 95% CI, 0.167-0.732 pg/mL; P < .001), and tau levels (mean difference, 0.221 pg/mL; 95% CI, 0.046-0.396 pg/mL; P = .004) at the acute postinjury time point compared with preseason baseline. Longitudinally, a significant interaction (group × visit) was found for GFAP ($F_{7,1507.36} = 16.18$, P < .001), UCH-L1 ($F_{7,1153.09} = 5.71$, P < .001), and tau ($F_{7,1480.55} = 6.81$, P < .001); the interaction for neurofilament light chain was not significant ($F_{7,1506.90} = 1.33$, P = .23). The area under the curve for the combination of GFAP and UCH-L1 in differentiating athletes with concussion from contact sport controls at the acute postinjury period was 0.71 (95% CI, 0.64-0.78; P < .001); the acute postinjury area under the curve for all 4 biomarkers combined was 0.72 (95% CI, 0.65-0.79; P < .001). Beyond SCAT-3 symptom score, GFAP at the acute postinjury time point was associated with the classification of athletes with concussion from contact controls (β = 12.298; 95% CI, 2.776-54.481; P = .001) and non-contact sport controls (β = 5.438; 95% CI, 1.676-17.645; P = .005).
    Athletes with concussion with loss of consciousness or posttraumatic amnesia had significantly higher levels of GFAP than athletes with concussion with neither loss of consciousness nor posttraumatic amnesia at the acute postinjury time point (mean difference, 0.583 pg/mL; 95% CI, 0.369-0.797 pg/mL; P < .001). Conclusions and Relevance: The results suggest that blood biomarkers can be used as research tools to inform the underlying pathophysiological mechanism of concussion and provide additional support for future studies to optimize and validate biomarkers for potential clinical use in SRC.
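    The area under the curve reported in the abstract above can be read as the probability that a randomly chosen athlete with concussion has a higher marker score than a randomly chosen control. The following is a minimal sketch of that pairwise definition only; the function name `auc` and the toy values are hypothetical illustrations, not data or code from the study, and the study's own estimate may have been computed differently.

```python
def auc(case_scores, control_scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    Returns the fraction of (case, control) pairs in which the case
    scores higher than the control; tied pairs count as half.
    """
    if not case_scores or not control_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical toy marker scores, NOT values from the study.
cases = [0.9, 0.8, 0.6, 0.4]
controls = [0.7, 0.5, 0.3, 0.2]

# 13 of the 16 case-control pairs are ordered correctly.
print(auc(cases, controls))  # 0.8125
```

    A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation, which gives context for reported values near 0.71 to 0.72.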