Virginia Tech
    • Log in
    View Item 
    •   VTechWorks Home
    • Biocomplexity Institute of Virginia Tech
    • Faculty Works, Biocomplexity Institute
    • View Item
    •   VTechWorks Home
    • Biocomplexity Institute of Virginia Tech
    • Faculty Works, Biocomplexity Institute
    • View Item
    JavaScript is disabled for your browser. Some features of this site may not work without it.

    Systematic Characterizations of Text Similarity in Full Text Biomedical Publications

    Thumbnail
    View/Open
    journal_pone_0012704.pdf (285.9Kb)
    Downloads: 153
    Date
    2010-09-15
    Author
    Sun, Zhaohul
    Errami, Mounir
    Long, Tara
    Renard, Chris
    Choradia, Nishant
    Garner, Harold
    Metadata
    Show full item record
    Abstract
    Background: Computational methods have been used to find duplicate biomedical publications in MEDLINE. Full text articles are becoming increasingly available, yet the similarities among them have not been systematically studied. Here, we quantitatively investigated the full text similarity of biomedical publications in PubMed Central. Methodology/Principal Findings: 72,011 full text articles from PubMed Central (PMC) were parsed to generate three different datasets: full texts, sections, and paragraphs. Text similarity comparisons were performed on these datasets using the text similarity algorithm eTBLAST. We measured the frequency of similar text pairs and compared it among different datasets. We found that high abstract similarity can be used to predict high full text similarity with a specificity of 20.1% (95% CI [17.3%, 23.1%]) and sensitivity of 99.999%. Abstract similarity and full text similarity have a moderate correlation (Pearson correlation coefficient: 20.423) when the similarity ratio is above 0.4. Among pairs of articles in PMC, method sections are found to be the most repetitive (frequency of similar pairs, methods: 0.029, introduction: 0.0076, results: 0.0043). In contrast, among a set of manually verified duplicate articles, results are the most repetitive sections (frequency of similar pairs, results: 0.94, methods: 0.89, introduction: 0.82). Repetition of introduction and methods sections is more likely to be committed by the same authors (odds of a highly similar pair having at least one shared author, introduction: 2.31, methods: 1.83, results: 1.03). There is also significantly more similarity in pairs of review articles than in pairs containing one review and one nonreview paper (frequency of similar pairs: 0.0167 and 0.0023, respectively). Conclusion/Significance: While quantifying abstract similarity is an effective approach for finding duplicate citations, a comprehensive full text analysis is necessary to uncover all potential duplicate citations in the scientific literature and is helpful when establishing ethical guidelines for scientific publications.
    URI
    http://hdl.handle.net/10919/48980
    Collections
    • Faculty Works, Biocomplexity Institute [254]

    If you believe that any material in VTechWorks should be removed, please see our policy and procedure for Requesting that Material be Amended or Removed. All takedown requests will be promptly acknowledged and investigated.

    Virginia Tech | University Libraries | Contact Us
     

     

    VTechWorks

    AboutPoliciesHelp

    Browse

    All of VTechWorksCommunities & CollectionsBy Issue DateAuthorsTitlesSubjectsThis CollectionBy Issue DateAuthorsTitlesSubjects

    My Account

    Log inRegister

    Statistics

    View Usage Statistics

    If you believe that any material in VTechWorks should be removed, please see our policy and procedure for Requesting that Material be Amended or Removed. All takedown requests will be promptly acknowledged and investigated.

    Virginia Tech | University Libraries | Contact Us