Systematic Characterizations of Text Similarity in Full Text Biomedical Publications

dc.contributor.author | Sun, Zhaohui | en
dc.contributor.author | Errami, Mounir | en
dc.contributor.author | Long, Tara C. | en
dc.contributor.author | Renard, Chris | en
dc.contributor.author | Choradia, Nishant | en
dc.contributor.author | Garner, Harold R. | en
dc.date.accessed | 2014-05-07 | en
dc.date.accessioned | 2014-06-17T20:12:05Z | en
dc.date.available | 2014-06-17T20:12:05Z | en
dc.date.issued | 2010-09-15 | en
dc.description.abstract | Background: Computational methods have been used to find duplicate biomedical publications in MEDLINE. Full text articles are becoming increasingly available, yet the similarities among them have not been systematically studied. Here, we quantitatively investigated the full text similarity of biomedical publications in PubMed Central. Methodology/Principal Findings: 72,011 full text articles from PubMed Central (PMC) were parsed to generate three different datasets: full texts, sections, and paragraphs. Text similarity comparisons were performed on these datasets using the text similarity algorithm eTBLAST. We measured the frequency of similar text pairs and compared it among different datasets. We found that high abstract similarity can be used to predict high full text similarity with a specificity of 20.1% (95% CI [17.3%, 23.1%]) and sensitivity of 99.999%. Abstract similarity and full text similarity have a moderate correlation (Pearson correlation coefficient: -0.423) when the similarity ratio is above 0.4. Among pairs of articles in PMC, method sections are found to be the most repetitive (frequency of similar pairs, methods: 0.029, introduction: 0.0076, results: 0.0043). In contrast, among a set of manually verified duplicate articles, results are the most repetitive sections (frequency of similar pairs, results: 0.94, methods: 0.89, introduction: 0.82). Repetition of introduction and methods sections is more likely to be committed by the same authors (odds of a highly similar pair having at least one shared author, introduction: 2.31, methods: 1.83, results: 1.03). There is also significantly more similarity in pairs of review articles than in pairs containing one review and one nonreview paper (frequency of similar pairs: 0.0167 and 0.0023, respectively).
Conclusion/Significance: While quantifying abstract similarity is an effective approach for finding duplicate citations, a comprehensive full text analysis is necessary to uncover all potential duplicate citations in the scientific literature and is helpful when establishing ethical guidelines for scientific publications. | en
dc.description.sponsorship | The work was supported by the Hudson Foundation and the National Institutes of Health/National Library of Medicine (R01 grant number LM009758-01). The funders had no role in the design and conduct of the study, in the collection, analysis, and interpretation of the data, or in the preparation, review, and approval of the manuscript. | en
dc.identifier.citation | Sun Z, Errami M, Long T, Renard C, Choradia N, et al. (2010) Systematic Characterizations of Text Similarity in Full Text Biomedical Publications. PLoS ONE 5(9): e12704. doi:10.1371/journal.pone.0012704 | en
dc.identifier.doi | https://doi.org/10.1371/journal.pone.0012704 | en
dc.identifier.issn | 1932-6203 | en
dc.identifier.uri | http://hdl.handle.net/10919/48980 | en
dc.identifier.url | http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0012704 | en
dc.language.iso | en_US | en
dc.publisher | Public Library of Science | en
dc.rights | In Copyright | en
dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | en
dc.subject | Archives | en
dc.subject | Database searching | en
dc.subject | electronics | en
dc.subject | information retrieval | en
dc.subject | Linear regression analysis | en
dc.subject | Physical sciences | en
dc.subject | Publication ethics | en
dc.subject | Publication practices | en
dc.title | Systematic Characterizations of Text Similarity in Full Text Biomedical Publications | en
dc.title.serial | PLoS ONE | en
dc.type | Article - Refereed | en

Files

Original bundle
Name: journal_pone_0012704.pdf
Size: 285.98 KB
Format: Adobe Portable Document Format