LINflow: a computational pipeline that combines an alignment-free with an alignment-based method to accelerate generation of similarity matrices for prokaryotic genomes

dc.contributor.author: Tian, Long
dc.contributor.author: Mazloom, Reza
dc.contributor.author: Heath, Lenwood S.
dc.contributor.author: Vinatzer, Boris A.
dc.contributor.department: School of Plant and Environmental Sciences
dc.contributor.department: Computer Science
dc.date.accessioned: 2021-04-20T19:39:27Z
dc.date.available: 2021-04-20T19:39:27Z
dc.date.issued: 2021-03-24
dc.description.abstract: Background: Computing genomic similarity between strains is a prerequisite for genome-based prokaryotic classification and identification. Genomic similarity was first computed as Average Nucleotide Identity (ANI) values based on the alignment of genomic fragments. Because this is computationally expensive, faster and cheaper alignment-free methods have been developed to estimate ANI. However, these methods do not reach the accuracy of alignment-based methods. Methods: Here we introduce LINflow, a computational pipeline that infers pairwise genomic similarity in a set of genomes. LINflow uses the speed of the alignment-free sourmash tool to identify the genome in a dataset that is most similar to a query genome, and the precision of the alignment-based pyani software to compute ANI between the query genome and the most similar genome identified by sourmash. This is repeated for each new genome that is added to a dataset. The sequentially computed ANI values are stored as Life Identification Numbers (LINs), which are then used to infer all other pairwise ANI values in the set. We tested LINflow on four sets, 484 genomes in total, and compared its run time and the generated similarity matrices with those of other tools. Results: LINflow is up to 150 times faster than pyani, and pairwise ANI values generated by LINflow are highly correlated with those computed by pyani. However, because LINflow infers most pairwise ANI values instead of computing them directly, its ANI values occasionally depart from those computed by pyani. In conclusion, LINflow is a fast and memory-efficient pipeline to infer similarity among a large set of prokaryotic genomes. Its ability to quickly add new genome sequences to an already computed similarity matrix makes LINflow particularly useful for projects in which new genome sequences need to be regularly added to an existing dataset.
dc.description.notes: This study was supported by the National Science Foundation (IOS-1354215) and the College of Agriculture and Life Sciences at Virginia Polytechnic Institute and State University. Funding to Boris A. Vinatzer was also provided in part by the Virginia Agricultural Experiment Station and the Hatch Program of the National Institute of Food and Agriculture, US Department of Agriculture. There was no additional external funding received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
dc.description.sponsorship: National Science Foundation (NSF) [IOS-1354215]; College of Agriculture and Life Sciences at Virginia Polytechnic Institute and State University; Virginia Agricultural Experiment Station; Hatch Program of the National Institute of Food and Agriculture, US Department of Agriculture
dc.format.mimetype: application/pdf
dc.identifier.doi: https://doi.org/10.7717/peerj.10906
dc.identifier.issn: 2167-8359
dc.identifier.other: e10906
dc.identifier.pmid: 33828908
dc.identifier.uri: http://hdl.handle.net/10919/103065
dc.identifier.volume: 9
dc.language.iso: en
dc.rights: Creative Commons Attribution 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: Prokaryotes
dc.subject: Genome-based taxonomy
dc.subject: Average nucleotide identity
dc.subject: Genomic similarity
dc.subject: Comparative genomics
dc.subject: Phylogenomics
dc.title: LINflow: a computational pipeline that combines an alignment-free with an alignment-based method to accelerate generation of similarity matrices for prokaryotic genomes
dc.title.serial: PeerJ
dc.type: Article - Refereed
dc.type.dcmitype: Text
dc.type.dcmitype: StillImage
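
The abstract states that ANI values stored as LINs are used to infer the remaining pairwise ANI values. The sketch below illustrates one way such an inference could work. It assumes that each LIN position corresponds to an ANI threshold and that two genomes sharing a LIN prefix through a given position have an ANI of at least that position's threshold. The threshold values and the function names (`shared_prefix_length`, `inferred_ani_floor`) are hypothetical and chosen for illustration; they are not taken from the LINflow code.

```python
from typing import Optional, Sequence

# Hypothetical ANI thresholds (percent), one per LIN position.
# These values are illustrative assumptions, not LINflow's actual scheme.
THRESHOLDS = [70.0, 80.0, 90.0, 95.0, 99.0]


def shared_prefix_length(lin_a: Sequence[int], lin_b: Sequence[int]) -> int:
    """Count the leading positions at which two LINs are identical."""
    n = 0
    for a, b in zip(lin_a, lin_b):
        if a != b:
            break
        n += 1
    return n


def inferred_ani_floor(
    lin_a: Sequence[int],
    lin_b: Sequence[int],
    thresholds: Sequence[float] = THRESHOLDS,
) -> Optional[float]:
    """Return the threshold of the last shared LIN position.

    Under the assumption above, this is a lower bound on the ANI of the
    two genomes. Returns None when the LINs differ at the first position,
    meaning the ANI is below the lowest threshold.
    """
    k = shared_prefix_length(lin_a, lin_b)
    if k == 0:
        return None
    return thresholds[min(k, len(thresholds)) - 1]


if __name__ == "__main__":
    # The two LINs agree on the first three positions and differ at the fourth.
    print(inferred_ani_floor([0, 0, 1, 0, 0], [0, 0, 1, 2, 0]))
```

Because the inferred value is only as fine-grained as the thresholds, a scheme like this trades precision for speed, which is consistent with the abstract's note that inferred values occasionally depart from those computed directly by pyani.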

Files

Original bundle
Name: peerj-10906.pdf
Size: 2.58 MB
Format: Adobe Portable Document Format
Description: Published version