LINflow: a computational pipeline that combines an alignment-free with an alignment-based method to accelerate generation of similarity matrices for prokaryotic genomes

Tian, Long; Mazloom, Reza; Heath, Lenwood S.; Vinatzer, Boris A.

dc.contributor.author	Tian, Long	en
dc.contributor.author	Mazloom, Reza	en
dc.contributor.author	Heath, Lenwood S.	en
dc.contributor.author	Vinatzer, Boris A.	en
dc.contributor.department	School of Plant and Environmental Sciences	en
dc.contributor.department	Computer Science	en
dc.date.accessioned	2021-04-20T19:39:27Z	en
dc.date.available	2021-04-20T19:39:27Z	en
dc.date.issued	2021-03-24	en
dc.description.abstract	Background: Computing genomic similarity between strains is a prerequisite for genome-based prokaryotic classification and identification. Genomic similarity was first computed as Average Nucleotide Identity (ANI) values based on the alignment of genomic fragments. Since this is computationally expensive, faster and computationally cheaper alignment-free methods have been developed to estimate ANI. However, these methods do not reach the level of accuracy of alignment-based methods. Methods: Here we introduce LINflow, a computational pipeline that infers pairwise genomic similarity in a set of genomes. LINflow takes advantage of the speed of the alignment-free sourmash tool to identify the genome in a dataset that is most similar to a query genome and the precision of the alignment-based pyani software to precisely compute ANI between the query genome and the most similar genome identified by sourmash. This is repeated for each new genome that is added to a dataset. The sequentially computed ANI values are stored as Life Identification Numbers (LINs), which are then used to infer all other pairwise ANI values in the set. We tested LINflow on four sets, 484 genomes in total, and compared the needed time and the generated similarity matrices with other tools. Results: LINflow is up to 150 times faster than pyani and pairwise ANI values generated by LINflow are highly correlated with those computed by pyani. However, because LINflow infers most pairwise ANI values instead of computing them directly, ANI values occasionally depart from the ANI values computed by pyani. In conclusion, LINflow is a fast and memory-efficient pipeline to infer similarity among a large set of prokaryotic genomes. Its ability to quickly add new genome sequences to an already computed similarity matrix makes LINflow particularly useful for projects when new genome sequences need to be regularly added to an existing dataset.	en
dc.description.notes	This study was supported by the National Science Foundation (IOS-1354215) and the College of Agriculture and Life Sciences at Virginia Polytechnic Institute and State University. Funding to Boris A. Vinatzer was also provided in part by the Virginia Agricultural Experiment Station and the Hatch Program of the National Institute of Food and Agriculture, US Department of Agriculture. There was no additional external funding received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.	en
dc.description.sponsorship	National Science Foundation (NSF) [IOS-1354215]; College of Agriculture and Life Sciences at Virginia Polytechnic Institute and State University; Virginia Agricultural Experiment Station; Hatch Program of the National Institute of Food and Agriculture, US Department of Agriculture	en
dc.format.mimetype	application/pdf	en
dc.identifier.doi	https://doi.org/10.7717/peerj.10906	en
dc.identifier.issn	2167-8359	en
dc.identifier.other	e10906	en
dc.identifier.pmid	33828908	en
dc.identifier.uri	http://hdl.handle.net/10919/103065	en
dc.identifier.volume	9	en
dc.language.iso	en	en
dc.rights	Creative Commons Attribution 4.0 International	en
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	en
dc.subject	Prokaryotes	en
dc.subject	Genome-based taxonomy	en
dc.subject	Average nucleotide identity	en
dc.subject	Genomic similarity	en
dc.subject	Comparative genomics	en
dc.subject	Phylogenomics	en
dc.title	LINflow: a computational pipeline that combines an alignment-free with an alignment-based method to accelerate generation of similarity matrices for prokaryotic genomes	en
dc.title.serial	PeerJ	en
dc.type	Article - Refereed	en
dc.type.dcmitype	Text	en
dc.type.dcmitype	StillImage	en
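
The abstract describes storing sequentially computed ANI values as Life Identification Numbers (LINs) and inferring the remaining pairwise values from them. The sketch below illustrates one way such an inference could work. It is not LINflow's implementation: the threshold list, the function names, and the rule for numbering new positions are all assumptions made for illustration, and the sourmash and pyani steps are replaced by an ANI value passed in directly.

```python
from typing import List, Tuple

# Hypothetical ANI thresholds (percent), one per LIN position, ascending.
# These values are illustrative and are not taken from the record above.
THRESHOLDS = [70.0, 80.0, 90.0, 95.0, 99.0, 99.9]

Lin = Tuple[int, ...]


def first_lin() -> Lin:
    """LIN for the first genome in an empty dataset: all positions zero."""
    return (0,) * len(THRESHOLDS)


def assign_lin(existing: List[Lin], best_match: Lin, ani: float) -> Lin:
    """Assign a LIN to a new genome from its most similar genome and their ANI.

    The new genome copies the best match's LIN up to the last position whose
    threshold the ANI meets, then takes the next unused number at the
    following position and zeros afterwards.
    """
    k = sum(1 for t in THRESHOLDS if ani >= t)  # number of thresholds met
    if k == len(THRESHOLDS):
        return tuple(best_match)  # indistinguishable at every threshold
    prefix = tuple(best_match[:k])
    used = [lin[k] for lin in existing if tuple(lin[:k]) == prefix]
    next_value = max(used) + 1 if used else 0
    return prefix + (next_value,) + (0,) * (len(THRESHOLDS) - k - 1)


def inferred_similarity(lin_a: Lin, lin_b: Lin) -> float:
    """Infer a similarity level from the shared LIN prefix.

    Returns the threshold of the last shared position, or 0.0 if the LINs
    differ at the first position.
    """
    shared = 0
    for a, b in zip(lin_a, lin_b):
        if a != b:
            break
        shared += 1
    return THRESHOLDS[shared - 1] if shared else 0.0


# Usage: add three genomes one at a time, computing ANI only once per genome.
lins = [first_lin()]
lins.append(assign_lin(lins, lins[0], ani=96.0))   # best match: genome 0
lins.append(assign_lin(lins, lins[1], ani=99.5))   # best match: genome 1
```

In this sketch only one ANI value is supplied per added genome, and the similarity between genome 0 and genome 2 is read from their shared LIN prefix rather than computed. That also shows why inferred values can differ from directly computed ones: the prefix yields a threshold level, not an exact ANI.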

Files

Original bundle

Name: peerj-10906.pdf
Size: 2.58 MB
Format: Adobe Portable Document Format
Description: Published version

Collections

Scholarly Works, School of Plant and Environmental Sciences
Scholarly Works, Computer Science