Scalable and Maintainable Distributed Sequence Alignment Using Spark

Youssef, Karim; Elnady, Yusuf; Tilevich, Eli; Feng, Wu-chun

dc.contributor.author	Youssef, Karim	en
dc.contributor.author	Elnady, Yusuf	en
dc.contributor.author	Tilevich, Eli	en
dc.contributor.author	Feng, Wu-chun	en
dc.date.accessioned	2026-03-13T12:40:02Z	en
dc.date.available	2026-03-13T12:40:02Z	en
dc.date.issued	2025-07	en
dc.description.abstract	The exponential growth of genomic data presents a challenge to bioinformatics research. NCBI BLAST, a popular pairwise sequence alignment tool, does not scale with the hundreds of gigabytes (GB) of sequenced data. Therefore, mpiBLAST was widely adopted and scaled up to 65,536 processors. However, mpiBLAST is tightly coupled with an obsolete NCBI BLAST version, creating a challenge to upgrading mpiBLAST with the ever-changing NCBI BLAST code. Recent parallel BLAST implementations, like SparkBLAST, use parallelism wrappers separate from NCBI BLAST to overcome this issue. However, query partitioning, a parallel method that duplicates the genome database on each compute node, makes SparkBLAST scale poorly with databases larger than a single node's memory. Thus, no parallel BLAST utility simultaneously addresses performance, scalability, and software maintainability. To fill this gap, we introduce SparkLeBLAST, a parallel BLAST tool that uses the Spark framework and efficient data partitioning to combine mpiBLAST's performance and scalability with SparkBLAST's simplicity and maintainability. SparkLeBLAST democratizes scalable genomic analysis for domain scientists without extensive distributed computing experience. SparkLeBLAST runs up to 6.68× faster than SparkBLAST. SparkLeBLAST also accelerates taxonomic assignment of COVID-19 genomic diversity analysis by 20.9× as it speeds up the BLAST search component by 88.6× using 128 compute nodes.	en
dc.description.version	Submitted version	en
dc.format.extent	Pages 1388-1400	en
dc.format.extent	13 page(s)	en
dc.format.mimetype	application/pdf	en
dc.identifier	4 (Article number)	en
dc.identifier.doi	https://doi.org/10.1109/TCBBIO.2025.3565188	en
dc.identifier.eissn	2998-4165	en
dc.identifier.issn	2998-4165	en
dc.identifier.issue	4	en
dc.identifier.orcid	Feng, Wu-Chun [0000-0002-6015-0727]	en
dc.identifier.orcid	Tilevich, Eli [0000-0003-2415-6926]	en
dc.identifier.pmid	40811315	en
dc.identifier.uri	https://hdl.handle.net/10919/142233	en
dc.identifier.volume	22	en
dc.language.iso	en	en
dc.publisher	IEEE	en
dc.relation.uri	https://www.ncbi.nlm.nih.gov/pubmed/40811315	en
dc.rights	In Copyright	en
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/	en
dc.subject	COVID-19	en
dc.subject	BLAST	en
dc.subject	Big Data	en
dc.subject	high-performance computing	en
dc.subject	Spark	en
dc.subject	genomic diversity analysis	en
dc.title	Scalable and Maintainable Distributed Sequence Alignment Using Spark	en
dc.title.serial	IEEE Transactions on Computational Biology and Bioinformatics	en
dc.type	Article	en
dc.type.dcmitype	Text	en
dc.type.other	Article	en
dc.type.other	Journal	en
pubs.organisational-group	Virginia Tech	en
pubs.organisational-group	Virginia Tech/Engineering	en
pubs.organisational-group	Virginia Tech/Engineering/Computer Science	en
pubs.organisational-group	Virginia Tech/Faculty of Health Sciences	en
pubs.organisational-group	Virginia Tech/All T&R Faculty	en
pubs.organisational-group	Virginia Tech/Engineering/COE T&R Faculty	en

Files

Original bundle

Name: TCBB25-SparkLeBLAST.pdf
Size: 6.98 MB
Format: Adobe Portable Document Format
Description: Submitted version

Collections

All Faculty Deposits
Scholarly Works, Computer Science