Scalable and Maintainable Distributed Sequence Alignment Using Spark

Youssef, Karim; Elnady, Yusuf; Tilevich, Eli; Feng, Wu-chun

Files

Submitted version (6.98 MB)

Date

2025-07

Publisher

IEEE

Abstract

The exponential growth of genomic data presents a challenge to bioinformatics research. NCBI BLAST, a popular pairwise sequence alignment tool, does not scale to hundreds of gigabytes (GB) of sequenced data. To address this, mpiBLAST was widely adopted and scaled up to 65,536 processors. However, mpiBLAST is tightly coupled with an obsolete NCBI BLAST version, which makes it difficult to keep mpiBLAST up to date with the ever-changing NCBI BLAST code. Recent parallel BLAST implementations, like SparkBLAST, use parallelism wrappers separate from NCBI BLAST to overcome this issue. However, SparkBLAST relies on query partitioning, a parallelization method that duplicates the genome database on each compute node, so it scales poorly with databases larger than a single node's memory. Thus, no parallel BLAST utility simultaneously addresses performance, scalability, and software maintainability. To fill this gap, we introduce SparkLeBLAST, a parallel BLAST tool that uses the Spark framework and efficient data partitioning to combine mpiBLAST's performance and scalability with SparkBLAST's simplicity and maintainability. SparkLeBLAST democratizes scalable genomic analysis for domain scientists without extensive distributed computing experience. SparkLeBLAST runs up to 6.68× faster than SparkBLAST. In a COVID-19 genomic diversity analysis, SparkLeBLAST also accelerates taxonomic assignment by 20.9× by speeding up the BLAST search component by 88.6× on 128 compute nodes.
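
The scaling argument in the abstract can be illustrated with a rough back-of-the-envelope sketch. It assumes that the "efficient data partitioning" mentioned above means splitting the database across nodes, as mpiBLAST's database segmentation does; the abstract does not spell this out. The function names, the database size, and the node memory figures are hypothetical and are not taken from the paper.

```python
import math
from typing import Optional


def per_node_db_gb(db_size_gb: float, num_nodes: int, strategy: str) -> float:
    """Return how much database data (GB) each node must hold.

    strategy == "query":    every node keeps a full copy of the database
                            and receives a slice of the queries.
    strategy == "database": the database is split evenly across nodes
                            (assumed model, not confirmed by the abstract).
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if strategy == "query":
        return db_size_gb
    if strategy == "database":
        return db_size_gb / num_nodes
    raise ValueError(f"unknown strategy: {strategy}")


def min_nodes_to_fit(db_size_gb: float, node_mem_gb: float,
                     strategy: str) -> Optional[int]:
    """Smallest node count at which each node's share fits in memory.

    Returns None when no node count helps, which is the case for query
    partitioning once the database exceeds a single node's memory.
    """
    if strategy == "query":
        return 1 if db_size_gb <= node_mem_gb else None
    if strategy == "database":
        return max(1, math.ceil(db_size_gb / node_mem_gb))
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical figures: a 512 GB database on 128 nodes with 64 GB each.
full_copy = per_node_db_gb(512, 128, "query")
shard = per_node_db_gb(512, 128, "database")
```

Under these illustrative numbers, query partitioning requires every node to hold the full 512 GB, which never fits in 64 GB regardless of node count, while an even split leaves 4 GB per node.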

Keywords

COVID-19, BLAST, Big Data, high-performance computing, Spark, genomic diversity analysis

Persistent link

https://hdl.handle.net/10919/142233

Collections

All Faculty Deposits
Scholarly Works, Computer Science
