Scalable and Maintainable Distributed Sequence Alignment Using Spark

dc.contributor.author | Youssef, Karim | en
dc.contributor.author | Elnady, Yusuf | en
dc.contributor.author | Tilevich, Eli | en
dc.contributor.author | Feng, Wu-chun | en
dc.date.accessioned | 2026-03-13T12:40:02Z | en
dc.date.available | 2026-03-13T12:40:02Z | en
dc.date.issued | 2025-07 | en
dc.description.abstract | The exponential growth of genomic data presents a challenge to bioinformatics research. NCBI BLAST, a popular pairwise sequence alignment tool, does not scale with the hundreds of gigabytes (GB) of sequenced data. Therefore, mpiBLAST was widely adopted and scaled up to 65,536 processors. However, mpiBLAST is tightly coupled with an obsolete NCBI BLAST version, creating a challenge to upgrading mpiBLAST with the ever-changing NCBI BLAST code. Recent parallel BLAST implementations, like SparkBLAST, use parallelism wrappers separate from NCBI BLAST to overcome this issue. However, query partitioning, a parallel method that duplicates the genome database on each compute node, makes SparkBLAST scale poorly with databases larger than a single node's memory. Thus, no parallel BLAST utility simultaneously addresses performance, scalability, and software maintainability. To fill this gap, we introduce SparkLeBLAST, a parallel BLAST tool that uses the Spark framework and efficient data partitioning to combine mpiBLAST's performance and scalability with SparkBLAST's simplicity and maintainability. SparkLeBLAST democratizes scalable genomic analysis for domain scientists without extensive distributed computing experience. SparkLeBLAST runs up to 6.68× faster than SparkBLAST. SparkLeBLAST also accelerates taxonomic assignment of COVID-19 genomic diversity analysis by 20.9× as it speeds up the BLAST search component by 88.6× using 128 compute nodes. | en
dc.description.version | Submitted version | en
dc.format.extent | Pages 1388-1400 | en
dc.format.extent | 13 page(s) | en
dc.format.mimetype | application/pdf | en
dc.identifier | 4 (Article number) | en
dc.identifier.doi | https://doi.org/10.1109/TCBBIO.2025.3565188 | en
dc.identifier.eissn | 2998-4165 | en
dc.identifier.issn | 2998-4165 | en
dc.identifier.issue | 4 | en
dc.identifier.orcid | Feng, Wu-Chun [0000-0002-6015-0727] | en
dc.identifier.orcid | Tilevich, Eli [0000-0003-2415-6926] | en
dc.identifier.pmid | 40811315 | en
dc.identifier.uri | https://hdl.handle.net/10919/142233 | en
dc.identifier.volume | 22 | en
dc.language.iso | en | en
dc.publisher | IEEE | en
dc.relation.uri | https://www.ncbi.nlm.nih.gov/pubmed/40811315 | en
dc.rights | In Copyright | en
dc.rights.uri | http://rightsstatements.org/vocab/InC/1.0/ | en
dc.subject | COVID-19 | en
dc.subject | BLAST | en
dc.subject | Big Data | en
dc.subject | high-performance computing | en
dc.subject | Spark | en
dc.subject | genomic diversity analysis | en
dc.title | Scalable and Maintainable Distributed Sequence Alignment Using Spark | en
dc.title.serial | IEEE Transactions on Computational Biology and Bioinformatics | en
dc.type | Article | en
dc.type.dcmitype | Text | en
dc.type.other | Article | en
dc.type.other | Journal | en
pubs.organisational-group | Virginia Tech | en
pubs.organisational-group | Virginia Tech/Engineering | en
pubs.organisational-group | Virginia Tech/Engineering/Computer Science | en
pubs.organisational-group | Virginia Tech/Faculty of Health Sciences | en
pubs.organisational-group | Virginia Tech/All T&R Faculty | en
pubs.organisational-group | Virginia Tech/Engineering/COE T&R Faculty | en

Files

Original bundle
Name: TCBB25-SparkLeBLAST.pdf
Size: 6.98 MB
Format: Adobe Portable Document Format
Description: Submitted version
License bundle
Name: license.txt
Size: 1.5 KB
Format: Plain Text
Description: