On the Scalability of Computing Genomic Diversity Using SparkLeBLAST: A Feasibility Study

Date

2024-09

Publisher

IEEE

Abstract

Studying the genomic diversity of viruses can help us understand how viruses evolve and how that evolution can impact human health. Rather than use a laborious and tedious wet-lab approach to conduct a genomic diversity study, we take a computational approach, using the ubiquitous NCBI BLAST and our parallel and distributed SparkLeBLAST, across 53 patients (40,000,000 query sequences) on Fugaku, the world's fastest homogeneous supercomputer with 158,976 nodes, where each node contains a 48-core A64FX processor and 32 GB RAM. To project how long BLAST and SparkLeBLAST would take to complete a genomic diversity study of COVID-19, we first perform a feasibility study on a subset of 50 query sequences from a single COVID-19 patient to identify bottlenecks in sequence alignment processing. We then create a model using Amdahl's law to project the run times of NCBI BLAST and SparkLeBLAST on supercomputing systems like Fugaku. Based on the data from this 50-sequence feasibility study, our model predicts that NCBI BLAST, when running on all the cores of the Fugaku supercomputer, would take approximately 26.7 years to complete the full-scale study. In contrast, SparkLeBLAST, using both our query and database segmentation, would reduce the execution time to 0.0026 years (i.e., 22.9 hours), resulting in more than a 10,000× speedup over the ubiquitous NCBI BLAST.
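
As a rough illustration of the kind of projection described above, the following minimal Python sketch implements the textbook form of Amdahl's law and then checks the arithmetic behind the quoted run times. It is not the paper's model: the function names and the `parallel_fraction` parameter are hypothetical, and the abstract does not state which parallel fraction the authors measured. Only the figures 26.7 years and 22.9 hours are taken from the abstract.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Textbook Amdahl's law: speedup on `workers` workers when only
    `parallel_fraction` of the single-worker run time can be parallelized."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def projected_runtime(single_worker_time: float,
                      parallel_fraction: float,
                      workers: int) -> float:
    """Projected run time, in the same unit as `single_worker_time`."""
    return single_worker_time / amdahl_speedup(parallel_fraction, workers)


# Arithmetic check using only the two run times quoted in the abstract.
blast_years = 26.7           # projected NCBI BLAST run time
sparkleblast_hours = 22.9    # projected SparkLeBLAST run time
sparkleblast_years = sparkleblast_hours / HOURS_PER_YEAR
speedup_ratio = blast_years / sparkleblast_years

if __name__ == "__main__":
    print(f"SparkLeBLAST run time in years: {sparkleblast_years:.4f}")
    print(f"Ratio of projected run times:   {speedup_ratio:,.0f}x")
```

The sketch makes one property of Amdahl's law easy to see: however many workers are added, the speedup can never exceed `1 / (1 - parallel_fraction)`, so any serial portion of the workload bounds what additional nodes can deliver.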

Keywords

NCBI BLAST, SparkLeBLAST, COVID-19, genomic diversity, pairwise sequence search, feasibility, scalability, A64FX CPU, supercomputer

Citation