VTechWorks Repository :: Browsing by Author "Rafique, Muhammad Mustafa"

Browsing by Author "Rafique, Muhammad Mustafa"

Now showing 1 - 4 of 4

An Adaptive Framework for Managing Heterogeneous Many-Core Clusters
Rafique, Muhammad Mustafa (Virginia Tech, 2011-09-22)
The computing needs and the input and result datasets of modern scientific and enterprise applications are growing exponentially. To support such applications, High-Performance Computing (HPC) systems need to employ thousands of cores and innovative data management. At the same time, an emerging trend in designing HPC systems is to leverage specialized asymmetric multicores, such as IBM Cell and AMD Fusion APUs, and commodity computational accelerators, such as programmable GPUs, which exhibit excellent price to performance ratio as well as the much needed high energy efficiency. While such accelerators have been studied in detail as stand-alone computational engines, integrating the accelerators into large-scale distributed systems with heterogeneous computing resources for data-intensive computing presents unique challenges and trade-offs. Traditional programming and resource management techniques cannot be directly applied to many-core accelerators in heterogeneous distributed settings, given the complex and custom instruction sets architectures, memory hierarchies and I/O characteristics of different accelerators. In this dissertation, we explore the design space of using commodity accelerators, specifically IBM Cell and programmable GPUs, in distributed settings for data-intensive computing and propose an adaptive framework for programming and managing heterogeneous clusters. The proposed framework provides a MapReduce-based extended programming model for heterogeneous clusters, which distributes tasks between asymmetric compute nodes by considering workload characteristics and capabilities of individual compute nodes. The framework provides efficient data prefetching techniques that leverage general-purpose cores to stage the input data in the private memories of the specialized cores. We also explore the use of an advanced layered-architecture based software engineering approach and provide mixin-layers based reusable software components to enable easy and quick deployment of heterogeneous clusters. The framework also provides multiple resource management and scheduling policies under different constraints, e.g., energy-aware and QoS-aware, to support executing concurrent applications on multi-tenant heterogeneous clusters. When applied to representative applications and benchmarks, our framework yields significantly improved performance in terms of programming efficiency and optimal resource management as compared to conventional, hand-tuned, approaches to program and manage accelerator-based heterogeneous clusters.
A Software Framework For the Detection and Classification of Biological Targets in Bio-Nano Sensing
Hafeez, Abdul (Virginia Tech, 2014-09-08)
Detection and identification of important biological targets, such as DNA, proteins, and diseased human cells are crucial for early diagnosis and prognosis. The key to discriminate healthy cells from the diseased cells is the biophysical properties that differ radically. Micro and nanosystems, such as solid-state micropores and nanopores can measure and translate these properties of biological targets into electrical spikes to decode useful insights. Nonetheless, such approaches result in sizable data streams that are often plagued with inherit noise and baseline wanders. Moreover, the extant detection approaches are tedious, time-consuming, and error-prone, and there is no error-resilient software that can analyze large data sets instantly. The ability to effectively process and detect biological targets in larger data sets lie in the automated and accelerated data processing strategies using state-of-the-art distributed computing systems. In this dissertation, we design and develop techniques for the detection and classification of biological targets and a distributed detection framework to support data processing from multiple bio-nano devices. In a distributed setup, the collected raw data stream on a server node is split into data segments and distributed across the participating worker nodes. Each node reduces noise in the assigned data segment using moving-average filtering, and detects the electric spikes by comparing them against a statistical threshold (based on the mean and standard deviation of the data), in a Single Program Multiple Data (SPMD) style. Our proposed framework enables the detection of cancer cells in a mixture of cancer cells, red blood cells (RBCs), and white blood cells (WBCs), and achieves a maximum speedup of 6X over a single-node machine by processing 10 gigabytes of raw data using an 8-node cluster in less than a minute, which will otherwise take hours using manual analysis. Diseases such as cancer can be mitigated, if detected and treated at an early stage. Micro and nanoscale devices, such as micropores and nanopores, enable the translocation of biological targets at finer granularity. These devices are tiny orifices in silicon-based membranes, and the output is a current signal, measured in nanoamperes. Solid-state micropore is capable of electrically measuring the biophysical properties of human cells, when a blood sample is passed through it. The passage of cells via such pores results in an interesting pattern (pulse) in the baseline current, which can be measured at a very high rate, such as 500,000 samples per second, and even higher resolution. The pulse is essentially a sequence of temporal data samples that abruptly falls below and then reverts back to a normal baseline with an acceptable predefined time interval, i.e., pulse width. The pulse features, such as width and amplitude, correspond to the translocation behavior and the extent to which the pore is blocked, under a constant potential. These features are crucial in discriminating the diseased cells from healthy cells, such as identifying cancer cells in a mixture of cells.
Towards a Resource Efficient Framework for Distributed Deep Learning Applications
Han, Jingoo (Virginia Tech, 2022-08-24)
Distributed deep learning has achieved tremendous success for solving scientific problems in research and discovery over the past years. Deep learning training is quite challenging because it requires training on large-scale massive dataset, especially with graphics processing units (GPUs) in latest high-performance computing (HPC) supercomputing systems. HPC architectures bring different performance trends in training throughput compared to the existing studies. Multiple GPUs and high-speed interconnect are used for distributed deep learning on HPC systems. Extant distributed deep learning systems are designed for non-HPC systems without considering efficiency, leading to under-utilization of expensive HPC hardware. In addition, increasing resource heterogeneity has a negative effect on resource efficiency in distributed deep learning methods including federated learning. Thus, it is important to focus on an increasing demand for both high performance and high resource efficiency for distributed deep learning systems, including latest HPC systems and federated learning systems. In this dissertation, we explore and design novel methods and frameworks to improve resource efficiency of distributed deep learning training. We address the following five important topics: performance analysis on deep learning for supercomputers, GPU-aware deep learning job scheduling, topology-aware virtual GPU training, heterogeneity-aware adaptive scheduling, and token-based incentive algorithm. In the first chapter (Chapter 3), we explore and focus on analyzing performance trend of distributed deep learning on latest HPC systems such as Summitdev supercomputer at Oak Ridge National Laboratory. We provide insights by conducting a comprehensive performance study on how deep learning workloads have effects on the performance of HPC systems with large-scale parallel processing capabilities. In the second part (Chapter 4), we design and develop a novel deep learning job scheduler MARBLE, which considers efficiency of GPU resource based on non-linear scalability of GPUs in a single node and improves GPU utilization by sharing GPUs with multiple deep learning training workloads. The third part of this dissertation (Chapter 5) proposes topology-aware virtual GPU training systems TOPAZ, specifically designed for distributed deep learning on recent HPC systems. In the fourth chapter (Chapter 6), we conduct exploration on an innovative holistic federated learning scheduling that employs a heterogeneity-aware adaptive selection method for improving resource efficiency and accuracy performance, coupled with resource usage profiling and accuracy monitoring to achieve multiple goals. In the fifth part of this dissertation (Chapter 7), we are focused on how to provide incentives to participants according to contribution for reaching high performance of final federated model, while tokens are used as a means of paying for the services of providing participants and the training infrastructure.
Towards SLO-aware Resource Scheduling for Serverless Inference Workloads
Tripathy, Abhijit (Virginia Tech, 2023-08-08)
The rapid advancement of Machine Learning (ML) and Deep Learning (DL) has revolutionized various domains, necessitating efficient and cost-effective ML inference capabilities. Function-as-a-Service (FaaS) has emerged as a promising approach for hosting ML inference services, providing a serverless computing environment that streamlines development cycles and offers scalability and simplified infrastructure management. However, existing autoscaling strategies employed by popular FaaS platforms often overlook critical factors such as response time and tail latency. Additionally, Python's Global Interpreter Lock (GIL) poses challenges for parallel computing in high-request traffic scenarios. This thesis addresses the need for efficient and cost-effective Machine Learning (ML) inference capabilities by exploring batching and autoscaling strategies for Serverless Inference instances. The study proposes a prototype FaaS framework that provides adaptive request batching, reactive autoscaling policies, and SLO monitoring, thus allowing Serverless Inference workloads to meet their SLO targets even during peak traffic. The proposed approach aims to optimize resource utilization, mitigate tail latency, and improve overall system performance.

Browsing by Author "Rafique, Muhammad Mustafa"

Results Per Page

Sort Options