Browsing by Author "Balaji, Pavan"
- An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multicore Environments. Narayanaswamy, Ganesh; Balaji, Pavan; Feng, Wu-chun (Department of Computer Science, Virginia Polytechnic Institute & State University, 2007). This paper analyzes the interactions between the protocol stack (TCP/IP or iWARP over 10-Gigabit Ethernet) and its multicore environment. Specifically, for host-based protocols such as TCP/IP, we notice that a significant amount of processing is statically assigned to a single core, resulting in an imbalance of load on the different cores of the system and adversely impacting the performance of many applications. For host-offloaded protocols such as iWARP, on the other hand, the portions of the communication stack that are performed on the host, such as buffering of messages and memory copies, are closely tied to the associated process, and hence do not create such load imbalances. Thus, in this paper, we demonstrate that by intelligently mapping different processes of an application to specific cores, the imbalance created by the TCP/IP protocol stack can be largely countered and application performance significantly improved. At the same time, since the load is better balanced in host-offloaded protocols such as iWARP, such mapping does not adversely affect their performance, thus keeping the mapping generic enough to be used with multiple protocol stacks.
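The core-mapping idea in this abstract can be sketched as a small placement heuristic. This is a hypothetical illustration, not the authors' actual tool: it assumes the TCP/IP stack pins its protocol processing to one known core (core 0 here) and places the most communication-heavy ranks on the other cores first.

```python
# Illustrative sketch: map communication-heavy ranks away from the core
# that also absorbs statically assigned TCP/IP protocol processing.
# map_ranks_to_cores and comm_intensity are assumed names, not the paper's API.

def map_ranks_to_cores(comm_intensity, num_cores, protocol_core=0):
    """Assign rank -> core so the heaviest communicators avoid the
    core that also runs protocol processing."""
    # Rank ids ordered by communication intensity, heaviest first.
    ranks = sorted(range(len(comm_intensity)),
                   key=lambda r: comm_intensity[r], reverse=True)
    # Cores ordered so the protocol-processing core is filled last.
    cores = [c for c in range(num_cores) if c != protocol_core] + [protocol_core]
    return {rank: cores[i % num_cores] for i, rank in enumerate(ranks)}
```

With four ranks of intensities `[10, 1, 5, 2]` on four cores, the heaviest rank lands on core 1 and only the lightest rank shares core 0 with the protocol stack.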
- Cognizant Networks: A Model and Framework for Session-based Communications and Adaptive Networking. Kalim, Umar (Virginia Tech, 2017-08-09). The Internet has made tremendous progress since its inception. The kingpin has been the transmission control protocol (TCP), which supports a large fraction of communication. With the Internet's widespread access, users now have increased expectations. The demands have evolved to an extent that TCP was never designed to support. Since network stacks do not provide the necessary functionality for modern applications, developers are forced to implement it over and over again --- as part of the application or supporting libraries. Consequently, application developers not only bear the burden of developing application features but are also responsible for building networking libraries to support sophisticated scenarios. This leads to considerable duplication of effort. The challenge for TCP in supporting modern use cases is mostly due to limiting assumptions, simplistic communication abstractions, and (once expedient) implementation shortcuts. To further add to the complexity, the limited TCP options space is insufficient to support extensibility and thus contemporary communication patterns. Some argue that radical changes are required to extend the network's functionality; some researchers believe that a clean-slate approach is the only path forward. Others suggest that evolution of the network stack is necessary to ensure wider adoption --- by avoiding a flag day. In either case, we see that the proposed solutions have not been adopted by the community at large. This is perhaps because the cost of transition from the incumbent to the new technology outweighs the value offered. In some cases, the limited scope of the proposed solutions limits their value. In other cases, the lack of backward compatibility or significant porting effort precludes incremental adoption altogether.
In this dissertation, we focus on the development of a communication model that explicitly acknowledges the context of the conversation and describes (much of) modern communications. We highlight how the communication stack should be able to discover, interact with, and use available resources to compose richer communication constructs. The model is able to do so by using session, flow, and endpoint abstractions to describe communications between two or more endpoints. These abstractions give application developers the means to set up and manipulate constructs, while the ability to recognize changes in the operating context and reconfigure the constructs allows applications to adapt to changing requirements. The model considers two or more participants to be involved in the conversation and thus enables most modern communication patterns, in contrast with the well-established two-participant model. Our contributions also include an implementation of a framework that realizes such communication methods and enables future innovation. We substantiate our claims by demonstrating case studies where we use the proposed abstractions to highlight the gains. We also show how the proposed model may be implemented in a backwards-compatible manner, such that it does not break legacy applications, network stacks, or middleboxes in the network infrastructure. We also present use cases to substantiate our claims about backwards compatibility. This establishes that incremental evolution is possible. We highlight the benefits of context awareness in setting up complex communication constructs by presenting use cases and their evaluation. Finally, we show how the communication model may open the door for new and richer communication patterns.
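The session, flow, and endpoint abstractions described above can be pictured with a minimal object model. This sketch is an assumption-laden illustration (class and method names are invented, not the dissertation's actual API): a session groups two or more endpoints and the flows between them, and can be reconfigured in place when the operating context changes.

```python
# Hypothetical sketch of session/flow/endpoint abstractions; not the real API.

class Endpoint:
    def __init__(self, address):
        self.address = address

class Flow:
    """A unidirectional data flow between two endpoints of a session."""
    def __init__(self, src, dst):
        self.src, self.dst = src, dst

class Session:
    """Groups endpoints and flows so communication constructs can be
    set up, inspected, and reconfigured as the context changes."""
    def __init__(self, *endpoints):
        if len(endpoints) < 2:
            raise ValueError("a session needs two or more participants")
        self.endpoints = list(endpoints)
        self.flows = []

    def add_flow(self, src, dst):
        flow = Flow(src, dst)
        self.flows.append(flow)
        return flow

    def migrate(self, old, new):
        """Swap an endpoint (e.g., after a network change) and repoint its
        flows -- adaptation without tearing the session down."""
        self.endpoints[self.endpoints.index(old)] = new
        for f in self.flows:
            if f.src is old: f.src = new
            if f.dst is old: f.dst = new
```

The multi-participant constructor reflects the model's departure from the two-participant assumption, and `migrate` illustrates reconfiguration under a changing context.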
- Generalizing the Utility of Graphics Processing Units in Large-Scale Heterogeneous Computing Systems. Xiao, Shucai (Virginia Tech, 2013-07-03). Today, heterogeneous computing systems are widely used to meet the increasing demand for high-performance computing. These systems commonly use powerful and energy-efficient accelerators to augment general-purpose processors (i.e., CPUs). The graphics processing unit (GPU) is one such accelerator. Originally designed solely for graphics processing, GPUs have evolved into programmable processors that can deliver massive parallel processing power for general-purpose applications. Using SIMD (Single Instruction, Multiple Data) based components as building units, the current GPU architecture is well suited for data-parallel applications where the execution of each task is independent. With the delivery of programming models such as Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL), programming GPUs has become much easier than before. However, developing and optimizing an application on a GPU is still a challenging task, even for well-trained computing experts. Such programming tasks will be even more challenging in large-scale heterogeneous systems, particularly in the context of utility computing, where GPU resources are used as a service. These challenges are largely due to the limitations in the current programming models: (1) there are no intra- and inter-GPU cooperative mechanisms that are natively supported; (2) current programming models only support the utilization of GPUs installed locally; and (3) to use GPUs on another node, application programs need to explicitly call application programming interface (API) functions for data communication. To reduce the mapping efforts and to better utilize the GPU resources, we investigate generalizing the utility of GPUs in large-scale heterogeneous systems with GPUs as accelerators.
We generalize the utility of GPUs through the transparent virtualization of GPUs, which can enable applications to view all GPUs in the system as if they were installed locally. As a result, all GPUs in the system can be used as local GPUs. Moreover, GPU virtualization is a key capability to support the notion of "GPU as a service." Specifically, we propose the virtual OpenCL (or VOCL) framework for the transparent virtualization of GPUs. To achieve good performance, we optimize and extend the framework in three aspects: (1) optimize VOCL by reducing the data transfer overhead between the local node and remote node; (2) propose GPU synchronization to reduce the overhead of switching back and forth if multiple kernel launches are needed for data communication across different compute units on a GPU; and (3) extend VOCL to support live virtual GPU migration for quick system maintenance and load rebalancing across GPUs. With the above optimizations and extensions, we thoroughly evaluate VOCL along three dimensions: (1) show the performance improvement for each of our optimization strategies; (2) evaluate the overhead of using remote GPUs via several microbenchmark suites as well as a few real-world applications; and (3) demonstrate the overhead as well as the benefit of live virtual GPU migration. Our experimental results indicate that VOCL can generalize the utility of GPUs in large-scale systems at a reasonable virtualization and migration cost.
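The transparent-virtualization idea can be sketched with a thin layer that flattens local and remote GPUs into one device table, so applications see every GPU as if installed locally. This is an illustration of the concept only; the class and method names are assumptions, not VOCL's real interface, and a real implementation would forward calls over the network rather than just record the target.

```python
# Conceptual sketch of GPU virtualization: enumerate GPUs across nodes
# and present them behind one local-looking device table.
# Node, VirtualDeviceLayer, and launch are illustrative names.

class Node:
    def __init__(self, name, gpus):
        self.name, self.gpus = name, gpus

class VirtualDeviceLayer:
    """Flattens per-node GPU lists into a single virtual device table."""
    def __init__(self, local, remotes):
        self.table = []  # virtual device id -> (node, physical gpu id)
        for node in [local] + remotes:
            for gpu in node.gpus:
                self.table.append((node, gpu))

    def num_devices(self):
        return len(self.table)

    def launch(self, device_id, kernel):
        node, gpu = self.table[device_id]
        # A real layer would ship the kernel launch to remote nodes over
        # the network; here we only resolve where it would run.
        return (node.name, gpu, kernel)
```

Under this view, optimizations like reduced transfer overhead and live virtual-GPU migration amount to changing how (and where) entries in the table are serviced, without the application noticing.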
- GePSeA: A General-Purpose Software Acceleration Framework for Lightweight Task Offloading. Singh, Ajeet (Virginia Tech, 2009-07-14). Hardware-acceleration techniques continue to be used to boost the performance of scientific codes. To do so, software developers identify portions of these codes that are amenable to offloading and map them to hardware accelerators. However, offloading such tasks to specialized hardware accelerators is non-trivial. Furthermore, these accelerators can add significant cost to a computing system. Consequently, this thesis proposes a framework called GePSeA (General Purpose Software Acceleration Framework), which uses a small fraction of the computational power on multi-core architectures to offload complex application-specific tasks. Specifically, GePSeA provides a lightweight process that acts as a helper agent to the application by executing application-specific tasks asynchronously and efficiently. GePSeA is not meant to replace hardware accelerators but to extend them. GePSeA provides several utilities called core components that offload tasks onto the core or onto special-purpose hardware when available, in a way that is transparent to the application. Examples of such core components include reliable communication service, distributed lock management, global memory management, dynamic load distribution, and network protocol processing. We then apply the GePSeA framework to two applications, namely mpiBLAST, an open-source computational biology application, and a Reliable Blast UDP (RBUDP) based file transfer application. We observe significant speed-up for both applications.
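The helper-agent pattern described in this abstract can be illustrated with a worker that drains application-specific tasks asynchronously while the application continues. GePSeA's agent is a lightweight process on a dedicated core; this sketch uses a thread plus a queue purely to show the flow, and all names are illustrative.

```python
# Minimal sketch of an asynchronous helper agent (not GePSeA's real code):
# the application offloads tasks and proceeds; the agent executes them
# off the critical path and records results.

import queue
import threading

class HelperAgent:
    def __init__(self):
        self.tasks = queue.Queue()
        self.results = {}
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def offload(self, task_id, fn, *args):
        """Hand a task to the agent and return immediately."""
        self.tasks.put((task_id, fn, args))

    def _run(self):
        while True:
            task_id, fn, args = self.tasks.get()
            self.results[task_id] = fn(*args)
            self.tasks.task_done()

    def wait(self):
        """Block until all offloaded tasks have completed."""
        self.tasks.join()

agent = HelperAgent()
agent.offload("checksum", sum, range(10))  # e.g., a checksum computed off-path
agent.wait()
```

In the real framework the same pattern carries heavier core components (reliable communication, lock management, protocol processing) rather than a toy checksum.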
- Impact of Network Sharing in Multi-core Architectures. Narayanaswamy, Ganesh; Balaji, Pavan; Feng, Wu-chun (Department of Computer Science, Virginia Polytechnic Institute & State University, 2008-03-01). As commodity components continue to dominate the realm of high-end computing, two hardware trends have emerged as major contributors: high-speed networking technologies and multi-core architectures. Communication middleware such as the Message Passing Interface (MPI) uses the network for communicating between processes that reside on different physical nodes while using shared memory for communicating between processes on different cores within the same node. Thus, two conflicting possibilities arise: (i) with the advent of multi-core architectures, the number of processes that reside on the same physical node and hence share the same physical network can potentially increase significantly, resulting in increased network usage, and (ii) given the increase in intra-node shared-memory communication for processes residing on the same node, the network usage can potentially reduce significantly. In this paper, we address these two conflicting possibilities and study the behavior of network usage in multi-core environments with sample scientific applications. Specifically, we analyze trends that result in increase or decrease of network usage and derive insights on application performance based on these. We also study the sharing of different resources in the system in multi-core environments and identify the contribution of the network in this mix. Finally, we study different process allocation strategies and analyze their impact on such network sharing.
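The two conflicting possibilities can be made concrete with a back-of-the-envelope count: with P processes per node on N nodes, how many process pairs communicate over shared memory (same node) versus over the network? The function below is a simple illustration of that arithmetic, not a model from the paper.

```python
# Count intra-node (shared-memory) vs. inter-node (network) process pairs
# for N nodes with P processes each -- an illustrative calculation only.

def comm_pairs(nodes, procs_per_node):
    total_procs = nodes * procs_per_node
    total = total_procs * (total_procs - 1) // 2
    intra = nodes * (procs_per_node * (procs_per_node - 1) // 2)
    inter = total - intra
    return intra, inter

# 16 processes on 16 single-core nodes: every pair crosses the network.
assert comm_pairs(16, 1) == (0, 120)
# The same 16 processes on 4 quad-core nodes: 24 pairs move to shared
# memory, yet each NIC is now shared by 4 processes.
assert comm_pairs(4, 4) == (24, 96)
```

Whether the shift toward shared memory or the increased per-NIC sharing dominates depends on the application's communication pattern, which is exactly what the paper's process-allocation study probes.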
- MPI-ACC: Accelerator-Aware MPI for Scientific Applications. Aji, Ashwin M.; Panwar, Lokendra S.; Ji, Feng; Murthy, Karthik; Chabbi, Milind; Balaji, Pavan; Bisset, Keith R.; Dinan, James; Feng, Wu-chun; Mellor-Crummey, John; Ma, Xiaosong; Thakur, Rajeev (2016-05-01)
- On the Interaction of High-Performance Network Protocol Stacks with Multicore Architectures. Chunangad Narayanaswamy, Ganesh (Virginia Tech, 2008-04-18). Multicore architectures have been one of the primary driving forces behind the recent rapid growth in high-end computing systems, contributing to their growing scales and capabilities. With significant enhancements in the high-speed networking technologies and protocol stacks that support these high-end systems, there is a growing need to understand the interaction between the two closely. Since these two components have been designed mostly independently, they often have serious and surprising interactions that result in heavy asymmetry in the effective capability of the different cores, thereby degrading the performance of various applications. Similarly, depending on the communication pattern of the application and the layout of processes across nodes, these interactions could potentially introduce network scalability issues, which is also an important concern for system designers. In this thesis, we analyze these asymmetric interactions and propose and design a novel systems-level management framework called SIMMer (Systems Interaction Mapping Manager) that automatically monitors these interactions and dynamically manages the mapping of processes on processor cores to transparently maximize application performance. Performance analysis of SIMMer shows that it can improve the communication performance of applications by more than twofold and the overall application performance by 18%. We further analyze the impact of contention in network and processor resources and relate it to the communication pattern of the application. Insights learned from these analyses can lead to efficient runtime configurations for scientific applications on multicore architectures.
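The dynamic-remapping step at the heart of a SIMMer-like manager can be sketched as a rebalancing rule: observe per-process communication load and, if a communication-heavy process sits on the core that also absorbs protocol processing, swap it with a lighter process elsewhere. This is a hedged illustration; the function name and interface are invented, not SIMMer's.

```python
# Illustrative rebalancing rule (not SIMMer's actual implementation):
# move the heaviest communicator off the protocol-processing core
# when a lighter process elsewhere can take its place.

def rebalance(mapping, comm_load, protocol_core=0):
    """mapping: process -> core; comm_load: process -> observed load.
    Returns a new mapping with at most one load-reducing swap applied."""
    on_proto = [p for p, c in mapping.items() if c == protocol_core]
    off_proto = [p for p, c in mapping.items() if c != protocol_core]
    new = dict(mapping)
    if not on_proto or not off_proto:
        return new
    heavy = max(on_proto, key=lambda p: comm_load[p])
    light = min(off_proto, key=lambda p: comm_load[p])
    if comm_load[heavy] > comm_load[light]:  # swap only if it helps
        new[heavy], new[light] = mapping[light], mapping[heavy]
    return new
```

Run periodically, such a rule converges toward keeping only light communicators co-located with protocol processing, which is the transparency the thesis attributes to SIMMer's monitoring loop.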
- A Pluggable Framework for Lightweight Task Offloading in Parallel and Distributed Computing. Singh, Ajeet; Balaji, Pavan; Feng, Wu-chun (Department of Computer Science, Virginia Polytechnic Institute & State University, 2008). Multicore processors have quickly become ubiquitous in supercomputing, cluster computing, datacenter computing, and even personal computing. Software advances, however, continue to lag behind. In the past, software designers could simply rely on clock-speed increases to improve the performance of their software. With clock speeds now stagnant, software designers need to tap into the increased horsepower of multiple cores in a processor by creating software artifacts that support parallelism. Rather than forcing designers to write such software artifacts from scratch, we propose a pluggable framework that designers can reuse for lightweight task offloading in a parallel computing environment of multiple cores, whether those cores be colocated on a processor within a compute node, between compute nodes in a tightly-coupled system like a supercomputer, or between compute nodes in a loosely-coupled one like a cloud computer. To demonstrate the efficacy of our framework, we use the framework to implement lightweight task offloading (or software acceleration) for a popular parallel sequence-search application called mpiBLAST. Our experimental results on a 9-node, 36-core AMD Opteron cluster show that using mpiBLAST with our pluggable framework results in a 205% speed-up.
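"Pluggable" here suggests that task handlers are registered with the framework and work is dispatched to whichever plugin claims the task type. The sketch below illustrates that registration/dispatch shape under invented names; the real framework additionally spans cores, nodes, and clouds.

```python
# Illustrative plugin registry for task offloading (names are assumptions):
# handlers register for a task type; the framework routes offloaded work.

class OffloadFramework:
    def __init__(self):
        self._plugins = {}

    def register(self, task_type):
        """Decorator: plug a handler for one task type into the framework."""
        def wrap(fn):
            self._plugins[task_type] = fn
            return fn
        return wrap

    def offload(self, task_type, payload):
        if task_type not in self._plugins:
            raise KeyError(f"no plugin for task type {task_type!r}")
        return self._plugins[task_type](payload)

framework = OffloadFramework()

@framework.register("reverse")
def reverse(data):
    return data[::-1]  # toy stand-in for a real task such as sequence search
```

A sequence-search plugin for mpiBLAST would register the same way, which is what lets one framework serve unrelated applications.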
- Programming High-Performance Clusters with Heterogeneous Computing Devices. Aji, Ashwin M. (Virginia Tech, 2015-05-19). Today's high-performance computing (HPC) clusters are seeing an increase in the adoption of accelerators like GPUs, FPGAs, and co-processors, leading to heterogeneity in the computation and memory subsystems. To program such systems, application developers typically employ a hybrid programming model of MPI across the compute nodes in the cluster and an accelerator-specific library (e.g., CUDA, OpenCL, OpenMP, OpenACC) across the accelerator devices within each compute node. Such explicit management of disjoint computation and memory resources leads to reduced productivity and performance. This dissertation focuses on designing, implementing, and evaluating a runtime system for HPC clusters with heterogeneous computing devices. This work also explores extending existing programming models to make use of our runtime system for easier code modernization of existing applications. Specifically, we present MPI-ACC, an extension to the popular MPI programming model and runtime system for efficient data movement and automatic task mapping across the CPUs and accelerators within a cluster, and discuss the lessons learned. MPI-ACC's task-mapping runtime subsystem performs fast and automatic device selection for a given task. MPI-ACC's data-movement subsystem includes careful optimizations for end-to-end communication among CPUs and accelerators, which are seamlessly leveraged by the application developers. MPI-ACC provides a familiar, flexible, and natural interface for programmers to choose the right computation or communication targets, while its runtime system achieves efficient cluster utilization.
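Automatic device selection of the kind MPI-ACC's task-mapping subsystem performs can be sketched as a cost comparison: charge each candidate device a fixed offload overhead plus per-byte transfer time plus compute time, and pick the cheapest. The numbers and field names below are illustrative assumptions, not MPI-ACC's actual model or interface.

```python
# Illustrative device-selection heuristic (not MPI-ACC's real subsystem):
# estimate overhead + transfer + compute cost per device and take the minimum.

def select_device(task_flops, data_bytes, devices):
    """devices: list of dicts with 'name', 'flops_per_s',
    'offload_overhead_s', and 'bytes_per_s'."""
    def cost(d):
        transfer = data_bytes / d["bytes_per_s"]
        compute = task_flops / d["flops_per_s"]
        return d["offload_overhead_s"] + transfer + compute
    return min(devices, key=cost)["name"]

devices = [
    {"name": "cpu", "flops_per_s": 1e10, "offload_overhead_s": 0.0,
     "bytes_per_s": 1e10},                       # no offload cost, slow compute
    {"name": "gpu", "flops_per_s": 1e12, "offload_overhead_s": 1e-3,
     "bytes_per_s": 1e9},                        # launch cost + PCIe transfer
]
```

Small tasks stay on the CPU because the fixed offload overhead dominates; large, compute-heavy tasks amortize it and go to the GPU, which is the qualitative behavior automatic task mapping aims for.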
- Runtime Adaptation for Autonomic Heterogeneous Computing. Scogland, Thomas R. (Virginia Tech, 2014-12-12). Heterogeneity is increasing across all levels of computing, with the rise of accelerators such as GPUs, FPGAs, and other coprocessors into everything from cell phones to supercomputers. More quietly, it is increasing with the rise of NUMA systems, hierarchical caching, OS noise, and a myriad of other factors. As heterogeneity becomes a fact of life, efficiently managing heterogeneous compute resources is becoming a critical, and ever more complex, task. The focus of this dissertation is to lay the foundation for an autonomic system for heterogeneous computing, employing runtime adaptation to improve performance portability and performance consistency while maintaining or increasing programmability. We investigate heterogeneity arising from a myriad of factors, grouped into the dimensions of locality and capability. This work has resulted in runtime schedulers capable of automatically detecting and mitigating heterogeneity in physically homogeneous systems through MPI and adaptive coscheduling for physically heterogeneous accelerator-based systems, as well as a synthesis of the two to address multiple levels of heterogeneity as a coherent whole. We also discuss our current work towards the next generation of fine-grained scheduling and synchronization across heterogeneous platforms in the design of a highly scalable and portable concurrent queue for many-core systems. Each component addresses aspects of the urgent need for automated management of the extreme and ever-expanding complexity introduced by heterogeneity.
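One common form of the runtime adaptation described above is measured-throughput work splitting: partition each round of work across devices in proportion to how fast each device finished the previous round, so the split converges without a static hardware model. The sketch below illustrates that generic pattern under invented names; it is not code from the dissertation.

```python
# Illustrative adaptive work partitioning (generic pattern, invented names):
# split the next round in proportion to last round's measured throughput.

def adapt_split(work_items, prev_throughput):
    """prev_throughput: device -> items/second from the previous round.
    Returns device -> number of items for the next round."""
    total_rate = sum(prev_throughput.values())
    devices = sorted(prev_throughput)            # deterministic order
    split, assigned = {}, 0
    for d in devices[:-1]:
        share = round(work_items * prev_throughput[d] / total_rate)
        split[d] = share
        assigned += share
    split[devices[-1]] = work_items - assigned   # remainder to last device
    return split
```

If the GPU processed items three times faster than the CPU last round, it receives three quarters of the next round; repeated rounds track drifting conditions such as OS noise or contention.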
- Scalability Analysis and Optimization for Large-Scale Deep Learning. Pumma, Sarunya (Virginia Tech, 2020-02-03). Despite its growing importance, scalable deep learning (DL) remains a difficult challenge. Scalability of large-scale DL is constrained by many factors, including those deriving from data movement and data processing. DL frameworks rely on large volumes of data to be fed to the computation engines for processing. However, current hardware trends showcase that data movement is already one of the slowest components in modern high performance computing systems, and this gap is only going to increase in the future. This includes data movement needed from the filesystem, within the network subsystem, and even within the node itself, all of which limit the scalability of DL frameworks on large systems. Even after data is moved to the computational units, managing this data is not easy. Modern DL frameworks use multiple components---such as graph scheduling, neural network training, gradient synchronization, and input pipeline processing---to process this data in an asynchronous uncoordinated manner, which results in straggler processes and consequently computational imbalance, further limiting scalability. This thesis studies a subset of the large body of data movement and data processing challenges that exist in modern DL frameworks. For the first study, we investigate file I/O constraints that limit the scalability of large-scale DL. We first analyze the Caffe DL framework with Lightning Memory-Mapped Database (LMDB), one of the most widely used file I/O subsystems in DL frameworks, to understand the causes of file I/O inefficiencies. Based on our analysis, we propose LMDBIO---an optimized I/O plugin for scalable DL that addresses the various shortcomings in existing file I/O for DL.
Our experimental results show that LMDBIO significantly outperforms LMDB in all cases and improves overall application performance by up to 65-fold on 9,216 CPUs of the Blues and Bebop supercomputers at Argonne National Laboratory. Our second study deals with the computational imbalance problem in data processing. For most DL systems, the simultaneous and asynchronous execution of multiple data-processing components on shared hardware resources causes these components to contend with one another, leading to severe computational imbalance and degraded scalability. We propose various novel optimizations that minimize resource contention and improve performance by up to 35% for training various neural networks on 24,576 GPUs of the Summit supercomputer at Oak Ridge National Laboratory---the world's largest supercomputer at the time of writing of this thesis.
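The I/O-overlap principle behind optimized DL input pipelines can be illustrated with double buffering: while the trainer consumes batch i, a reader thread fetches batch i+1, so computation rarely waits on the filesystem. This is a generic sketch of the pattern, not LMDBIO's actual implementation.

```python
# Generic double-buffered input pipeline sketch (not LMDBIO's code):
# a bounded queue lets a reader thread stay a few batches ahead of training.

import queue
import threading

def reader(batches, out_q):
    for b in batches:
        out_q.put(b)        # in a real pipeline: read + decode from disk
    out_q.put(None)         # end-of-data sentinel

def train(batches, depth=2):
    q = queue.Queue(maxsize=depth)   # bounded: prefetch without unbounded memory
    t = threading.Thread(target=reader, args=(batches, q))
    t.start()
    processed = []
    while True:
        b = q.get()
        if b is None:
            break
        processed.append(sum(b))     # stand-in for one training step
    t.join()
    return processed
```

The bounded queue depth is the knob: deep enough to hide I/O latency, shallow enough that prefetched batches do not crowd out training memory.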