Browsing by Author "Gardner, Mark K."
- Cognizant Networks: A Model and Framework for Session-based Communications and Adaptive Networking
  Kalim, Umar (Virginia Tech, 2017-08-09)
  The Internet has made tremendous progress since its inception. The kingpin has been the transmission control protocol (TCP), which supports a large fraction of communication. With the Internet's widespread access, users now have increased expectations. The demands have evolved to an extent which TCP was never designed to support. Since network stacks do not provide the necessary functionality for modern applications, developers are forced to implement it over and over again, as part of the application or supporting libraries. Consequently, application developers not only bear the burden of developing application features but are also responsible for building networking libraries to support sophisticated scenarios. This leads to considerable duplication of effort. The challenge for TCP in supporting modern use cases is mostly due to limiting assumptions, simplistic communication abstractions, and (once expedient) implementation shortcuts. To further add to the complexity, the limited TCP options space is insufficient to support extensibility and thus, contemporary communication patterns. Some argue that radical changes are required to extend the network's functionality; some researchers believe that a clean slate approach is the only path forward. Others suggest that evolution of the network stack is necessary to ensure wider adoption by avoiding a flag day. In either case, we see that the proposed solutions have not been adopted by the community at large. This is perhaps because the cost of transition from the incumbent to the new technology outweighs the value offered. In some cases, the limited scope of the proposed solutions limits their value. In other cases, the lack of backward compatibility or significant porting effort precludes incremental adoption altogether.
In this dissertation, we focus on the development of a communication model that explicitly acknowledges the context of the conversation and describes (much of) modern communications. We highlight how the communication stack should be able to discover, interact with and use available resources to compose richer communication constructs. The model is able to do so by using session, flow and endpoint abstractions to describe communications between two or more endpoints. These abstractions provide means to the application developers for setting up and manipulating constructs, while the ability to recognize change in the operating context and reconfigure the constructs allows applications to adapt to the changing requirements. The model considers two or more participants to be involved in the conversation and thus enables most modern communication patterns, which is in contrast with the well-established two-participant model. Our contributions also include an implementation of a framework that realizes such communication methods and enables future innovation. We substantiate our claims by demonstrating case studies where we use the proposed abstractions to highlight the gains. We also show how the proposed model may be implemented in a backwards compatible manner, such that it does not break legacy applications, network stacks, or middleboxes in the network infrastructure. We also present use cases to substantiate our claims about backwards compatibility. This establishes that incremental evolution is possible. We highlight the benefits of context awareness in setting up complex communication constructs by presenting use cases and their evaluation. Finally, we show how the communication model may open the door for new and richer communication patterns.
- CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-Core Architectures
  Martinez Arroyo, Gabriel Ernesto (Virginia Tech, 2011-07-14)
  The use of graphics processing units (GPUs) in high-performance parallel computing continues to steadily become more prevalent, often as part of a heterogeneous system. For years, CUDA has been the de facto programming environment for nearly all general-purpose GPU (GPGPU) applications. In spite of this, the framework is available only on NVIDIA GPUs, traditionally requiring reimplementation in other frameworks in order to utilize additional multi- or many-core devices. On the other hand, OpenCL provides an open and vendor-neutral programming environment and run-time system. With implementations available for CPUs, GPUs, and other types of accelerators, OpenCL therefore holds the promise of a "write once, run anywhere" ecosystem for heterogeneous computing. Given the many similarities between CUDA and OpenCL, manually porting a CUDA application to OpenCL is almost straightforward, albeit tedious and error-prone. In response to this issue, we created CU2CL, an automated CUDA-to-OpenCL source-to-source translator that possesses a novel design and clever reuse of the Clang compiler framework. Currently, the CU2CL translator covers the primary constructs found in the CUDA Runtime API, and we have successfully translated several applications from the CUDA SDK and Rodinia benchmark suite. CU2CL's translation times are reasonable, allowing for many applications to be translated at once. The number of manual changes required after executing our translator on CUDA source is minimal, with some applications compiling and working with no changes at all. The performance of our automatically translated applications via CU2CL is on par with their manually ported counterparts.
- CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-core Architectures
  Martinez, Gabriel; Feng, Wu-chun; Gardner, Mark K. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2011)
  The use of graphics processing units (GPUs) in high-performance parallel computing continues to become more prevalent, often as part of a heterogeneous system. For years, CUDA has been the de facto programming environment for nearly all general-purpose GPU (GPGPU) applications. In spite of this, the framework is available only on NVIDIA GPUs, traditionally requiring reimplementation in other frameworks in order to utilize additional multi- or many-core devices. On the other hand, OpenCL provides an open and vendor-neutral programming environment and runtime system. With implementations available for CPUs, GPUs, and other types of accelerators, OpenCL therefore holds the promise of a “write once, run anywhere” ecosystem for heterogeneous computing. Given the many similarities between CUDA and OpenCL, manually porting a CUDA application to OpenCL is typically straightforward, albeit tedious and error-prone. In response to this issue, we created CU2CL, an automated CUDA-to-OpenCL source-to-source translator that possesses a novel design and clever reuse of the Clang compiler framework. Currently, the CU2CL translator covers the primary constructs found in the CUDA runtime API, and we have successfully translated many applications from the CUDA SDK and Rodinia benchmark suite. The performance of our automatically translated applications via CU2CL is on par with their manually ported counterparts.
- Holistic Abstraction for Distributed Network Debugging
  Khan, Jehandad (Virginia Tech, 2018-03-15)
  Computer networks are engineered for performance and flexibility, delivering billions of packets per second with high reliability, until they fail. It is during such times of crisis that debugging and troubleshooting come to the forefront; however, the focus on performance results in design tradeoffs that make networks challenging to troubleshoot. This dissertation hypothesizes that a view of the network as a single entity solves the above problems, without compromising either performance or visibility. The primary contributions are 1) a topology-oblivious network abstraction for performance monitoring and troubleshooting, 2) transformation of the network abstract query to device-local semantics, 3) optimizations for reducing state collection overhead, and 4) global state semantics in the proposed query language easing expression of network queries. Abstracting the entire system as an entity simplifies the debugging process, making possible comprehensive root-cause analysis and relieving the network administrator from dealing with many devices, delivering gains in productivity and efficiency. By merging network topology information with state collection, this thesis provides a new way to look at the network monitoring and troubleshooting problem. Such an amalgamation allows the translation of a performance query expressed in a domain-specific language to small pieces of code that operate on different devices in the network and collect the necessary state. This merging results in less overhead per switch, reduces the strain on devices, and provides a simple abstraction to the administrator.
- Metrics, Models and Methodologies for Energy-Proportional Computing
  Subramaniam, Balaji (Virginia Tech, 2015-08-21)
  Massive data centers housing thousands of computing nodes have become commonplace in enterprise computing, and the power consumption of such data centers is growing at an unprecedented rate. Exacerbating such costs, data centers are often over-provisioned to avoid costly outages associated with the potential overloading of electrical circuitry. However, such over-provisioning is often unnecessary since a data center rarely operates at its maximum capacity. It is imperative that we realize effective strategies to control the power consumption of the server and improve the energy efficiency of data centers. Adding to the problem is the inability of the servers to exhibit energy proportionality, which diminishes the overall energy efficiency of the data center. Therefore, in this dissertation, we investigate whether it is possible to achieve energy proportionality at the server- and cluster-level by efficient power and resource provisioning. Towards this end, we provide a thorough analysis of energy proportionality at the server- and cluster-level and provide insight into the power saving opportunity and mechanisms to improve energy proportionality. Specifically, we make the following contributions at the server-level using enterprise-class workloads. We analyze the average power consumption of the full system as well as the subsystems and describe the energy proportionality of these components; characterize the instantaneous power profile of enterprise-class workloads using the on-chip energy meters; design a runtime system based on a load prediction model and an optimization framework to set the appropriate power constraints to meet specific performance targets; and then present the effects of our runtime system on energy proportionality, average power, performance and instantaneous power consumption of enterprise applications.
We then make the following contributions at the cluster-level. Using data serving, web searching and data caching as our representative workloads, we first analyze the component-level power distribution on a cluster. Second, we characterize how these workloads utilize the cluster. Third, we analyze the potential of power provisioning techniques (i.e., active low-power, turbo and idle low-power modes) to improve the energy proportionality. We then describe the ability of active low-power modes to provide trade-offs in power and latency. Finally, we compare and contrast power provisioning and resource provisioning techniques. This thesis sheds light on mechanisms to tune the power provisioned for a system under strict performance targets and opportunities to improve energy proportionality and instantaneous power consumption via efficient power and resource provisioning at the server- and cluster-level.
- Models and Techniques for Green High-Performance Computing
  Adhinarayanan, Vignesh (Virginia Tech, 2020-06-01)
  High-performance computing (HPC) systems have become power limited. For instance, the U.S. Department of Energy set a power envelope of 20 MW in 2008 for the first exascale supercomputer now expected to arrive in 2021-22. Toward this end, we seek to improve the greenness of HPC systems by improving their performance per watt at the allocated power budget. In this dissertation, we develop a series of models and techniques to manage power at micro-, meso-, and macro-levels of the system hierarchy, specifically addressing data movement and heterogeneity. We target the chip interconnect at the micro-level, heterogeneous nodes at the meso-level, and a supercomputing cluster at the macro-level. Overall, our goal is to improve the greenness of HPC systems by intelligently managing power. The first part of this dissertation focuses on measurement and modeling problems for power. First, we study how to infer chip-interconnect power by observing the system-wide power consumption. Our proposal is to design a novel micro-benchmarking methodology based on data-movement distance by which we can properly isolate the chip interconnect and measure its power. Next, we study how to develop software power meters to monitor a GPU's power consumption at runtime. Our proposal is to adapt performance counter-based models for their use at runtime via a combination of heuristics, statistical techniques, and application-specific knowledge. In the second part of this dissertation, we focus on managing power. First, we propose to reduce the chip-interconnect power by proactively managing its dynamic voltage and frequency scaling (DVFS) state. Toward this end, we develop a novel phase predictor that uses approximate pattern matching to forecast future requirements and, in turn, proactively manage power. Second, we study the problem of applying a power cap to a heterogeneous node.
Our proposal proactively manages the GPU power using phase prediction and a DVFS power model but reactively manages the CPU. The resulting hybrid approach can take advantage of the differences in the capabilities of the two devices. Third, we study how in-situ techniques can be applied to improve the greenness of HPC clusters. Overall, in our dissertation, we demonstrate that it is possible to infer power consumption of real hardware components without directly measuring them, using the chip interconnect and GPU as examples. We also demonstrate that it is possible to build models of sufficient accuracy and apply them for intelligently managing power at many levels of the system hierarchy.
- MOON: MapReduce on Opportunistic eNvironments
  Lin, Heshan; Archuleta, Jeremy; Ma, Xiaosong; Feng, Wu-chun; Zhang, Zhe; Gardner, Mark K. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2009)
  MapReduce offers a flexible programming model for processing and generating large data sets on dedicated resources, where only a small fraction of such resources is ever unavailable at any given time. In contrast, when MapReduce is run on volunteer computing systems, which opportunistically harness idle desktop computers via frameworks like Condor, it results in poor performance due to the volatility of the resources, in particular, the high rate of node unavailability. Specifically, the data and task replication scheme adopted by existing MapReduce implementations is woefully inadequate for resources with high unavailability. To address this, we propose MOON, short for MapReduce On Opportunistic eNvironments. MOON extends Hadoop, an open-source implementation of MapReduce, with adaptive task and data scheduling algorithms in order to offer reliable MapReduce services on a hybrid resource architecture, where volunteer computing systems are supplemented by a small set of dedicated nodes. The adaptive task and data scheduling algorithms in MOON distinguish between (1) different types of MapReduce data and (2) different types of node outages in order to strategically place tasks and data on both volatile and dedicated nodes. Our tests demonstrate that MOON can deliver a 3-fold performance improvement to Hadoop in volatile, volunteer computing environments.
- On the Complexity of Robust Source-to-Source Translation from CUDA to OpenCL
  Sathre, Paul Daniel (Virginia Tech, 2013-06-12)
  The use of hardware accelerators in high-performance computing has grown increasingly prevalent, particularly due to the growth of graphics processing units (GPUs) as general-purpose (GPGPU) accelerators. Much of this growth has been driven by NVIDIA's CUDA ecosystem for developing GPGPU applications on NVIDIA hardware. However, with the increasing diversity of GPUs (including those from AMD, ARM, and Qualcomm), OpenCL has emerged as an open and vendor-agnostic environment for programming GPUs as well as other parallel computing devices such as the CPU (central processing unit), APU (accelerated processing unit), FPGA (field programmable gate array), and DSP (digital signal processor). The above, coupled with the broader array of devices supporting OpenCL and the significant conceptual and syntactic overlap between CUDA and OpenCL, motivated the creation of a CUDA-to-OpenCL source-to-source translator. However, there exist sufficient differences that make the translation non-trivial, providing practical limitations to both manual and automatic translation efforts. In this thesis, the performance, coverage, and reliability of a prototype CUDA-to-OpenCL source translator are addressed via extensive profiling of a large body of sample CUDA applications. An analysis of the sample body of applications is provided, which identifies and characterizes general CUDA source constructs and programming practices that obstruct our translation efforts. This characterization then led to more robust support for the translator, followed by an evaluation that demonstrated that the performance of our automatically translated OpenCL is on par with the original CUDA for a subset of sample applications when executed on the same NVIDIA device.
- SLIM: A Session-Layer Intermediary for Enabling Multi-Party and Reconfigurable Communication
  Kalim, Umar; Gardner, Mark K.; Brown, Eric J.; Feng, Wu-chun (Department of Computer Science, Virginia Polytechnic Institute & State University, 2015-06-11)
  Increasingly, communication requires more from the network stack. Due to missing functionality, we see a proliferation of networking libraries that attempt to fill the void (e.g., iOS to OSX Handoff and Google Cast SDK). This leads to considerable duplication of effort. Further, the provisions for extending legacy protocol stacks are largely exhausted (e.g., TCP options space is mostly allocated), making the addition of future extensions much more challenging. We present SLIM, an extensible session-layer intermediary that extracts the duplicate functionality from modern networking libraries and provides the means for future extensibility to the network stack. SLIM enables mobility, multi-party communication, and dynamic reconfiguration of the network stack in a straightforward and elegant way. SLIM includes an out-of-band signaling channel, which not only enables reconfiguration, but also allows for incremental evolution of the stack. To start, we tease out elements of session management which are currently conflated with transport semantics in TCP. Doing so highlights the need for sessions in contemporary use cases. Next, we propose session, flow and end-point abstractions that allow application developers to describe communication between any number of participants. The abstractions apply to individual or group communication, allowing them to be managed as one. We describe the abstractions and evaluate them in terms of typical communication patterns. We demonstrate the abstractions via a prototype implementation of SLIM.