Computer Science Technical Reports
The Department of Computer Science collection of technical
reports began in 1973. Please use the subject headings listed below for all submissions.
Subject Headings:
- Algorithms
- Big Data
- Bioinformatics
- Computational Biology
- Computational Science and Engineering
- Computer Graphics/Animation
- Computer Science Education
- Computer Systems
- Cyberarts
- Cybersecurity
- Data and Text Mining
- Digital Education
- Digital Libraries
- Discrete Event Simulation
- High Performance Computing
- Human Computer Interaction
- Information Retrieval
- Machine Learning
- Mathematical Programming
- Mathematical Software
- Modeling and Simulation
- Networking
- Numerical Analysis
- Parallel and Distributed Computing
- Problem Solving Environments
- Software Engineering
- Theoretical Computer Science
- Virtual/Augmented Reality
- Visualization
Browse
Browsing Computer Science Technical Reports by Subject "Computer systems"
Now showing 1 - 6 of 6
- Accelerating Workloads on FPGAs via OpenCL: A Case Study with OpenDwarfs
  Verma, Anshuman; Helal, Ahmed E.; Krommydas, Konstantinos; Feng, Wu-chun (Department of Computer Science, Virginia Polytechnic Institute & State University, 2016-05-13)
  For decades, the streaming architecture of FPGAs has delivered accelerated performance across many application domains, such as option pricing solvers in finance, computational fluid dynamics in oil and gas, and packet processing in network routers and firewalls. However, this performance comes at the expense of programmability. FPGA developers use hardware design languages (HDLs) to implement the application data and control path and to design hardware modules for computational pipelines, memory management, synchronization, and communication. This process requires extensive knowledge of logic design, design automation tools, and low-level details of the FPGA architecture, and it consumes significant development time and effort. To address this lack of programmability of FPGAs, OpenCL provides an easy-to-use and portable programming model for CPUs, GPUs, APUs, and now, FPGAs. Although this significantly improves programmability, an OpenCL kernel optimized for a GPU may lack performance portability on an FPGA. To improve the performance of OpenCL kernels on FPGAs, we identify general techniques to optimize OpenCL kernels for FPGAs under device-specific hardware constraints. We then apply these optimization techniques to the OpenDwarfs benchmark suite, which has diverse parallelism profiles and memory access patterns, in order to evaluate the effectiveness of the optimizations in terms of performance and resource utilization. Finally, we present the performance of the structured-grids and N-body dwarf-based benchmarks in the context of the various optimizations, along with their potential re-factoring. We find that careful design of kernels for FPGAs can result in a highly efficient pipeline, achieving 91% of theoretical throughput for the structured-grids dwarf. (An illustrative kernel-optimization sketch appears after this list.)
  Index Terms: OpenDwarfs; FPGA; OpenCL; GPU; MIC; Accelerators; Performance Portability
- An Automated Framework for Characterizing and Subsetting GPGPU Workloads
  Adhinarayanan, Vignesh; Feng, Wu-chun (Department of Computer Science, Virginia Polytechnic Institute & State University, 2015-12-18)
  Graphics processing units (GPUs) are becoming increasingly common in today’s computing systems due to their superior performance and energy efficiency relative to their cost. To further improve these desired characteristics, researchers have proposed several software and hardware techniques. Evaluating these proposed techniques can be tricky because of the ad hoc manner in which applications are selected for evaluation. Sometimes researchers spend unnecessary time evaluating redundant workloads, which is particularly problematic for time-consuming studies involving simulation. Other times, they fail to expose the shortcomings of their proposed techniques when too few workloads are chosen for evaluation. To overcome these problems, we propose an automated framework that characterizes and subsets GPGPU workloads based on a user-chosen set of performance metrics/counters. The framework internally uses principal component analysis (PCA) to reduce the dimensionality of the chosen metrics and then uses hierarchical clustering to identify similarity among the workloads. In this study, we use our framework to identify redundancy in the recently released SPEC ACCEL OpenCL benchmark suite using a few architecture-dependent metrics. Our analysis shows that a subset of eight applications provides most of the diversity in the 19-application benchmark suite. We also subset the Parboil, Rodinia, and SHOC benchmark suites and then compare them against one another to identify “gaps” in these suites. As an example, we show that SHOC has many applications that are similar to each other and could benefit from adding four applications from Parboil to improve its diversity. (A simplified clustering sketch appears after this list.)
- Bridging the Performance-Programmability Gap for FPGAs via OpenCL: A Case Study with OpenDwarfs
  Krommydas, Konstantinos; Helal, Ahmed E.; Verma, Anshuman; Feng, Wu-chun (Department of Computer Science, Virginia Polytechnic Institute & State University, 2016-05-13)
  For decades, the streaming architecture of FPGAs has delivered accelerated performance across many application domains, such as option pricing solvers in finance, computational fluid dynamics in oil and gas, and packet processing in network routers and firewalls. However, this performance has come at the significant expense of programmability, i.e., the performance-programmability gap. In particular, FPGA developers use hardware design languages (HDLs) to implement the application data path and to design hardware modules for computation pipelines, memory management, synchronization, and communication. This process requires extensive low-level knowledge of the target FPGA architecture and consumes significant development time and effort. To address this lack of programmability of FPGAs, OpenCL provides an easy-to-use and portable programming model for CPUs, GPUs, APUs, and now, FPGAs. However, this significantly improved programmability can come at the expense of performance; that is, there still remains a performance-programmability gap. To improve the performance of OpenCL kernels on FPGAs, and thus, bridge the performance-programmability gap, we identify general techniques to optimize OpenCL kernels for FPGAs under device-specific hardware constraints. We then apply these optimization techniques to the OpenDwarfs benchmark suite, with its diverse parallelism profiles and memory access patterns, in order to evaluate the effectiveness of the optimizations in terms of performance and resource utilization. Finally, we present the performance of the optimized OpenDwarfs, along with their potential re-factoring, to bridge the performance gap from programming in OpenCL versus programming in an HDL. (A sketch of the FPGA-oriented single work-item kernel style appears after this list.)
  Index Terms: OpenDwarfs; FPGA; OpenCL; GPU; GPGPU; MIC; Accelerators; Performance Portability
- Design and Evaluation of Scalable Concurrent Queues for Many-Core Architectures
  Scogland, Thomas R. W.; Feng, Wu-chun (Department of Computer Science, Virginia Polytechnic Institute & State University, 2014-08-06)
  As core counts increase and as heterogeneity becomes more common in parallel computing, we face the prospect of programming hundreds or even thousands of concurrent threads in a single shared-memory system. At these scales, even highly efficient concurrent algorithms and data structures can become bottlenecks unless they are designed from the ground up with throughput as their primary goal. In this paper, we present three contributions: (1) a characterization of queue designs in terms of modern multi- and many-core architectures, (2) the design of a high-throughput concurrent FIFO queue for many-core architectures that avoids the bottlenecks common in modern queue designs, and (3) a thorough evaluation of concurrent queue throughput across CPU, GPU, and co-processor devices. Our evaluation shows that focusing on throughput, rather than progress guarantees, allows our queue to achieve throughput as much as three orders of magnitude (1000X) higher than lock-free and combining queues on GPU platforms and two times (2X) higher on CPU devices. These results deliver critical insight into the design of data structures for highly concurrent systems: (1) progress guarantees do not guarantee scalability, and (2) allowing an algorithm to block can actually increase throughput. (A sketch of a generic throughput-oriented queue appears after this list.)
- SLIM: A Session-Layer Intermediary for Enabling Multi-Party and Reconfigurable Communication
  Kalim, Umar; Gardner, Mark K.; Brown, Eric J.; Feng, Wu-chun (Department of Computer Science, Virginia Polytechnic Institute & State University, 2015-06-11)
  Increasingly, communication requires more from the network stack. Due to missing functionality, we see a proliferation of networking libraries that attempt to fill the void (e.g., iOS to OSX Handoff and Google Cast SDK). This leads to considerable duplication of effort. Further, the provisions for extending legacy protocol stacks are largely exhausted (e.g., the TCP options space is mostly allocated), making the addition of future extensions much more challenging. We present SLIM, an extensible session-layer intermediary that extracts the duplicate functionality from modern networking libraries and provides the means for future extensibility of the network stack. SLIM enables mobility, multi-party communication, and dynamic reconfiguration of the network stack in a straightforward and elegant way. SLIM includes an out-of-band signaling channel, which not only enables reconfiguration but also allows for incremental evolution of the stack. To start, we tease out the elements of session management that are currently conflated with transport semantics in TCP. Doing so highlights the need for sessions in contemporary use cases. Next, we propose session, flow, and end-point abstractions that allow application developers to describe communication between any number of participants. The abstractions apply to individual or group communication, allowing a group to be managed as one. We describe the abstractions and evaluate them in terms of typical communication patterns. We demonstrate the abstractions via a prototype implementation of SLIM. (A hypothetical API sketch appears after this list.)
- Telescoping Architectures: A Methodology for Evaluating Next-Generation Heterogeneous Computing
  Krommydas, Konstantinos; Feng, Wu-chun (Department of Computer Science, Virginia Polytechnic Institute & State University, 2016-05-13)
  Architectural innovation has telescoped the HPC community from the commodity (Beowulf) cluster in a machine room, i.e., a multi-node system with an Ethernet interconnect, to a commodity cluster on a chip, i.e., a multicore CPU with an on-die interconnect. We project that this “telescoping architecture” will apply more broadly to heterogeneous computing, namely from heterogeneous clusters like Tianhe-2 in a machine room to a heterogeneous cluster on a chip. To that end, we present an experimental study that extends the notion of telescoping architectures to identify the ideal mixture of compute engines (CEs), and the number of such CEs, on a chip to create a heterogeneous “cluster on a chip” (CoC). Specifically, we experiment with heterogeneous architectures that contain single or multiple instances of CPUs, GPUs, Intel MICs, and FPGAs to demonstrate their performance efficacy given continuing advances in hardware technology, software, tools, and run-time support. (A device-enumeration sketch appears after this list.)
  Index Terms: architecture; microprocessor design; heterogeneous computing; dwarfs; motifs; system on a chip
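
Illustrative code sketches

The two OpenDwarfs/FPGA reports above describe optimizing OpenCL kernels under device-specific hardware constraints. As a minimal, generic illustration only (the kernel, its arguments, and the optimization choices are assumptions, not taken from OpenDwarfs), the OpenCL C sketch below shows the kind of source-level hints that OpenCL-to-FPGA compilers commonly exploit: restrict-qualified global pointers, read-only coefficients in __constant memory, and full loop unrolling so the inner multiply-adds can become one pipelined datapath stage.

```c
/* Hypothetical OpenCL C kernel: a 3-point 1-D stencil written with
 * FPGA-friendly hints.  Not taken from OpenDwarfs; for illustration only. */
__kernel void stencil3(__global const float *restrict in,
                       __global float *restrict out,
                       __constant float *coef,   /* read-only coefficients */
                       const int n)
{
    int i = get_global_id(0);
    if (i <= 0 || i >= n - 1)
        return;                        /* skip boundary points */

    float acc = 0.0f;
    /* Many OpenCL compilers, including FPGA toolchains, honor this hint;
     * full unrolling lets the three multiply-adds be instantiated as a
     * single pipelined stage instead of a small runtime loop. */
    #pragma unroll
    for (int k = -1; k <= 1; ++k)
        acc += coef[k + 1] * in[i + k];

    out[i] = acc;
}
```

On a GPU the same kernel would typically be tuned differently (e.g., for coalesced accesses and occupancy), which is the performance-portability tension both reports discuss.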
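The automated subsetting framework by Adhinarayanan and Feng reduces metric dimensionality with PCA and then groups workloads with hierarchical clustering. The C sketch below is a simplified stand-in, not the framework itself: it standardizes a small, invented workload-by-metric matrix and runs naive single-linkage agglomerative clustering until a target number of clusters remains; the PCA step is omitted for brevity.

```c
/* Minimal stand-in for the subsetting idea: single-linkage agglomerative
 * clustering of workloads over standardized metrics.  Metric values are
 * invented; the PCA step described in the report is omitted. */
#include <math.h>
#include <stdio.h>

#define NW 6   /* workloads */
#define NM 3   /* metrics per workload (e.g., IPC, bandwidth, occupancy) */
#define K  3   /* desired number of clusters, i.e., subset size */

int main(void)
{
    double m[NW][NM] = {                 /* invented workload-by-metric data */
        {1.2, 0.8, 0.5}, {1.1, 0.9, 0.4}, {0.2, 2.5, 0.9},
        {0.3, 2.4, 1.0}, {2.8, 0.3, 0.2}, {1.0, 1.0, 0.5},
    };

    /* Standardize each metric to zero mean and unit variance. */
    for (int j = 0; j < NM; ++j) {
        double mean = 0.0, var = 0.0;
        for (int i = 0; i < NW; ++i) mean += m[i][j];
        mean /= NW;
        for (int i = 0; i < NW; ++i) var += (m[i][j] - mean) * (m[i][j] - mean);
        double sd = sqrt(var / NW) > 1e-12 ? sqrt(var / NW) : 1.0;
        for (int i = 0; i < NW; ++i) m[i][j] = (m[i][j] - mean) / sd;
    }

    int label[NW];                       /* cluster id per workload */
    for (int i = 0; i < NW; ++i) label[i] = i;

    /* Repeatedly merge the two closest clusters (single linkage: the
     * closest cross-cluster pair) until only K clusters remain. */
    for (int clusters = NW; clusters > K; --clusters) {
        int ba = -1, bb = -1;
        double best = INFINITY;
        for (int a = 0; a < NW; ++a)
            for (int b = 0; b < NW; ++b) {
                if (label[a] >= label[b]) continue;  /* distinct clusters only */
                double d = 0.0;
                for (int j = 0; j < NM; ++j)
                    d += (m[a][j] - m[b][j]) * (m[a][j] - m[b][j]);
                if (d < best) { best = d; ba = label[a]; bb = label[b]; }
            }
        for (int i = 0; i < NW; ++i)
            if (label[i] == bb) label[i] = ba;       /* merge bb into ba */
    }

    for (int i = 0; i < NW; ++i)
        printf("workload %d -> cluster %d\n", i, label[i]);
    return 0;
}
```

Picking one representative workload per resulting cluster is the subsetting step the abstract describes.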
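A further FPGA-oriented style discussed in vendor OpenCL-to-FPGA toolchains (for example, Intel's FPGA SDK for OpenCL) is the single work-item, or task, kernel: the whole iteration space is written as a sequential loop so the compiler can pipeline loop iterations rather than relying on many parallel work-items. The sketch below recasts the stencil above in that style; it is again an illustration under that assumption, not a kernel from the OpenDwarfs reports.

```c
/* Hypothetical single work-item (task) form of the same 3-point stencil,
 * intended to be launched with one work-item.  Illustration only. */
__kernel void stencil3_task(__global const float *restrict in,
                            __global float *restrict out,
                            __constant float *coef,
                            const int n)
{
    /* One sequential loop over the interior points; an FPGA compiler can
     * initiate a new iteration each cycle once the pipeline fills. */
    for (int i = 1; i < n - 1; ++i) {
        float acc = 0.0f;
        #pragma unroll
        for (int k = -1; k <= 1; ++k)
            acc += coef[k + 1] * in[i + k];
        out[i] = acc;
    }
}
```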
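The concurrent-queue report argues that designing for throughput, even at the cost of blocking, can out-scale lock-free designs. Its actual queue is not reproduced here; for orientation, the C11 sketch below shows a widely used throughput-oriented bounded multi-producer/multi-consumer ring buffer that coordinates producers and consumers through per-slot sequence numbers. It is a standard pattern, not the authors' design.

```c
/* Generic bounded MPMC ring buffer with per-slot sequence numbers,
 * written with C11 atomics.  A well-known pattern, shown only to
 * illustrate throughput-oriented queue design; not the report's queue. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QCAP 1024u                        /* capacity, a power of two */

typedef struct {
    _Atomic size_t seq;                   /* which position may use this slot */
    int value;
} cell_t;

typedef struct {
    cell_t cells[QCAP];
    _Atomic size_t head;                  /* next position to dequeue */
    _Atomic size_t tail;                  /* next position to enqueue */
} mpmc_queue_t;

static void queue_init(mpmc_queue_t *q)
{
    for (size_t i = 0; i < QCAP; ++i)
        atomic_init(&q->cells[i].seq, i);
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

static bool queue_push(mpmc_queue_t *q, int v)
{
    size_t pos = atomic_load_explicit(&q->tail, memory_order_relaxed);
    for (;;) {
        cell_t *c = &q->cells[pos & (QCAP - 1)];
        size_t seq = atomic_load_explicit(&c->seq, memory_order_acquire);
        intptr_t dif = (intptr_t)seq - (intptr_t)pos;
        if (dif == 0) {                   /* slot is free for this position */
            if (atomic_compare_exchange_weak_explicit(&q->tail, &pos, pos + 1,
                    memory_order_relaxed, memory_order_relaxed)) {
                c->value = v;
                atomic_store_explicit(&c->seq, pos + 1, memory_order_release);
                return true;
            }                             /* CAS failed: pos was reloaded */
        } else if (dif < 0) {
            return false;                 /* queue is full */
        } else {
            pos = atomic_load_explicit(&q->tail, memory_order_relaxed);
        }
    }
}

static bool queue_pop(mpmc_queue_t *q, int *out)
{
    size_t pos = atomic_load_explicit(&q->head, memory_order_relaxed);
    for (;;) {
        cell_t *c = &q->cells[pos & (QCAP - 1)];
        size_t seq = atomic_load_explicit(&c->seq, memory_order_acquire);
        intptr_t dif = (intptr_t)seq - (intptr_t)(pos + 1);
        if (dif == 0) {                   /* slot holds data for this position */
            if (atomic_compare_exchange_weak_explicit(&q->head, &pos, pos + 1,
                    memory_order_relaxed, memory_order_relaxed)) {
                *out = c->value;
                atomic_store_explicit(&c->seq, pos + QCAP, memory_order_release);
                return true;
            }
        } else if (dif < 0) {
            return false;                 /* queue is empty */
        } else {
            pos = atomic_load_explicit(&q->head, memory_order_relaxed);
        }
    }
}

int main(void)
{
    static mpmc_queue_t q;                /* static keeps the array off the stack */
    queue_init(&q);
    queue_push(&q, 42);
    int v = 0;
    return (queue_pop(&q, &v) && v == 42) ? 0 : 1;
}
```

Callers that find the queue full or empty would typically retry or wait, i.e., block at the algorithm level, which matches the report's observation that allowing blocking can increase throughput.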
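SLIM's programming interface is not given in the abstract above. The header-style C sketch below is purely hypothetical and only illustrates how session, flow, and end-point abstractions for multi-party, reconfigurable communication might be surfaced to an application; every type and function name is invented.

```c
/* Hypothetical sketch of session/flow/end-point abstractions in the spirit
 * of SLIM.  Every identifier here is invented for illustration; none of
 * this is SLIM's actual API. */
#include <stddef.h>
#include <sys/types.h>                   /* ssize_t (POSIX) */

typedef struct slim_endpoint slim_endpoint_t;  /* a participant (host/device) */
typedef struct slim_flow     slim_flow_t;      /* one transport-level path     */
typedef struct slim_session  slim_session_t;   /* a group of endpoints + flows */

/* A session, not an individual TCP connection, is the unit the application
 * manages; any number of participants can be added to it. */
slim_session_t  *slim_session_open(const char *name);
slim_endpoint_t *slim_session_add_endpoint(slim_session_t *s,
                                           const char *address);

/* Flows belong to a session and can be added or migrated (e.g., moved to a
 * different interface) without tearing the session down, coordinated over
 * an out-of-band signaling channel. */
slim_flow_t     *slim_flow_open(slim_session_t *s,
                                slim_endpoint_t *from, slim_endpoint_t *to);
int              slim_flow_migrate(slim_flow_t *f, const char *new_interface);

/* Send to one endpoint or to the whole group with the same call. */
ssize_t          slim_send(slim_session_t *s, const void *buf, size_t len);

int              slim_session_close(slim_session_t *s);
```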
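Finally, as a small, concrete way to see which compute engines a heterogeneous node or "cluster on a chip" exposes, the sketch below enumerates CPUs, GPUs, and accelerators (MICs and FPGAs usually appear as CL_DEVICE_TYPE_ACCELERATOR) using only standard OpenCL 1.x host calls. It simply lists devices; it is not the evaluation methodology of the telescoping-architectures report.

```c
/* List the compute engines (CPUs, GPUs, accelerators such as MICs or
 * FPGAs) visible through OpenCL.  Uses standard OpenCL host calls only. */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platforms[8];
    cl_uint nplat = 0;
    if (clGetPlatformIDs(8, platforms, &nplat) != CL_SUCCESS || nplat == 0) {
        fprintf(stderr, "no OpenCL platforms found\n");
        return 1;
    }

    for (cl_uint p = 0; p < nplat; ++p) {
        cl_device_id devices[16];
        cl_uint ndev = 0;
        if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL,
                           16, devices, &ndev) != CL_SUCCESS)
            continue;                     /* platform with no usable devices */

        for (cl_uint d = 0; d < ndev; ++d) {
            char name[256];
            cl_device_type type;
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME,
                            sizeof(name), name, NULL);
            clGetDeviceInfo(devices[d], CL_DEVICE_TYPE,
                            sizeof(type), &type, NULL);
            printf("platform %u: %s [%s]\n", p, name,
                   (type & CL_DEVICE_TYPE_GPU)         ? "GPU" :
                   (type & CL_DEVICE_TYPE_CPU)         ? "CPU" :
                   (type & CL_DEVICE_TYPE_ACCELERATOR) ? "accelerator" :
                                                         "other");
        }
    }
    return 0;
}
```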