Computer Science Technical Reports
The Department of Computer Science collection of technical
reports began in 1973. Please use the subject headings listed below for all submissions.
Subject Headings:
- Algorithms
- Big Data
- Bioinformatics
- Computational Biology
- Computational Science and Engineering
- Computer Graphics/Animation
- Computer Science Education
- Computer Systems
- Cyberarts
- Cybersecurity
- Data and Text Mining
- Digital Education
- Digital Libraries
- Discrete Event Simulation
- High Performance Computing
- Human Computer Interaction
- Information Retrieval
- Machine Learning
- Mathematical Programming
- Mathematical Software
- Modeling and Simulation
- Networking
- Numerical Analysis
- Parallel and Distributed Computing
- Problem Solving Environments
- Software Engineering
- Theoretical Computer Science
- Virtual/Augmented Reality
- Visualization
Browsing Computer Science Technical Reports by Subject "Architecture"
- Accelerating Workloads on FPGAs via OpenCL: A Case Study with OpenDwarfs
  Verma, Anshuman; Helal, Ahmed E.; Krommydas, Konstantinos; Feng, Wu-chun (Department of Computer Science, Virginia Polytechnic Institute & State University, 2016-05-13)
  For decades, the streaming architecture of FPGAs has delivered accelerated performance across many application domains, such as option pricing solvers in finance, computational fluid dynamics in oil and gas, and packet processing in network routers and firewalls. However, this performance comes at the expense of programmability. FPGA developers use hardware design languages (HDLs) to implement the application data and control path and to design hardware modules for computational pipelines, memory management, synchronization, and communication. This process requires extensive knowledge of logic design, design automation tools, and low-level details of FPGA architecture, and it consumes significant development time and effort. To address this lack of programmability of FPGAs, OpenCL provides an easy-to-use and portable programming model for CPUs, GPUs, APUs, and now, FPGAs. Although this significantly improves programmability, an optimized GPU implementation of a kernel may lack performance portability on an FPGA. To improve the performance of OpenCL kernels on FPGAs, we identify general techniques to optimize OpenCL kernels for FPGAs under device-specific hardware constraints. We then apply these optimization techniques to the OpenDwarfs benchmark suite, which has diverse parallelism profiles and memory access patterns, in order to evaluate the effectiveness of the optimizations in terms of performance and resource utilization. Finally, we present the performance of structured grids and N-body dwarf-based benchmarks in the context of various optimizations, along with their potential re-factoring. We find that careful design of kernels for FPGAs can result in a highly efficient pipeline achieving 91% of theoretical throughput for the structured grids dwarf.
  Index Terms: OpenDwarfs; FPGA; OpenCL; GPU; MIC; Accelerators; Performance Portability
- AutoMatch: Automated Matching of Compute Kernels to Heterogeneous HPC Architectures
  Helal, Ahmed E.; Feng, Wu-chun; Jung, Changhee; Hanafy, Yasser Y. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2016-12-13)
  HPC systems contain a wide variety of heterogeneous computing resources, ranging from general-purpose CPUs to specialized accelerators. Porting sequential applications to such systems for achieving high performance requires significant software and hardware expertise as well as extensive manual analysis of both the target architectures and applications to decide the best performing architecture and implementation technique for each application. To streamline this tedious process, this paper presents AutoMatch, a tool for automated matching of compute kernels to heterogeneous HPC architectures. AutoMatch analyzes the sequential application code and automatically predicts the performance of the best parallel implementation of its compute kernels on different hardware architectures. AutoMatch leverages such prediction results to identify the best device for each kernel from a set of devices including multi-core CPUs and many-core GPUs. In addition, it estimates the relative execution cost between the different architectures to drive a workload distribution scheme, which enables end users to efficiently exploit the available compute resources across multiple heterogeneous architectures. We demonstrate the efficacy of AutoMatch, using a set of open-source HPC applications and benchmarks with different parallelism profiles and memory-access patterns. The empirical evaluation shows that AutoMatch is highly accurate across five different heterogeneous architectures, identifying the best architecture for each workload in 96% of the test cases, and its workload distribution scheme has a comparable performance to a profiling-driven oracle.
- Bridging the Performance-Programmability Gap for FPGAs via OpenCL: A Case Study with OpenDwarfs
  Krommydas, Konstantinos; Helal, Ahmed E.; Verma, Anshuman; Feng, Wu-chun (Department of Computer Science, Virginia Polytechnic Institute & State University, 2016-05-13)
  For decades, the streaming architecture of FPGAs has delivered accelerated performance across many application domains, such as option pricing solvers in finance, computational fluid dynamics in oil and gas, and packet processing in network routers and firewalls. However, this performance has come at the significant expense of programmability, i.e., the performance-programmability gap. In particular, FPGA developers use hardware design languages (HDLs) to implement the application data path and to design hardware modules for computation pipelines, memory management, synchronization, and communication. This process requires extensive low-level knowledge of the target FPGA architecture and consumes significant development time and effort. To address this lack of programmability of FPGAs, OpenCL provides an easy-to-use and portable programming model for CPUs, GPUs, APUs, and now, FPGAs. However, this significantly improved programmability can come at the expense of performance; that is, there still remains a performance-programmability gap. To improve the performance of OpenCL kernels on FPGAs, and thus bridge the performance-programmability gap, we identify general techniques to optimize OpenCL kernels for FPGAs under device-specific hardware constraints. We then apply these optimization techniques to the OpenDwarfs benchmark suite, with its diverse parallelism profiles and memory access patterns, in order to evaluate the effectiveness of the optimizations in terms of performance and resource utilization. Finally, we present the performance of the optimized OpenDwarfs, along with their potential re-factoring, to bridge the performance gap between programming in OpenCL and programming in an HDL.
  Index Terms: OpenDwarfs; FPGA; OpenCL; GPU; GPGPU; MIC; Accelerators; Performance Portability
- Telescoping Architectures: A Methodology for Evaluating Next-Generation Heterogeneous Computing
  Krommydas, Konstantinos; Feng, Wu-chun (Department of Computer Science, Virginia Polytechnic Institute & State University, 2016-05-13)
  Architectural innovation has telescoped the HPC community from the commodity (Beowulf) cluster in a machine room, i.e., a multi-node system with Ethernet interconnect, to a commodity cluster on a chip, i.e., multicore CPU with an on-die interconnect. We project that this “telescoping architecture” will apply more broadly to heterogeneous computing, namely from heterogeneous clusters like Tianhe-2 in a machine room to heterogeneous clusters on a chip. To that end, we present an experimental study that extends the notion of telescoping architectures to identify the ideal mixture of compute engines (CEs) and the number of such CEs on a chip to create a heterogeneous “cluster on a chip” (CoC). Specifically, we experiment with heterogeneous architectures that contain single or multiple instances of CPUs, GPUs, Intel MICs, and FPGAs to demonstrate their performance efficacy given continuing advances in hardware technology, software, tools, and run-time support.
  Index Terms: architecture; microprocessor design; heterogeneous computing; dwarfs; motifs; system on a chip