Computer Science Technical Reports
The Department of Computer Science collection of technical reports began in 1973. Please use the subject headings listed below for all submissions.
Subject Headings:
- Algorithms
- Big Data
- Bioinformatics
- Computational Biology
- Computational Science and Engineering
- Computer Graphics/Animation
- Computer Science Education
- Computer Systems
- Cyberarts
- Cybersecurity
- Data and Text Mining
- Digital Education
- Digital Libraries
- Discrete Event Simulation
- High Performance Computing
- Human Computer Interaction
- Information Retrieval
- Machine Learning
- Mathematical Programming
- Mathematical Software
- Modeling and Simulation
- Networking
- Numerical Analysis
- Parallel and Distributed Computing
- Problem Solving Environments
- Software Engineering
- Theoretical Computer Science
- Virtual/Augmented Reality
- Visualization
Browsing Computer Science Technical Reports by Subject "Computer Systems"
Now showing 1 - 8 of 8
- AutoMatch: Automated Matching of Compute Kernels to Heterogeneous HPC Architectures. Helal, Ahmed E.; Feng, Wu-chun; Jung, Changhee; Hanafy, Yasser Y. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2016-12-13). HPC systems contain a wide variety of heterogeneous computing resources, ranging from general-purpose CPUs to specialized accelerators. Porting sequential applications to such systems for high performance requires significant software and hardware expertise as well as extensive manual analysis of both the target architectures and the applications to decide the best-performing architecture and implementation technique for each application. To streamline this tedious process, this paper presents AutoMatch, a tool for automated matching of compute kernels to heterogeneous HPC architectures. AutoMatch analyzes the sequential application code and automatically predicts the performance of the best parallel implementation of its compute kernels on different hardware architectures. AutoMatch leverages these predictions to identify the best device for each kernel from a set of devices including multi-core CPUs and many-core GPUs. In addition, it estimates the relative execution cost between the different architectures to drive a workload-distribution scheme, which enables end users to efficiently exploit the available compute resources across multiple heterogeneous architectures. We demonstrate the efficacy of AutoMatch using a set of open-source HPC applications and benchmarks with different parallelism profiles and memory-access patterns. The empirical evaluation shows that AutoMatch is highly accurate across five different heterogeneous architectures, identifying the best architecture for each workload in 96% of the test cases, and that its workload-distribution scheme performs comparably to a profiling-driven oracle.
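As a rough illustration of the matching and workload-distribution ideas described in the AutoMatch entry above, the sketch below picks the fastest device for each kernel and splits work in proportion to predicted device speed. The kernel names and predicted runtimes are invented placeholders, not AutoMatch output, and the splitting rule is only one plausible policy.

```python
# Hypothetical sketch of device selection and workload splitting from
# per-kernel performance predictions. All numbers below are illustrative.

predicted_runtime = {                  # predicted seconds per kernel on each device (assumed)
    "stencil": {"cpu": 12.0, "gpu": 3.0},
    "spmv":    {"cpu": 5.0,  "gpu": 4.0},
    "reduce":  {"cpu": 2.0,  "gpu": 2.5},
}

def best_device(kernel):
    """Pick the device with the lowest predicted runtime for a kernel."""
    times = predicted_runtime[kernel]
    return min(times, key=times.get)

def workload_split(kernel):
    """Split one kernel's work across devices in proportion to their speed
    (1/runtime), so all devices are predicted to finish at the same time."""
    times = predicted_runtime[kernel]
    rates = {d: 1.0 / t for d, t in times.items()}
    total = sum(rates.values())
    return {d: r / total for d, r in rates.items()}

for k in predicted_runtime:
    print(k, "->", best_device(k), workload_split(k))
```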
- BenchPrime: Accurate Benchmark Subsetting with Optimized Clustering Algorithm Selection. Liu, Qingrui; Wu, Xiaolong; Kittinger, Larry; Levy, Markus; Jung, Changhee (Department of Computer Science, Virginia Polytechnic Institute & State University, 2018-08-24). This paper presents BenchPrime, an automated benchmark analysis toolset that is systematic and extensible to analyze the similarity and diversity of benchmark suites. BenchPrime takes multiple benchmark suites and their evaluation metrics as inputs and generates a hybrid benchmark suite comprising only essential applications. Unlike prior work, BenchPrime uses linear discriminant analysis rather than principal component analysis, and it selects the best clustering algorithm and the optimal number of clusters in an automated and metric-tailored way, thereby achieving high accuracy. In addition, BenchPrime ranks the benchmark suites in terms of their application-set diversity and estimates how unique each benchmark suite is compared to the other suites. As a case study, this work compares, for the first time, DenBench with MediaBench and MiBench using four different metrics to provide a multi-dimensional understanding of the benchmark suites. For each metric, BenchPrime measures to what degree DenBench applications are irreplaceable by those in MediaBench and MiBench. This provides a means of identifying an essential subset from the three benchmark suites without compromising the application balance of the full set. The experimental results show that the necessity of including DenBench applications varies across the target metrics and that significant redundancy exists among the three benchmark suites.
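The sketch below illustrates the kind of pipeline the BenchPrime entry describes: project benchmark feature vectors with linear discriminant analysis (supervised by suite label), pick a clustering algorithm and cluster count by silhouette score, and keep one representative application per cluster. The synthetic features, the two candidate clustering algorithms, and the silhouette criterion are assumptions for illustration, not BenchPrime's actual configuration.

```python
# Minimal LDA + automated-clustering-selection sketch (synthetic data).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))          # 60 apps x 8 performance counters (made up)
suite = np.repeat([0, 1, 2], 20)      # which suite each app came from

# LDA projection uses the suite labels; at most n_classes - 1 components.
Z = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, suite)

best = None
for name, make in [("kmeans", lambda k: KMeans(n_clusters=k, n_init=10, random_state=0)),
                   ("agglomerative", lambda k: AgglomerativeClustering(n_clusters=k))]:
    for k in range(2, 9):
        labels = make(k).fit_predict(Z)
        score = silhouette_score(Z, labels)
        if best is None or score > best[0]:
            best = (score, name, k, labels)

score, name, k, labels = best
# One representative per cluster: the member closest to its cluster centroid.
reps = []
for c in range(k):
    members = np.where(labels == c)[0]
    centroid = Z[members].mean(axis=0)
    reps.append(int(members[np.argmin(np.linalg.norm(Z[members] - centroid, axis=1))]))
print(f"chose {name} with k={k} (silhouette={score:.2f}); subset = apps {sorted(reps)}")
```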
- CommAnalyzer: Automated Estimation of Communication Cost on HPC Clusters Using Sequential Code. Helal, Ahmed E.; Jung, Changhee; Feng, Wu-chun; Hanafy, Yasser Y. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2017-08-14). MPI+X is the de facto standard for programming applications on HPC clusters. Performance and scalability on such systems are limited by the communication cost across different numbers of processes and compute nodes. Therefore, communication analysis tools play a critical role in the design and development of HPC applications. However, these tools require an MPI implementation of the application, which might not exist in the early stages of development due to the effort and time that parallel programming demands. This paper presents CommAnalyzer, an automated tool that generates a communication model from sequential code. CommAnalyzer uses novel compiler analysis techniques and graph algorithms to capture the inherent communication characteristics of sequential applications and to estimate their communication cost on HPC systems. Experiments with real-world, regular and irregular scientific applications demonstrate the utility of CommAnalyzer in estimating the communication cost on HPC clusters with more than 95% accuracy on average.
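To make the idea behind the CommAnalyzer entry concrete, the sketch below derives a communication estimate from the data-flow of a sequential kernel under an assumed block decomposition: every read that crosses an ownership boundary is counted as traffic between the corresponding ranks. The 1-D three-point stencil, grid size, rank count, and block decomposition are all assumptions for illustration, not output of the tool.

```python
# Toy communication-volume estimate for a sequential 1-D 3-point stencil,
# assuming a block decomposition of N points across P ranks.
N, P = 1024, 8                         # grid points and ranks (assumed)
BYTES_PER_VALUE = 8

def owner(i):
    """Rank that owns grid point i under a block decomposition."""
    return min(i * P // N, P - 1)

# Data-flow edges of the sequential kernel: point i reads i-1, i, i+1.
volume = {}                            # (src_rank, dst_rank) -> bytes
for i in range(N):
    for j in (i - 1, i, i + 1):
        if 0 <= j < N and owner(j) != owner(i):
            key = (owner(j), owner(i))
            volume[key] = volume.get(key, 0) + BYTES_PER_VALUE

total = sum(volume.values())
print(f"estimated halo traffic: {total} bytes across {len(volume)} rank pairs")
```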
- Compiler-Directed Failure Atomicity for Nonvolatile Memory. Liu, Qingrui; Izraelevitz, Joseph; Lee, Se Kwon; Scott, Michael L.; Noh, Sam H.; Jung, Changhee (Department of Computer Science, Virginia Polytechnic Institute & State University, 2019-07-15). This paper presents iDO, a compiler-directed approach to failure atomicity with nonvolatile memory. Unlike most prior work, which instruments each store of persistent data for redo or undo logging, the iDO compiler identifies idempotent instruction sequences, whose re-execution is guaranteed to be side-effect-free, thereby eliminating the need to log every persistent store. Using an extension of prior work on JUSTDO logging, the compiler then arranges, during recovery from failure, to back up each thread to the beginning of the current idempotent region and re-execute to the end of the current failure-atomic section. This extension transforms JUSTDO logging from a technique of value only on hypothetical future machines with nonvolatile caches into a technique that also significantly outperforms state-of-the-art lock-based persistence mechanisms on current hardware during normal execution, while preserving very fast recovery times.
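The following toy simulation conveys the recovery idea the iDO entry describes at a language level, not the compiler itself: persist only a marker at the entry of each idempotent region instead of logging every store, and on failure re-execute from the start of the interrupted region to the end of the failure-atomic section. The dictionary standing in for nonvolatile memory, the region names, and the two-region section are all invented for illustration.

```python
# Toy model: per-region recovery points instead of per-store logging.
pmem = {"x": 0, "y": 0, "region": None}   # stand-in for nonvolatile memory

def region(name):
    """Record entry into an idempotent region as the recovery point."""
    pmem["region"] = name

def section(inputs, crash=False):
    """One failure-atomic section made of two idempotent regions. Each region
    only reads its inputs and overwrites its outputs, so redoing it is safe."""
    region("r1"); pmem["x"] = inputs["a"] + 1
    if crash:
        raise RuntimeError("power failure")
    region("r2"); pmem["y"] = pmem["x"] * 2

def recover(inputs):
    """Re-execute from the start of the interrupted region to the section end."""
    if pmem["region"] == "r1":
        section(inputs)                    # idempotent, so re-execution is safe
    elif pmem["region"] == "r2":
        region("r2"); pmem["y"] = pmem["x"] * 2

inputs = {"a": 41}
try:
    section(inputs, crash=True)
except RuntimeError:
    recover(inputs)
print(pmem)                                # x=42, y=84 despite the simulated failure
```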
- A Composable Workflow for Productive FPGA Computing via Whole-Program Analysis and Transformation (with Code Excerpts). Sathre, Paul; Helal, Ahmed E.; Feng, Wu-chun (Department of Computer Science, Virginia Polytechnic Institute & State University, 2018-07-24). We present a composable workflow to enable highly productive heterogeneous computing on FPGAs. The workflow consists of a trio of static analysis and transformation tools: (1) a whole-program, source-to-source translator that transforms existing parallel code to OpenCL, (2) a set of OpenCL kernel linters, which target FPGAs to detect possible semantic errors and performance traps, and (3) a whole-program OpenCL linter that validates the host-to-device interface of OpenCL programs. The workflow promotes rapid realization of heterogeneous parallel code across a multitude of heterogeneous computing environments, particularly FPGAs, by providing complementary tools for automatic CUDA-to-OpenCL translation and compile-time OpenCL validation in advance of very expensive compilation, placement, and routing on FPGAs. The proposed tools perform whole-program analysis and transformation to tackle real-world, large-scale parallel applications. The efficacy of the workflow tools is demonstrated via a representative translation and analysis of a sizable CUDA finite-automata processing engine as well as the analysis and validation of an additional 96 OpenCL benchmarks.
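The toy below gives a feel for the kernel-linting stage described in the entry above: scan OpenCL kernel source for a few patterns that can hurt FPGA pipelines and report them before any expensive compilation and place-and-route. The example kernel and the rules are illustrative stand-ins, not the actual checks performed by the workflow's linters.

```python
# Toy OpenCL kernel "linter" with a few illustrative FPGA-oriented checks.
import re

KERNEL = r"""
__kernel void scale(__global float *out, __global const float *in, int n) {
    int i = get_global_id(0);
    if (i < n) out[i] = in[i % n] * 2.0f;
}
"""

RULES = [
    (r"__global\s+(?!.*restrict)[^,)]*\*", "global pointer without 'restrict' (may limit pipelining)"),
    (r"[^%]%[^%=]", "integer modulo in kernel body (expensive on FPGA fabric)"),
    (r"\bprintf\s*\(", "printf in kernel (unsupported or very costly on FPGAs)"),
]

for lineno, line in enumerate(KERNEL.splitlines(), start=1):
    for pattern, message in RULES:
        if re.search(pattern, line):
            print(f"line {lineno}: {message}")
```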
- ELASTIN: Achieving Stagnation-Free Intermittent Computation with Boundary-Free Adaptive Execution. Choi, Jongouk; Joe, Hyunwoo; Kim, Yongjoo; Jung, Changhee (Department of Computer Science, Virginia Polytechnic Institute & State University, 2019-07-15). This paper presents ELASTIN, a stagnation-free intermittent computing system for energy-harvesting devices that ensures forward progress in the presence of frequent power outages without partitioning the program into recoverable regions or tasks. ELASTIN leverages both timer-based checkpointing of volatile registers and copy-on-write mappings of nonvolatile memory pages to restore them in the wake of a power failure. During each checkpoint interval, ELASTIN tracks memory writes on a per-page basis and backs up the original page using custom software-controlled memory protection without an MMU or TLB. When a new interval starts at each timer expiration, ELASTIN clears the write permission of all pages written during the previous interval and checkpoints all registers, including the program counter, as a recovery point. In particular, ELASTIN dynamically reconfigures both the checkpoint interval and the page size to achieve stagnation-free intermittent computation and maximize forward progress across power outages. Experiments on TI's MSP430 board with energy-harvesting traces show that ELASTIN outperforms the state-of-the-art scheme by 3.5X on average (up to orders of magnitude speedup) and guarantees forward progress.
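The simplified software model below illustrates the copy-on-write checkpointing idea from the ELASTIN entry: back up a page the first time it is written within an interval, commit the backups away at each timer-driven checkpoint, and roll dirty pages back on power failure. It is a Python simulation under assumed page and memory sizes, not the MSP430 implementation, and it omits register checkpointing and the adaptive interval/page-size reconfiguration.

```python
# Toy copy-on-write checkpoint/rollback model for "nonvolatile" memory.
PAGE = 4                                   # words per page (assumed)
nvm = list(range(16))                      # stand-in for nonvolatile memory
backups, dirty = {}, set()                 # per-interval page backups

def write(addr, value):
    page = addr // PAGE
    if page not in dirty:                  # first write this interval: back up the page
        backups[page] = nvm[page*PAGE:(page+1)*PAGE].copy()
        dirty.add(page)
    nvm[addr] = value

def checkpoint():
    """Timer expiry: the interval's writes become durable; start tracking afresh."""
    backups.clear(); dirty.clear()

def power_failure():
    """Roll every dirty page back to its backed-up contents."""
    for page, data in backups.items():
        nvm[page*PAGE:(page+1)*PAGE] = data
    backups.clear(); dirty.clear()

write(1, 100); write(2, 200)
checkpoint()                               # interval boundary: these writes persist
write(5, 500)                              # next interval...
power_failure()                            # ...is rolled back
print(nvm)                                 # [0, 100, 200, 3, 4, 5, ...]
```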
- ETH: A Framework for the Design-Space Exploration of Extreme-Scale Visualization. Abrams, Gregory; Adhinarayanan, Vignesh; Feng, Wu-chun; Rogers, David; Ahrens, James; Wilson, Luke (Department of Computer Science, Virginia Polytechnic Institute & State University, 2017-09-29). As high-performance computing (HPC) moves towards the exascale era, large-scale scientific simulations are generating enormous datasets. A variety of techniques (e.g., in-situ methods, data sampling, and compression) have been proposed to help visualize these large datasets under various constraints such as storage, power, and energy. However, evaluating these techniques and understanding the various trade-offs (e.g., performance, efficiency, quality) remains a challenging task. To enable investigation and optimization across such trade-offs, we propose a toolkit for the early-stage exploration of visualization and rendering approaches, job layout, and visualization pipelines. Our framework covers a broader parameter space than existing visualization applications such as ParaView and VisIt. It also promotes the study of simulation-visualization coupling strategies through a data-centric approach, rather than requiring the simulation code itself. Furthermore, with experimentation on an extensively instrumented supercomputer, we study more metrics of interest than was previously possible. Overall, our framework will help to answer important what-if scenarios and trade-off questions in the early stages of pipeline development, helping scientists make informed choices about how best to couple a simulation code with visualization at extreme scale.
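The sketch below shows the kind of design-space sweep this entry's framework is meant to enable: enumerate coupling strategy, sampling rate, and compression level, score each configuration, and keep the Pareto-optimal ones. The parameters and the cost and quality formulas are toy assumptions for illustration, not the framework's models or metrics.

```python
# Toy design-space sweep with a Pareto filter over (I/O cost, quality).
from itertools import product

configs = []
for coupling, sample, compress in product(["in-situ", "post-hoc"],
                                          [0.01, 0.1, 1.0],
                                          [0, 1, 2]):
    io_cost = (0.2 if coupling == "in-situ" else 1.0) * sample / (1 + compress)
    quality = sample * (1.0 - 0.1 * compress)
    configs.append(((coupling, sample, compress), io_cost, quality))

# Keep configurations not dominated by any other (lower cost and higher quality).
pareto = [c for c in configs
          if not any(o[1] <= c[1] and o[2] >= c[2] and o != c for o in configs)]
for cfg, cost, q in sorted(pareto, key=lambda x: x[1]):
    print(f"{cfg}: est. I/O cost {cost:.3f}, est. quality {q:.2f}")
```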
- MOANA: Modeling and Analyzing I/O Variability in Parallel System Experimental Design. Cameron, Kirk W.; Anwar, Ali; Cheng, Yue; Xu, Li; Li, Bo; Ananth, Uday; Lux, Thomas; Hong, Yili; Watson, Layne T.; Butt, Ali R. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2018-04-19). Exponential increases in complexity and scale make variability a growing threat to sustaining HPC performance at exascale. Performance variability in HPC I/O is common, acute, and formidable. We take the first step towards comprehensively studying linear and nonlinear approaches to modeling HPC I/O system variability. We create a modeling and analysis approach (MOANA) that predicts HPC I/O variability for thousands of software and hardware configurations on highly parallel shared-memory systems. Our findings indicate that nonlinear approaches to I/O variability prediction are an order of magnitude more accurate than linear regression techniques. We demonstrate the use of MOANA to accurately predict the confidence intervals of unmeasured I/O system configurations for a given number of repeat runs, enabling users to quantitatively balance experiment duration with statistical confidence.
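In the spirit of the linear-versus-nonlinear comparison described in the MOANA entry, the sketch below fits a linear regression and a nonlinear regressor to synthetic configuration-versus-variability data and compares their prediction error. The synthetic data generator, the feature names, and the choice of a random forest as the nonlinear model are assumptions; MOANA's actual models and features may differ.

```python
# Toy comparison of linear vs. nonlinear models for predicting I/O variability.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
X = rng.uniform(size=(500, 4))        # e.g., threads, file size, stride, frequency (scaled)
# Variability responds nonlinearly to the configuration in this toy generator.
y = np.sin(6 * X[:, 0]) * X[:, 1] + 0.3 * X[:, 2] ** 2 + 0.05 * rng.normal(size=500)

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
for name, model in [("linear regression", LinearRegression()),
                    ("nonlinear (random forest)", RandomForestRegressor(random_state=0))]:
    err = mean_absolute_error(yte, model.fit(Xtr, ytr).predict(Xte))
    print(f"{name}: MAE = {err:.3f}")
```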