Browsing by Author "Cameron, Kirk W."
- Accelerated Storage Systems. Khasymski, Aleksandr Sergeev (Virginia Tech, 2015-03-11). Today's large-scale, high-performance, data-intensive applications put tremendous stress on data centers to store, index, and retrieve large amounts of data. Exemplified by technologies such as social media, photo and video sharing, and e-commerce, the rise of the real-time web demands that data stores support minimal latencies, always-on availability, and ever-growing capacity. These requirements have fostered the development of a large number of high-performance storage systems, arguably the most important of which are Key-Value (KV) stores. An emerging approach to achieving low latency and high throughput in this space utilizes both DRAM and flash: an efficient index for the data is kept in memory, minimizing accesses to flash, where both keys and values are stored. Many proposals have examined how to improve KV store performance in this area. However, these systems have shortcomings, including expensive sorting and excessive read and write amplification, which shortens the life of the flash. Another recent trend equips large-scale deployments with energy-efficient, high-performance co-processors, such as Graphics Processing Units (GPUs). Recent work has explored using GPUs to accelerate compute-intensive I/O workloads, including RAID parity generation, encryption, and compression. While this research has proven the viability of GPUs for these workloads, we argue that there are significant benefits to be had by developing methods and data structures for deep integration of GPUs inside the storage stack, in order to achieve better performance, scalability, and reliability. In this dissertation, we propose comprehensive frameworks that leverage emerging technologies, such as GPUs and flash-based SSDs, to accelerate modern storage systems. For our accelerator-based solution, we focus on developing a system that features deep integration of the GPU in a distributed parallel file system. We utilize a framework that builds on the resources available in the file system and coordinates the workload so as to minimize data movement across the PCIe bus while exposing data parallelism to maximize the potential for acceleration on the GPU. Our research aims to improve the overall reliability of a parallel file system (PFS) by developing distributed per-file parity generation that provides end-to-end data integrity and unprecedented flexibility. Finally, we design a high-performance KV store built on a novel data structure tailored to flash: it arranges data on flash so as to minimize write amplification, which wears out flash cells, and achieves very low read amplification through a trie index and a false positive filter.
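The index-in-memory, data-on-flash pattern described above can be illustrated with a toy sketch. The Python below is ours, not the dissertation's system: a dict stands in for the trie index, a Bloom-style bit array for the false positive filter, and an append-only list for the flash log; writes stay sequential and a lookup touches "flash" at most once.

```python
import hashlib

class FlashKVStore:
    """Toy sketch: in-memory index plus a membership filter in front of flash."""
    def __init__(self, nbits=1 << 20):
        self.nbits = nbits
        self.filter = bytearray(nbits // 8)  # Bloom-style false positive filter
        self.index = {}                      # stand-in for the in-memory trie index
        self.flash = []                      # stand-in for the append-only flash log

    def _bits(self, key):
        h = hashlib.sha256(key).digest()
        return [int.from_bytes(h[i:i + 4], "big") % self.nbits for i in (0, 4, 8)]

    def put(self, key, value):
        self.index[key] = len(self.flash)
        self.flash.append((key, value))      # appends keep flash writes sequential
        for b in self._bits(key):
            self.filter[b // 8] |= 1 << (b % 8)

    def get(self, key):
        # the filter rejects most absent keys without any flash access
        if not all((self.filter[b // 8] >> (b % 8)) & 1 for b in self._bits(key)):
            return None
        off = self.index.get(key)
        return None if off is None else self.flash[off][1]  # one "flash" read

store = FlashKVStore()
store.put(b"user:42", b"alice")
print(store.get(b"user:42"), store.get(b"user:99"))  # b'alice' None
```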
- An Adaptive Framework for Managing Heterogeneous Many-Core Clusters. Rafique, Muhammad Mustafa (Virginia Tech, 2011-09-22). The computing needs and the input and result datasets of modern scientific and enterprise applications are growing exponentially. To support such applications, High-Performance Computing (HPC) systems need to employ thousands of cores and innovative data management. At the same time, an emerging trend in designing HPC systems is to leverage specialized asymmetric multicores, such as the IBM Cell and AMD Fusion APUs, and commodity computational accelerators, such as programmable GPUs, which exhibit excellent price-to-performance ratios as well as much-needed energy efficiency. While such accelerators have been studied in detail as stand-alone computational engines, integrating them into large-scale distributed systems with heterogeneous computing resources for data-intensive computing presents unique challenges and trade-offs. Traditional programming and resource management techniques cannot be directly applied to many-core accelerators in heterogeneous distributed settings, given the complex and custom instruction set architectures, memory hierarchies, and I/O characteristics of different accelerators. In this dissertation, we explore the design space of using commodity accelerators, specifically the IBM Cell and programmable GPUs, in distributed settings for data-intensive computing, and propose an adaptive framework for programming and managing heterogeneous clusters. The proposed framework provides a MapReduce-based extended programming model that distributes tasks between asymmetric compute nodes by considering workload characteristics and the capabilities of individual compute nodes. The framework provides efficient data prefetching techniques that leverage general-purpose cores to stage input data in the private memories of the specialized cores. We also explore an advanced layered-architecture-based software engineering approach and provide mixin-layer-based reusable software components to enable easy and quick deployment of heterogeneous clusters. The framework also provides multiple resource management and scheduling policies under different constraints, e.g., energy-aware and QoS-aware, to support executing concurrent applications on multi-tenant heterogeneous clusters. When applied to representative applications and benchmarks, our framework yields significantly improved performance, in terms of programming efficiency and resource management, compared to conventional, hand-tuned approaches to programming and managing accelerator-based heterogeneous clusters.
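As a rough illustration of capability-aware task distribution (our sketch, not the framework's API), tasks can be split across asymmetric nodes in proportion to a per-node capability score:

```python
def assign_tasks(tasks, nodes):
    """Split tasks across heterogeneous nodes proportionally to capability."""
    total = sum(score for _, score in nodes)
    shares, start = {}, 0
    for name, score in nodes:
        count = round(len(tasks) * score / total)
        shares[name] = tasks[start:start + count]
        start += count
    shares[nodes[-1][0]] += tasks[start:]    # absorb any rounding leftover
    return shares

# hypothetical scores: a GPU node rated at 4x a general-purpose node
print(assign_tasks(list(range(10)), [("cpu0", 1), ("gpu0", 4)]))
# {'cpu0': [0, 1], 'gpu0': [2, 3, 4, 5, 6, 7, 8, 9]}
```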
- Closure: Transforming Source Code for Faster Fuzzing. Paterson, Ian G. (Virginia Tech, 2022-05-27). Fuzzing, the method of generating inputs to run on a target program while monitoring its execution, is a widely adopted and pragmatic methodology for bug hunting as a means of software hardening. Improvements in throughput have repeatedly proven critical to increasing the rate at which new bugs can be discovered. Persistent fuzzing, which keeps the fuzz target alive via looping, provides increased throughput at the cost of manually developing harnesses that account for invalid states and cover the program's code base, whereas fork-based fuzzing relies on forking to reset the state that accrues from looping over the same piece of code multiple times. Stale state can waste fuzzing effort, since certain areas of code may be conditionally ignored due to a stale global. I propose Closure, a toolset that enables programs to run at persistent speeds while avoiding the downsides of stale state and other bottlenecks associated with persistent fuzzing.
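The stale-state hazard is easy to see in miniature. In this sketch (ours, not Closure's code), a persistent loop keeps one process alive across inputs, so a global set by an early input masks a branch for every later input; a fresh process per input would not have this problem:

```python
seen_magic = False                  # global state the target mutates

def target(data: bytes) -> bool:
    global seen_magic
    if data.startswith(b"MAGIC"):
        seen_magic = True
    # conditionally dead once seen_magic goes stale across iterations:
    return (not seen_magic) and data.endswith(b"!")

def persistent_fuzz(inputs):
    # one process loops over all inputs: high throughput, carried-over state
    return [target(d) for d in inputs]

print(persistent_fuzz([b"MAGIC", b"boom!"]))  # [False, False]; a fresh
# process per input would have reached the branch on b"boom!"
```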
- COPS: A Framework for Consumer Oriented Proportional-share Scheduling. Deodhar, Abhijit Anant (Virginia Tech, 2007-05-14). Scheduling is an important aspect of operating systems because it has a direct impact on system performance. Most existing general-purpose schedulers use a priority-based scheme to schedule processes. Such priority-based mechanisms cannot guarantee proportional fairness for every process. Proportional share schedulers maintain fairness among tasks based on given weight values. In both of these scheduler types, the scheduling decision is made per process. However, system usage policies are typically set on a per-consumer basis, where a consumer represents a group of related processes that may belong to the same application or user. The COPS framework uses the idea of consumer sets to group processes. Its design guarantees system usage per consumer, based on relative weights. We have added a share management layer on top of a proportional share scheduler to ease the administrative job of share assignment for these consumer sets. We have evaluated our system in real-world scenarios and show that the CPU usage for consumer sets with CPU-bound processes complies with the administrator-defined policy goals.
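The consumer-set idea reduces to a simple invariant: CPU share is fixed at the consumer level by relative weight, then subdivided among the consumer's processes. A hypothetical sketch (names and weights ours):

```python
consumers = {"web":   {"weight": 3, "procs": ["httpd1", "httpd2"]},
             "batch": {"weight": 1, "procs": ["job1", "job2", "job3"]}}

total = sum(c["weight"] for c in consumers.values())
for name, c in consumers.items():
    share = c["weight"] / total           # guaranteed regardless of process count
    per_proc = share / len(c["procs"])    # subdivided within the consumer set
    print(f"{name}: {share:.0%} of CPU, {per_proc:.1%} per process")
# web: 75% of CPU, 37.5% per process
# batch: 25% of CPU, 8.3% per process
```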
- CPU MISER: A Performance-Directed, Run-Time System for Power-Aware Clusters. Ge, Rong; Feng, Xizhou; Feng, Wu-chun; Cameron, Kirk W. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2007). Performance and power are two primary design constraints in today's high-end computing systems. Because of the inherent dependency between performance and power, reducing power consumption without impacting system performance is a challenge for the HPC community. In this paper, we present a run-time system, as well as its underlying performance model, for performance-directed, power-aware cluster computing. Experimental results based on physical measurements show that the NPB benchmarks achieve up to 36% energy savings and up to 21% performance gains. On average, across 9 NPB benchmarks, our run-time system delivers 10.7% energy savings with 1.2% performance loss, and a 1.59x improvement in ED2P over CPUSPEED. We also show that our system is performance-directed in the sense that the performance loss for most applications stays within the user-specified limit. We attribute these promising results to accurate performance modeling and prediction, and to effective performance control techniques.
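For readers unfamiliar with the metric: ED2P is the energy-delay-squared product, so the 1.59x figure compares the two systems as follows, assuming the standard definition (E = energy, D = execution time):

```latex
\mathrm{ED^2P} = E \cdot D^2, \qquad
\text{improvement} =
\frac{E_{\text{CPUSPEED}} \cdot D_{\text{CPUSPEED}}^2}
     {E_{\text{CPU MISER}} \cdot D_{\text{CPU MISER}}^2} \approx 1.59 .
```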
- Designing Practical Software Bug Detectors Using Commodity Hardware and Common Programming Patterns. Zhang, Tong (Virginia Tech, 2020-01-13). Software bugs can cost millions and affect people's daily lives. However, many bug detection tools are not always practical in reality, which hinders their wide adoption. There are three main concerns regarding existing bug detectors: 1) run-time overhead in dynamic bug detectors, 2) space overhead in dynamic bug detectors, and 3) scalability and precision issues in static bug detectors. With those in mind, we propose to: 1) leverage commodity hardware to reduce run-time overhead, 2) reuse metadata maintained by one bug detector to detect other types of bugs, reducing space overhead, and 3) apply programming idioms to static analyses, improving scalability and precision. We demonstrate the effectiveness of these three approaches using data race bugs, memory safety bugs, and permission check bugs, respectively. First, we leverage commodity hardware transactional memory (HTM) selectively, using the dynamic data race detector only if necessary, thereby reducing the overhead from 11.68x to 4.65x. We then present a production-ready data race detector, which incurs only a 2.6% run-time overhead, by using performance monitoring units (PMUs) for online memory access sampling and offline unsampled memory access reconstruction. Second, for memory safety bugs, which are more common than data races, we provide practical temporal memory safety on top of the spatial memory safety of the Intel MPX in a memory-efficient manner without additional hardware support. We achieve this by reusing the existing metadata and checks already available in Intel MPX-instrumented applications, thereby offering full memory safety at only 36% memory overhead. Finally, we design a scalable and precise function pointer analysis tool leveraging indirect call usage patterns in the Linux kernel. We applied the tool to the detection of permission check bugs; the detector found 14 previously unknown bugs within a limited time budget.
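The PMU-sampling idea can be caricatured in a few lines (ours; the real detector reconstructs unsampled accesses offline and performs proper happens-before analysis): record a small fraction of memory accesses online, then scan for conflicting pairs offline.

```python
import random

LOG = []

def on_access(thread, addr, is_write, rate=0.01):
    # PMU-style sampling: record ~1% of accesses to keep run-time overhead low
    if random.random() < rate:
        LOG.append((thread, addr, is_write))

def find_conflicts(log):
    # offline pass: different threads, same address, at least one write
    return [(a, b) for i, a in enumerate(log) for b in log[i + 1:]
            if a[1] == b[1] and a[0] != b[0] and (a[2] or b[2])]
```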
- Designing RDMA-based Efficient Communication for GPU Remoting. Bhandare, Shreya Amit (Virginia Tech, 2023-08-24). The use of General Purpose Graphics Processing Units (GPGPUs) has become crucial for accelerating high-performance applications. However, the procurement, setup, and maintenance of GPUs can be costly, and their continuous energy consumption poses additional challenges. Moreover, many applications exhibit suboptimal GPU utilization. To address these concerns, GPU virtualization techniques have been proposed. Among them, GPU Remoting stands out as a promising technology that enables applications to transparently harness the computational capabilities of GPUs remotely. GVirtuS, a GPU Remoting software, facilitates transparent and hypervisor-independent access to GPGPUs within virtual machines. This research focuses on the middleware communication layer implemented in GVirtuS and presents a comprehensive redesign that leverages Remote Direct Memory Access (RDMA) technology. Experimental evaluations, conducted using a matrix multiplication application, demonstrate that the newly proposed protocol achieves approximately 50% lower execution time for data sizes from 1 MB to 16 MB, and around 12% lower execution time for sizes from 500 MB up to 1 GB. These findings highlight the significant performance improvements attained through the redesign of the communication layer in GVirtuS, showcasing its potential for enhancing GPU Remoting efficiency.
- Energy and Performance Models Enabling Design Space Exploration Using Domain Specific Languages. Umar, Mariam (Virginia Tech, 2018-05-25). With the advent of exascale architectures, maximizing performance while keeping energy consumption within reasonable limits has become one of the most critical design constraints. This constraint is particularly significant in light of the 20 MW power budget set by the U.S. Department of Energy for exascale supercomputing facilities. Therefore, understanding an application's characteristics, execution patterns, energy footprint, and the interactions among these aspects is critical to improving the application's performance as well as its utilization of the underlying resources. With conventional methods of analyzing performance and energy consumption trends, scientists are forced to limit themselves to a manageable number of design parameters. While these modeling techniques have catered to the needs of current high-performance computing systems, the complexity and scale of exascale systems demand that large-scale design-space-exploration techniques be developed to enable comprehensive analysis and evaluation. In this dissertation we present research on performance and energy modeling of current high-performance computing systems and future exascale systems. Our thesis focuses on design space exploration of current and future architectures in terms of their reconfigurability, an application's sensitivity to hardware characteristics (e.g., system clock, memory bandwidth), its execution patterns, its communication behavior, and its utilization of resources. Our research is aimed at understanding how to maximize the performance of exascale systems, minimize energy consumption, and understand the trade-offs between the two. We use analytical, statistical, and machine-learning approaches to develop accurate, portable, and scalable performance and energy models. We develop application and machine abstractions using Aspen (a domain specific language) to implement and evaluate our modeling techniques. As part of our research we develop and evaluate system-level performance and energy-consumption models that form part of an automated modeling framework, which analyzes application signatures to evaluate the sensitivity of reconfigurable hardware components for candidate exascale proxy applications. We also develop statistical and machine-learning-based models of an application's execution patterns on heterogeneous platforms. Finally, we propose a communication and computation modeling and mapping framework for exascale proxy architectures and evaluate it for an exascale proxy application. These models serve as external and internal extensions to Aspen, enabling proxy exascale architecture implementations and thus facilitating design space exploration of exascale systems.
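At its simplest, the kind of analytical model this line of work builds on combines a compute term and a memory term, roofline-style. The sketch below is ours and far coarser than the Aspen models described above; all numbers are hypothetical:

```python
def predicted_time(flops, bytes_moved, peak_flops, mem_bw):
    # runtime bounded by whichever resource saturates first
    return max(flops / peak_flops, bytes_moved / mem_bw)

def predicted_energy(time_s, avg_watts):
    return time_s * avg_watts               # Joules

t = predicted_time(flops=1e12, bytes_moved=4e11, peak_flops=1e13, mem_bw=2e11)
print(t, predicted_energy(t, avg_watts=300))  # 2.0 s (memory-bound), 600.0 J
```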
- Energy-aware Thread and Data Management in Heterogeneous Multi-Core, Multi-Memory Systems. Su, Chun-Yi (Virginia Tech, 2015-02-03). By 2004, microprocessor design had come to focus on multicore scaling (increasing the number of cores per die in each generation) as the primary strategy for improving performance. These multicore processors typically equip multiple memory subsystems to improve data throughput. In addition, these systems employ heterogeneous processors such as GPUs and heterogeneous memories such as non-volatile memory to improve performance, capacity, and energy efficiency. With the increasing volume of hardware resources and the system complexity caused by heterogeneity, future systems will require intelligent ways to manage hardware resources. Early research on improving performance and energy efficiency in heterogeneous, multi-core, multi-memory systems focused on tuning a single primitive, or at best a few primitives, in the system. The key limitation of past efforts is their lack of a holistic approach to resource management that balances the tradeoff between performance and energy consumption. In addition, the shift from simple, homogeneous systems to heterogeneous, multicore, multi-memory systems requires an in-depth understanding of efficient resource management for scalable execution, including new models that capture the interchange between performance and energy, smarter resource management strategies, and novel low-level performance/energy tuning primitives and runtime systems. Tuning an application to use available resources efficiently has become a daunting challenge, and automated resource management remains a dark art since the tradeoffs among programming, energy, and performance remain insufficiently understood. In this dissertation, I develop theories, models, and resource management techniques to enable energy-efficient execution of parallel applications through thread and data management in these heterogeneous multi-core, multi-memory systems. I study the effect of dynamic concurrency throttling on the performance and energy of multi-core, non-uniform memory access (NUMA) systems. I use critical path analysis to quantify memory contention in the NUMA memory system and determine thread mappings. In addition, I implement a runtime system that combines concurrency throttling and a novel thread mapping algorithm to manage thread resources and improve energy-efficient execution in multi-core, NUMA systems. I also propose an analytical model, based on queuing theory, that captures important factors in multi-core, multi-memory systems to quantify the tradeoff between performance and energy. The model considers these factors holistically, providing a general view of performance and energy consumption in contemporary systems. Finally, I focus on resource management for future heterogeneous memory systems, which may combine two heterogeneous memories to scale out memory capacity while maintaining reasonable power use. I present a new memory controller design that combines the best aspects of two baseline heterogeneous page management policies, migrating data between the two memories so as to optimize performance and energy.
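Concurrency throttling in miniature (our sketch, hypothetical numbers): given profiled (time, power) samples per thread count, pick the count that minimizes an energy-delay objective rather than raw speed:

```python
profiles = {4: (10.0, 80), 8: (6.0, 120), 16: (5.5, 200)}  # threads -> (s, W)

def energy_delay(t_s, watts):
    return (t_s * watts) * t_s             # E*D: 4 -> 8000, 8 -> 4320, 16 -> 6050

best = min(profiles, key=lambda n: energy_delay(*profiles[n]))
print(best)                                 # throttle to 8 threads, not 16
```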
- Evaluating MapReduce System Performance: A Simulation Approach. Wang, Guanying (Virginia Tech, 2012-08-27). The scale of data generated and processed is exploding in the Big Data era. The MapReduce system popularized by open-source Hadoop is a powerful tool for this exploding data problem and is widely employed in many areas involving large amounts of data. In many circumstances, hypothetical MapReduce systems must be evaluated, e.g., to provision a new MapReduce system for a certain performance goal, to upgrade a running system to meet increasing business demands, or to evaluate novel network topologies, new scheduling algorithms, or resource arrangement schemes. The traditional trial-and-error solution involves a time-consuming and costly process in which a real cluster is first built and then benchmarked. In this dissertation, we propose simulating MapReduce systems and evaluating hypothetical MapReduce systems via simulation. This simulation approach offers significantly lower turn-around time and lower cost than experiments. Simulation cannot entirely replace experiments, but it can be used as a preliminary step to reveal potential flaws and gain critical insights. We studied MapReduce systems in detail and developed a comprehensive performance model for MapReduce, including phase-level performance models for the sub-tasks of both map and reduce tasks and a model for resource contention between multiple concurrently running processes. Based on the performance model, we developed a comprehensive simulator for MapReduce, MRPerf. MRPerf is the first full-featured MapReduce simulator. It supports both workload simulation and resource contention, and it still offers the most complete feature set among all MapReduce simulators to date. Using MRPerf, we conducted two case studies, evaluating scheduling algorithms and shared storage in MapReduce, without building real clusters. Furthermore, to further integrate simulation and performance prediction into MapReduce systems and leverage predictions to improve system performance, we developed an online prediction framework for MapReduce, which periodically runs simulations within a live Hadoop MapReduce system. The framework can predict task execution within a window of the near future. These predictions can be used by other components in MapReduce systems to improve performance. Our results show that the framework achieves high prediction accuracy and incurs negligible overhead. We present two potential use cases: prefetching and a dynamically adapting scheduler.
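A back-of-the-envelope version of phase-level MapReduce modeling (ours; MRPerf models far more, including network topology and disk contention) composes per-wave task costs:

```python
import math

def job_time(n_maps, n_reduces, slots, t_map, t_shuffle, t_reduce):
    # tasks run in "waves" of up to `slots` concurrent tasks
    map_waves = math.ceil(n_maps / slots)
    reduce_waves = math.ceil(n_reduces / slots)
    return map_waves * t_map + t_shuffle + reduce_waves * t_reduce

print(job_time(n_maps=100, n_reduces=10, slots=20,
               t_map=30, t_shuffle=40, t_reduce=60))  # 5*30 + 40 + 60 = 250 s
```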
- An Evaluation of the Linux Virtual Memory Manager to Determine Suitability for Runtime Variation of Memory. Muthukumaraswamy Sivakumar, Vijay (Virginia Tech, 2007-02-02). Systems that support virtual memory virtualize the available physical memory so that the applications running on them operate under the assumption that more memory is available than is actually present. The memory managers of these systems manage the virtual and physical address spaces and are responsible for converting the virtual addresses used by applications to the physical addresses used by the hardware. These memory managers assume that the amount of physical memory is constant and does not change during their period of operation. Some operating scenarios, however, such as power conservation mechanisms and virtual machine monitors, require the ability to vary the physical memory available at runtime, invalidating this assumption. In this work we evaluate the suitability of the Linux memory manager, which assumes that the available physical memory is constant, for varying memory at run time. We have implemented an infrastructure over the Linux 2.6.11 kernel that enables the user to vary the physical memory available to the system. The available physical memory is logically divided into banks, and each bank can be turned on or off independently of the others using new system calls we have added to the kernel. Apart from adding support for the new system calls, other changes had to be made to the Linux memory manager to support runtime variation of memory. To evaluate this suitability, we have performed experiments with varying memory sizes on both the modified and the unmodified kernels. We observed that the design of the existing memory manager is not well suited to runtime variation of memory, and we provide suggestions to make it better suited for such purposes. Even though applications running on systems with virtual memory do not use physical memory directly and are not aware of the physical addresses they use, the amount of physical memory available affects their performance. The results of our experiments have helped us study the influence that the amount of available physical memory has on the performance of various types of applications. These results can be used in scenarios requiring runtime variation of memory to do so with the least degradation in application performance.
- Exploiting Multigrain Parallelism in Pairwise Sequence Search on Emergent CMP Architectures. Aji, Ashwin Mandayam (Virginia Tech, 2008-05-30). With emerging hybrid multi-core and many-core compute platforms delivering unprecedented performance within a single chip and making rapid strides toward the commodity processor market, they are widely expected to replace the multi-core processors in existing High-Performance Computing (HPC) infrastructures, such as large-scale clusters, grids, and supercomputers. Meanwhile, in bioinformatics, the size of genomic databases is doubling every 12 months, so the need for novel approaches to parallelizing sequence search algorithms has become increasingly important. This thesis takes a significant step toward bridging the gap between software and hardware by presenting an efficient and scalable model for accelerating a popular sequence alignment algorithm by exploiting the multigrain parallelism exposed by emerging multiprocessor architectures. Specifically, we parallelize a dynamic programming algorithm, Smith-Waterman, both within and across multiple Cell Broadband Engines and within an nVIDIA GeForce General Purpose Graphics Processing Unit (GPGPU). Cell Broadband Engine: We parallelize Smith-Waterman within a Cell node by performing a blocked data decomposition of the dynamic programming matrix, followed by pipelined execution of the blocks across the synergistic processing elements (SPEs) of the Cell. We also introduce novel optimization methods that fully utilize the vector processing power of the SPE. As a result, we achieve near-linear scalability, i.e., near-constant efficiency, for up to 16 SPEs on dual-Cell QS20 blades, and our design scales readily to more cores, if available. We further extend this design to accelerate Smith-Waterman across nodes on both the IBM QS20 and PlayStation 3 Cell cluster platforms, achieving a maximum speedup of 44 over execution on a single Cell node. We then introduce an analytical model to accurately estimate the execution times of parallel sequence alignments, and wavefront algorithms in general, on Cell cluster platforms. Lastly, we contribute and evaluate TOSS, a Throughput-Oriented Sequence Scheduler, which leverages the performance prediction model and dynamically partitions the available processing elements to align multiple sequences simultaneously. This scheme aligns more sequences per unit time, with an improvement of 33.5% over a naive first-come, first-served (FCFS) scheduler. nVIDIA GPGPU: We parallelize Smith-Waterman on the GPGPU by optimizing the code in stages, including optimal data layout strategies, coalesced memory accesses, and blocked data decomposition techniques. Results show that our methods provide a maximum speedup of 3.6 on the nVIDIA GPGPU compared to a naive implementation of Smith-Waterman.
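For reference, here is the sequential kernel being parallelized. Smith-Waterman fills a dynamic programming matrix in which cells on the same anti-diagonal are independent, which is exactly the wavefront parallelism exploited across SPEs and GPU threads; the scoring parameters below are illustrative.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment score between strings a and b."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            # H[i][j] depends only on the up/left/diagonal neighbors, so every
            # cell on an anti-diagonal (i + j constant) can be computed in parallel
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACACACTA", "AGCACACA"))  # 12 with these parameters
```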
- Harpocrates: Privacy-Preserving and Immutable Audit Log for Sensitive Data Operations. Thazhath, Mohit Bhasi (Virginia Tech, 2022-06-10). The immutability, validity, and confidentiality of an audit log are crucial when operating over sensitive data in compliance with standard data regulations (e.g., HIPAA). Despite these critical needs, state-of-the-art privacy-preserving audit log schemes (e.g., Ghostor (NSDI '20), Calypso (VLDB '19)) do not fully achieve privacy, integrity, and immutability simultaneously; certain information (e.g., user identities) is still leaked in the log. In this work, we propose Harpocrates, a new privacy-preserving and immutable audit log scheme. Harpocrates permits data store, share, and access operations to be recorded in the audit log without leaking sensitive information (e.g., data identifiers, user identities), while permitting the validity of data operations to be publicly verifiable. Harpocrates uses blockchain techniques to achieve immutability and avoid a single point of failure, while cryptographic zero-knowledge proofs are harnessed for confidentiality and public verifiability. We analyze the security of our proposed technique and prove that it achieves non-malleability and indistinguishability. We fully implemented Harpocrates and evaluated its performance on a real blockchain system (Hyperledger Fabric) deployed on a commodity platform (Amazon EC2). Experimental results demonstrate that Harpocrates is highly scalable and achieves practical performance.
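Only the immutability ingredient is easy to show in a few lines. The sketch below is ours (Harpocrates additionally uses zero-knowledge proofs and a blockchain, neither shown): entries are hash-chained so that altering any record invalidates every later hash.

```python
import hashlib, json

def append(log, operation):
    prev = log[-1]["hash"] if log else "0" * 64
    entry = {"op": operation, "prev": prev}
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()).hexdigest()
    log.append(entry)

def verify(log):
    prev = "0" * 64
    for e in log:
        body = {"op": e["op"], "prev": e["prev"]}
        ok = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest() == e["hash"]
        if not ok or e["prev"] != prev:
            return False
        prev = e["hash"]
    return True

log = []
append(log, "store(doc1)"); append(log, "share(doc1, bob)")
log[0]["op"] = "store(doc2)"                 # tamper with history...
print(verify(log))                            # ...and verification fails: False
```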
- iLORE: A Data Schema for Aggregating Disparate Sources of Computer System and Benchmark Information. Hardy, Nicolas Randell (Virginia Tech, 2021-06-08). The era of modern computing has been the stage for numerous innovations, leading to cutting-edge applications and systems. The characteristics of these systems and applications have been described and quantified by many; however, such information is fragmented across various repositories of system and component information. In an effort to collate these disparate collections of information, we propose iLORE, an extensible data framework for representing computer systems and their components. We describe the iLORE framework and the pipeline used to aggregate, clean, and insert system and component information into a database built on iLORE's framework. Additionally, we demonstrate how the database can be used to analyze trends in computing by validating the collected data against previous works and by showcasing new analyses created with said data. Analyses and visualizations created via iLORE are available at csgenome.org.
- iLORE: Discovering a Lineage of Microprocessors. Furman, Samuel Lewis (Virginia Tech, 2021-06-29). Researchers, benchmarking organizations, and hardware manufacturers maintain repositories of computer component and performance information. However, this data is split across many isolated sources and is stored in forms that are not conducive to analysis. A centralized repository of this data would arm stakeholders across industry and academia with a tool to understand the history of computing more quantitatively. We propose iLORE, a data model designed to represent the intricate relationships between computer system benchmarks and computer components. We detail the methods used to implement and populate the iLORE data model using data harvested from publicly available sources. Finally, we demonstrate the validity and utility of our iLORE implementation through an analysis of the characteristics and lineage of commercial microprocessors. We encourage the research community to interact with our data and visualizations at csgenome.org.
- Improving the Efficiency of Parallel Applications on Multithreaded and Multicore Systems. Curtis-Maury, Matthew (Virginia Tech, 2008-03-19). The scalability of parallel applications executing on multithreaded and multicore multiprocessors is often quite limited due to large degrees of contention for shared resources on these systems. In fact, negative scalability frequently occurs, such that a non-negligible performance loss is observed when using more processors and cores. In this dissertation, we present a prediction model for identifying efficient operating points of concurrency in multithreaded scientific applications, with performance as the primary objective and power as a secondary one. We also present a runtime system that uses live analysis of hardware event rates through the prediction model to optimize applications dynamically. We discuss a dynamic, phase-aware performance prediction model (DPAPP) that combines statistical learning techniques, including multivariate linear regression and artificial neural networks, with runtime analysis of data collected from hardware event counters to locate optimal operating points of concurrency. We find that the scalability model achieves accuracy approaching 95%, sufficient to identify improved concurrency levels and thread placements from within real parallel scientific applications. Using DPAPP, we develop a prediction-driven runtime optimization scheme, called ACTOR, which throttles concurrency so that power consumption is reduced and performance is set at the knee of the scalability curve of each parallel execution phase in an application. ACTOR successfully identifies and exploits program phases where limited scalability results in a performance loss from using more processing elements, providing simultaneous reductions in execution time of 5%-18% and in power consumption of 0%-11% across a variety of parallel applications and architectures. Further, we extend DPAPP and ACTOR to support runtime adaptation of DVFS, allowing the synergistic exploitation of concurrency throttling and DVFS from within a single, autonomically acting library and providing improved energy efficiency compared to either approach in isolation.
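The flavor of such a predictor can be sketched with ordinary least squares standing in for the regression/ANN models described above (our toy, with made-up event rates): fit speedup against hardware event features, then query candidate thread counts before committing to one.

```python
import numpy as np

# rows: [IPC, cache miss rate, threads]; targets: measured speedups (made up)
X = np.array([[1.2, 0.02, 2], [1.1, 0.05, 4], [0.9, 0.09, 8], [0.6, 0.15, 16]])
y = np.array([1.9, 3.4, 5.1, 4.2])           # note the drop-off at 16 threads
A = np.hstack([X, np.ones((len(X), 1))])     # append an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(ipc, miss_rate, threads):
    return np.array([ipc, miss_rate, threads, 1.0]) @ coef

# query a candidate operating point before throttling to it
print(predict(0.8, 0.11, 12))
```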
- Interpolants, Error Bounds, and Mathematical Software for Modeling and Predicting Variability in Computer Systems. Lux, Thomas Christian Hansen (Virginia Tech, 2020-09-23). Function approximation is an important problem. This work presents applications of interpolants to modeling random variables: specifically, predicting distributions of random variables as applied to computer system throughput variability. Existing approximation methods, including multivariate adaptive regression splines, support vector regressors, multilayer perceptrons, Shepard variants, and the Delaunay mesh, are investigated in the context of computer variability modeling. New approximation methods using Box splines, Voronoi cells, and Delaunay triangulations for interpolating distributions of data of moderately high dimension are presented and compared with existing approaches. Novel theoretical error bounds are constructed for piecewise linear interpolants over functions with a Lipschitz continuous gradient. Finally, mathematical software that constructs monotone quintic spline interpolants for distribution approximation from data samples is proposed.
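One classical bound of the kind referred to above (stated here for a single interval of width h, not as the dissertation's exact theorem): if f has a gradient that is Lipschitz continuous with constant γ, its linear interpolant p on [a, b] satisfies

```latex
\max_{x \in [a,b]} |f(x) - p(x)| \;\le\; \frac{\gamma h^2}{8}, \qquad h = b - a .
```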
- LACE: An Interactive Cluster of Tablet Computers and Kinetic Sculpture to Educate General Audiences on Distributed Blockchain Technologies. Jones, Eles (Virginia Tech, 2022-09-20). Blockchain technologies and cryptocurrency have made a significant impact on today's computing and financial sectors, and the use cases for blockchain applications are increasing day by day. However, there is little understanding of blockchain and cryptocurrencies among the general public. In this work, we present LACE, a kinetic sculpture and decentralized ledger created to educate audiences on the complexities of cryptocurrency creation in a visual form. We discuss the design and implementation of LACE as a modular system of 10 kinetic units, each containing an array of Microsoft Surface tablets and one delta robot arm that performs touch-based operations on each tablet with a modified stylus. Through this structure, we establish a distributed computing system in which the tablets act as blockchain nodes that maintain copies of the blockchain, mine new blocks, and process transactions through visual software interfaces. Additionally, we implement an interactive gaming module to help audiences understand blockchain creation and the mining process. Finally, we evaluate LACE's effectiveness in teaching audiences through a detailed questionnaire administered at the 2022 Accelerate Festival in Washington, DC. We found that 73% of visitors agreed they learned something new from LACE, and 82% enjoyed their interaction with it.
- Managing Memory for Power, Performance, and Thermal Efficiency. Tolentino, Matthew Edward (Virginia Tech, 2009-02-18). Extraordinary improvements in computing performance, density, and capacity have driven rapid increases in system energy consumption, motivating the need for energy-efficient performance. Harnessing the collective computational capacity of thousands of these systems can consume megawatts of electrical power, even though many systems may be underutilized for extended periods. At scale, powering and cooling unused or lightly loaded systems can waste millions of dollars annually. To combat this inefficiency, we propose system software, control systems, and architectural techniques to improve the energy efficiency of high-capacity memory systems while preserving performance. We introduce and discuss several new application-transparent memory management algorithms, as well as a formal analytical model, rooted in classical control theory, of a power-state control system we developed to scale memory capacity proportionally with application demand. We present a prototype implementation of this control-theoretic runtime system and evaluate it on sequential memory systems. We also explain why the traditional, performance-motivated approach of maximizing interleaving within memory systems is problematic and should be revisited in terms of power and thermal efficiency. We then present power-aware control techniques for improving the energy efficiency of symmetrically interleaved memory systems. Given the limitations of traditional interleaved memory configurations, we propose and evaluate unorthodox, asymmetrically interleaved memory configurations. We show that, when coupled with our control techniques, significant energy savings can be achieved without sacrificing application performance or memory bandwidth.
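The proportional capacity-scaling idea reduces, in caricature, to a controller that keeps just enough memory online to cover demand plus headroom, letting the rest enter low-power states. A toy control step (ours, with hypothetical parameters):

```python
import math

def next_online_ranks(demand_mb, rank_mb=1024, headroom=0.2, max_ranks=16):
    # cover demand plus a safety margin; offlined ranks can be powered down
    needed = math.ceil(demand_mb * (1 + headroom) / rank_mb)
    return max(1, min(max_ranks, needed))

print(next_online_ranks(3000))  # ceil(3600/1024) = 4 ranks stay online
```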
- Measuring, modeling, and optimizing counterintuitive performance phenomena in power-scalable, parallel systems. Chang, Hung-Ching (Virginia Tech, 2015-04-09). The demands of exascale computing systems and applications have pushed for a rapid, continual design paradigm coupled with increasing design complexity from the interaction between the application, the middleware, and the underlying system hardware, forming a breeding ground for inefficiency. This work seeks to improve system efficiency by exposing the root causes of unexpected performance slowdowns (e.g., lower performance at higher processor speeds) that occur more frequently in power-scalable systems where raw processor speed varies. We perform an exhaustive empirical study that conclusively shows that increasing processor speed often reduces performance and wastes energy. Our experiments show that the frequency and magnitude of slowdowns grow with clock frequency and parallelism, indicating that such slowdowns will be observed increasingly often given trends in processor and system design. Performance speedups at lower frequencies (or slowdowns at higher frequencies) have been anecdotally observed in the literature since 2004, but no research has explained or exploited this phenomenon. This work demonstrates that performance slowdowns during processor speedup phases can exceed 47% in common I/O workloads. Our hypothesis challenges (and ultimately debunks) a fundamental assumption in computer systems: that faster processor speeds yield the same or better performance. Using code and kernel instrumentation, exhaustive experiments, and deep insight into the inner workings of the Linux I/O subsystem, I overcome the aforementioned challenges of variance, complexity, and nondeterminism, and identify I/O resource contention as the root cause of the slowdowns during processor speedup. Specifically, such contention arises in the Linux kernel when the journaling block device (JBD) interacts with the ext3/4 file system, introducing file write delays and file synchronization delays. To explain how this I/O contention causes the performance anomaly, I propose analytical models of resource contention among I/O threads that describe the root cause of the observed I/O slowdowns when processors speed up. To this end, I introduce LUC, a runtime system that limits the unintended consequences of power scaling, and demonstrate its effectiveness for two critical parallel transaction-oriented workloads: a mail server (varMail) and online transaction processing (oltp).