Computer Science Technical Reports
The Department of Computer Science collection of technical
reports began in 1973. Please use the subject headings listed below for all submissions.
Subject Headings:
- Algorithms
- Big Data
- Bioinformatics
- Computational Biology
- Computational Science and Engineering
- Computer Graphics/Animation
- Computer Science Education
- Computer Systems
- Cyberarts
- Cybersecurity
- Data and Text Mining
- Digital Education
- Digital Libraries
- Discrete Event Simulation
- High Performance Computing
- Human Computer Interaction
- Information Retrieval
- Machine Learning
- Mathematical Programming
- Mathematical Software
- Modeling and Simulation
- Networking
- Numerical Analysis
- Parallel and Distributed Computing
- Problem Solving Environments
- Software Engineering
- Theoretical Computer Science
- Virtual/Augmented Reality
- Visualization
Browse
Browsing Computer Science Technical Reports by Subject "Algorithms"
Now showing 1 - 20 of 47
- Accelerating Data-Serial Applications on Data-Parallel GPGPUs: A Systems Approach. Aji, Ashwin M.; Feng, Wu-chun (Department of Computer Science, Virginia Polytechnic Institute & State University, 2008). The general-purpose graphics processing unit (GPGPU) continues to make significant strides in high-end computing by delivering unprecedented performance at a commodity price. However, the many-core architecture of the GPGPU currently allows only data-parallel applications to extract the full potential out of the hardware. Applications that require frequent synchronization during their execution do not experience much performance gain out of the GPGPU. This is mainly due to the lack of explicit hardware or software support for inter-thread communication across the entire GPGPU chip. In this paper, we design, implement, and evaluate a highly-efficient software barrier that synchronizes all the thread blocks running on an offloaded kernel on the GPGPU without having to transfer execution control back to the host processor. We show that our custom software barrier achieves a three-fold performance improvement over the existing approach, i.e., synchronization via the host processor. To illustrate the aforementioned performance benefit, we parallelize a data-serial application, specifically an optimal sequence-search algorithm called Smith-Waterman (SWat), that requires frequent barrier synchronization across the many cores of the nVIDIA GeForce GTX 280 GPGPU. Our parallelization consists of a suite of optimization techniques: optimal data layout, coalesced memory accesses, and blocked data decomposition. Then, when coupled with our custom software-barrier implementation, we achieve nearly a nine-fold speed-up over the serial implementation of SWat. We also show that our solution delivers 25 faster on-chip execution than the naïve implementation.
- An Adaptive Noise Filtering Algorithm for AVIRIS Data with Implications for Classification Accuracy. Phillips, Rhonda D.; Blinn, Christine E.; Watson, Layne T.; Wynne, Randolph H. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2008). This paper describes a new algorithm used to adaptively filter a remote sensing dataset based on signal-to-noise ratios (SNRs) once the maximum noise fraction (MNF) has been applied. This algorithm uses Hermite splines to calculate the approximate area underneath the SNR curve as a function of band number, and that area is used to place bands into “bins” with other bands having similar SNRs. A median filter with a variable-sized kernel is then applied to each band, with the same size kernel used for each band in a particular bin. The proposed adaptive filters are applied to a hyperspectral image generated by the AVIRIS sensor, and results are given for the identification of three different pine species located within the study area. The adaptive filtering scheme improves image quality as shown by estimated SNRs, and classification accuracies improved by more than 10% on the sample study area, indicating that the proposed methods improve the image quality, thereby aiding in species discrimination. (A generic adaptive-filtering sketch follows this listing.)
- Algorithm XXX: SHEPPACK: Modified Shepard Algorithm for Interpolation of Scattered Multivariate Data. Thacker, William I.; Zhang, Jingwei; Watson, Layne T.; Birch, Jeffrey B.; Iyer, Manjula A.; Berry, Michael W. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2009). Scattered data interpolation problems arise in many applications. Shepard’s method for constructing a global interpolant by blending local interpolants using local-support weight functions usually creates reasonable approximations. SHEPPACK is a Fortran 95 package containing five versions of the modified Shepard algorithm: quadratic (Fortran 95 translations of Algorithms 660, 661, and 798), cubic (Fortran 95 translation of Algorithm 791), and linear variations of the original Shepard algorithm. An option to the linear Shepard code is a statistically robust fit, intended to be used when the data is known to contain outliers. SHEPPACK also includes a hybrid robust piecewise linear estimation algorithm RIPPLE (residual initiated polynomial-time piecewise linear estimation) intended for data from piecewise linear functions in arbitrary dimension m. The main goal of SHEPPACK is to provide users with a single consistent package containing most existing polynomial variations of Shepard’s algorithm. The algorithms target data of different dimensions. The linear Shepard algorithm, robust linear Shepard algorithm, and RIPPLE are the only algorithms in the package that are applicable to arbitrary dimensional data. (A Shepard interpolation sketch follows this listing.)
- Algorithm XXX: VTDIRECT95: Serial and Parallel Codes for the Global Optimization Algorithm DIRECT. He, Jian; Watson, Layne T.; Sosonkina, Masha (Department of Computer Science, Virginia Polytechnic Institute & State University, 2007). VTDIRECT95 is a Fortran 95 implementation of D.R. Jones' deterministic global optimization algorithm called DIRECT, which is widely used in multidisciplinary engineering design, biological science, and physical science applications. The package includes both a serial code and a data-distributed massively parallel code for different problem scales and optimization (exploration vs. exploitation) goals. Dynamic data structures are used to organize local data, handle unpredictable memory requirements, reduce the memory usage, and share the data across multiple processors. The parallel code employs a multilevel functional and data parallelism to boost concurrency and mitigate the data dependency, thus improving the load balancing and scalability. In addition, checkpointing features are integrated into both versions to provide fault tolerance and hot restarts. Important algorithm modifications and design considerations are discussed regarding data structures, parallel schemes, error handling, and portability. Using several benchmark functions and real-world applications, the software is evaluated on different systems in terms of optimization effectiveness, data structure efficiency, parallel performance, and checkpointing overhead. The package organization and usage are also described in detail. (A DIRECT usage sketch follows this listing.)
- Algorithms for Feature Selection in Rank-Order Spaces. Slotta, Douglas J.; Vergara, John Paul C.; Ramakrishnan, Naren; Heath, Lenwood S. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2005). The problem of feature selection in supervised learning situations is considered, where all features are drawn from a common domain and are best interpreted via ordinal comparisons with other features, rather than as numerical values. In particular, each instance is a member of a space of ranked features. This problem is pertinent in electoral, financial, and bioinformatics contexts, where features denote assessments in terms of counts, ratings, or rankings. Four algorithms for feature selection in such rank-order spaces are presented; two are information-theoretic, and two are order-theoretic. These algorithms are empirically evaluated against both synthetic and real-world datasets. The main results of this paper are (i) characterization of relationships and equivalences between different feature selection strategies with respect to the spaces in which they operate, and the distributions they seek to approximate; (ii) identification of computationally simple and efficient strategies that perform surprisingly well; and (iii) a feasibility study of order-theoretic feature selection for large-scale datasets. (A rank-based feature scoring sketch follows this listing.)
- Algorithms for Storytelling. Kumar, Deept; Ramakrishnan, Naren; Helm, Richard F.; Potts, Malcolm (Department of Computer Science, Virginia Polytechnic Institute & State University, 2006). We formulate a new data mining problem called "storytelling" as a generalization of redescription mining. In traditional redescription mining, we are given a set of objects and a collection of subsets defined over these objects. The goal is to view the set system as a vocabulary and identify two expressions in this vocabulary that induce the same set of objects. Storytelling, on the other hand, aims to explicitly relate object sets that are disjoint (and hence, maximally dissimilar) by finding a chain of (approximate) redescriptions between the sets. This problem finds applications in bioinformatics, for instance, where the biologist is trying to relate a set of genes expressed in one experiment to another set, implicated in a different pathway. We outline an efficient storytelling implementation that embeds the CARTwheels redescription mining algorithm in an A* search procedure, using the former to supply next move operators on search branches to the latter. This approach is practical and effective for mining large datasets and, at the same time, exploits the structure of partitions imposed by the given vocabulary. Three application case studies are presented: a study of word overlaps in large English dictionaries, exploring connections between genesets in a bioinformatics dataset, and relating publications in the PubMed index of abstracts. (A storytelling search sketch follows this listing.)
- An Application-Oriented Approach for Accelerating Data-Parallel Computation with Graphics Processing Unit. Ponce, Sean; Jing, Huang; Park, Seung In; Khoury, Chase; Quek, Francis; Cao, Yong (Department of Computer Science, Virginia Polytechnic Institute & State University, 2009-03-01). This paper presents a novel parallelization and quantitative characterization of various optimization strategies for data-parallel computation on a graphics processing unit (GPU) using NVIDIA's new GPU programming framework, Compute Unified Device Architecture (CUDA). CUDA is an easy-to-use development framework that has drawn the attention of many different application areas looking for dramatic speed-ups in their code. However, the performance tradeoffs in CUDA are not yet fully understood, especially for data-parallel applications. Consequently, we study two fundamental mathematical operations that are common in many data-parallel applications: convolution and accumulation. Specifically, we profile and optimize the performance of these operations on a 128-core NVIDIA GPU. We then characterize the impact of these operations on a video-based motion-tracking algorithm called vector coherence mapping, which consists of a series of convolutions and dynamically weighted accumulations, and present a comparison of different implementations and their respective performance profiles.
- An Automated Framework for Characterizing and Subsetting GPGPU Workloads. Adhinarayanan, Vignesh; Feng, Wu-chun (Department of Computer Science, Virginia Polytechnic Institute & State University, 2015-12-18). Graphics processing units (GPUs) are becoming increasingly common in today’s computing systems due to their superior performance and energy efficiency relative to their cost. To further improve these desired characteristics, researchers have proposed several software and hardware techniques. Evaluation of these proposed techniques could be tricky due to the ad-hoc nature in which applications are selected for evaluation. Sometimes researchers spend unnecessary time evaluating redundant workloads, which is particularly problematic for time-consuming studies involving simulation. Other times, they fail to expose the shortcomings of their proposed techniques when too few workloads are chosen for evaluation. To overcome these problems, we propose an automated framework that characterizes and subsets GPGPU workloads, depending on a user-chosen set of performance metrics/counters. This framework internally uses principal component analysis (PCA) to reduce the dimensionality of the chosen metrics and then uses hierarchical clustering to identify similarity among the workloads. In this study, we use our framework to identify redundancy in the recently released SPEC ACCEL OpenCL benchmark suite using a few architecture-dependent metrics. Our analysis shows that a subset of eight applications provides most of the diversity in the 19-application benchmark suite. We also subset the Parboil, Rodinia, and SHOC benchmark suites and then compare them against each other to identify “gaps” in these suites. As an example, we show that SHOC has many applications that are similar to each other and could benefit from adding four applications from Parboil to improve its diversity. (A PCA-and-clustering subsetting sketch follows this listing.)
- Capturing Truthiness: Mining Truth Tables in Binary Datasets. Owens, Clifford Conley; Murali, T. M.; Ramakrishnan, Naren (Department of Computer Science, Virginia Polytechnic Institute & State University, 2007-02-01). We introduce a new data mining problem: mining truth tables in binary datasets. Given a matrix of objects and the properties they satisfy, a truth table identifies a subset of properties that exhibit maximal variability (and hence, complete independence) in occurrence patterns over the underlying objects. This problem is relevant in many domains, e.g., bioinformatics where we seek to identify and model independent components of combinatorial regulatory pathways, and in social/economic demographics where we desire to determine independent behavioral attributes of populations. Besides intrinsic interest in such patterns, we show how the problem of mining truth tables is dual to the problem of mining redescriptions, in that a set of properties involved in a truth table cannot participate in any possible redescription. This allows us to adapt our algorithm to the problem of mining redescriptions as well, by first identifying regions where redescriptions cannot happen, and then pursuing a divide and conquer strategy around these regions. Furthermore, our work suggests dual mining strategies where both classes of algorithms can be brought to bear upon either data mining task. We outline a family of levelwise approaches adapted to mining truth tables, algorithmic optimizations, and applications to bioinformatics and political datasets. (A truth-table mining sketch follows this listing.)
- A Case Study of Using Domain Analysis for the Conflation Algorithms Domain. Yilmaz, Okan; Frakes, William B. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2007). This paper documents the domain engineering process for much of the conflation algorithms domain. Empirical data on the process and products of domain engineering were collected. Six conflation algorithms of four different types were analyzed: three affix removal, one successor variety, one table lookup, and one n-gram. Products of the analysis include a generic architecture, reusable components, a little language, and an application generator that extends the scope of the domain analysis beyond previous generators. The application generator produces source code not only for affix removal stemmers but also for successor variety, table lookup, and n-gram stemmers. The performance of the automatically generated stemmers was compared with that of the manually developed stemmers in terms of stem similarity, source and executable sizes, and development and execution times. All five stemmers produced by the application generator yielded stems more than 99.9% identical to those of the manually developed stemmers. Some of the generated stemmers were as efficient as their manual equivalents, and some were not. (A rule-table stemmer sketch follows this listing.)
- Cell Cycle Modeling for Budding Yeast with Stochastic Simulation Algorithms. Ahn, Tae-Hyuk; Watson, Layne T.; Cao, Yang; Shaffer, Clifford A.; Baumann, William T. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2008-11-01). For biochemical systems, where some chemical species are represented by small numbers of molecules, discrete and stochastic approaches are more appropriate than continuous and deterministic approaches. The continuous deterministic approach using ordinary differential equations is adequate for understanding the average behavior of cells, while the discrete stochastic approach accurately captures noisy events in the growth-division cycle. Since the emergence of the stochastic simulation algorithm (SSA) by Gillespie, alternative algorithms have been developed whose goal is to improve the computational efficiency of the SSA. This paper explains and empirically compares the performance of some of these SSA alternatives on a realistic model. The budding yeast cell cycle provides an excellent example of the need for modeling stochastic effects in mathematical modeling of biochemical reactions. This paper presents a stochastic approximation of the cell cycle for budding yeast using Gillespie’s stochastic simulation algorithm. To compare the stochastic results with the average behavior, the simulation must be run thousands of times. Many of the proposed techniques to accelerate the SSA are not effective on the budding yeast problem, because of the scale of the problem or because underlying assumptions are not satisfied. A load balancing algorithm improved overall performance on a parallel supercomputer. (A Gillespie SSA sketch follows this listing.)
- Clustering for Data Reduction: A Divide and Conquer Approach. Andrews, Nicholas O.; Fox, Edward A. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2007-10-01). We consider the problem of reducing a potentially very large dataset to a subset of representative prototypes. Rather than searching over the entire space of prototypes, we first roughly divide the data into balanced clusters using bisecting k-means and spectral cuts, and then find the prototypes for each cluster by affinity propagation. We apply our algorithm to text data, where it runs an order of magnitude faster than simply looking for prototypes on the entire dataset. Furthermore, our "divide and conquer" approach actually performs more accurately on datasets which are well bisected, as the greedy decisions of affinity propagation are confined to classes of already similar items. (A prototype selection sketch follows this listing.)
- CommAnalyzer: Automated Estimation of Communication Cost on HPC Clusters Using Sequential Code. Helal, Ahmed E.; Jung, Changhee; Feng, Wu-chun; Hanafy, Yasser Y. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2017-08-14). MPI+X is the de facto standard for programming applications on HPC clusters. The performance and scalability on such systems are limited by the communication cost on different numbers of processes and compute nodes. Therefore, current communication analysis tools play a critical role in the design and development of HPC applications. However, these tools require the availability of an MPI implementation of the application, which might not exist in the early stages of the development process due to the parallel programming effort and time required. This paper presents CommAnalyzer, an automated tool for communication model generation from sequential code. CommAnalyzer uses novel compiler analysis techniques and graph algorithms to capture the inherent communication characteristics of sequential applications and to estimate their communication cost on HPC systems. The experiments with real-world, regular and irregular scientific applications demonstrate the utility of CommAnalyzer in estimating the communication cost on HPC clusters with more than 95% accuracy on average.
- A Composable Workflow for Productive FPGA Computing via Whole-Program Analysis and Transformation (with Code Excerpts). Sathre, Paul; Helal, Ahmed E.; Feng, Wu-chun (Department of Computer Science, Virginia Polytechnic Institute & State University, 2018-07-24). We present a composable workflow to enable highly-productive heterogeneous computing on FPGAs. The workflow consists of a trio of static analysis and transformation tools: (1) a whole-program, source-to-source translator to transform existing parallel code to OpenCL, (2) a set of OpenCL kernel linters, which target FPGAs to detect possible semantic errors and performance traps, and (3) a whole-program OpenCL linter to validate the host-to-device interface of OpenCL programs. The workflow promotes rapid realization of heterogeneous parallel code across a multitude of heterogeneous computing environments, particularly FPGAs, by providing complementary tools for automatic CUDA-to-OpenCL translation and compile-time OpenCL validation in advance of very expensive compilation, placement, and routing on FPGAs. The proposed tools perform whole-program analysis and transformation to tackle real-world, large-scale parallel applications. The efficacy of the workflow tools is demonstrated via a representative translation and analysis of a sizable CUDA finite automata processing engine as well as the analysis and validation of an additional 96 OpenCL benchmarks.
- Continuous Iterative Guided Spectral Class Rejection Classification Algorithm: Part 2. Phillips, Rhonda D.; Watson, Layne T.; Wynne, Randolph H.; Ramakrishnan, Naren (Department of Computer Science, Virginia Polytechnic Institute & State University, 2009). This paper describes in detail the continuous iterative guided spectral class rejection (CIGSCR) classification method based on the iterative guided spectral class rejection (IGSCR) classification method for remotely sensed data. Both CIGSCR and IGSCR use semisupervised clustering to locate clusters that are associated with classes in a classification scheme. In CIGSCR and IGSCR, training data are used to evaluate the strength of the association between a particular cluster and a class, and a statistical hypothesis test is used to determine which clusters should be associated with a class and used for classification and which clusters should be rejected and possibly refined. Experimental results indicate that the soft classification output by CIGSCR is reasonably accurate (when compared to IGSCR), and the fundamental algorithmic changes in CIGSCR (from IGSCR) result in CIGSCR being less sensitive to input parameters that influence iterations. Furthermore, evidence is presented that the semisupervised clustering in CIGSCR produces more accurate classifications than classification based on clustering without supervision.
- Continuous Iterative Guided Spectral Class Rejection Classification Algorithm: Part 1. Phillips, Rhonda D.; Watson, Layne T.; Wynne, Randolph H.; Ramakrishnan, Naren (Department of Computer Science, Virginia Polytechnic Institute & State University, 2009). This paper outlines the changes necessary to convert the iterative guided spectral class rejection (IGSCR) classification algorithm to a soft classification algorithm. IGSCR uses a hypothesis test to select clusters to use in classification and iteratively refines clusters not yet selected for classification. Both steps assume that cluster and class memberships are crisp (either zero or one). In order to make soft cluster and class assignments (between zero and one), a new hypothesis test and iterative refinement technique are introduced that are suitable for soft clusters. The new hypothesis test, called the (class) association significance test, is based on the normal distribution, and a proof is supplied to show that the assumption of normality is reasonable. Soft clusters are iteratively refined by creating new clusters using information contained in a targeted soft cluster. Soft cluster evaluation and refinement can then be combined to form a soft classification algorithm, continuous iterative guided spectral class rejection (CIGSCR).
- Convergence analysis of hybrid cellular automata for topology optimization. Penninger, Charles L.; Watson, Layne T.; Tovar, Andres; Renaud, John E. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2009-03-01). The hybrid cellular automaton (HCA) algorithm was inspired by the structural adaptation of bones to their ever-changing mechanical environment. This methodology has been shown to be an effective topology synthesis tool. In previous work, it has been observed that the convergence of the HCA methodology is affected by parameters of the algorithm. As a result, questions have been raised regarding the conditions under which HCA converges to an optimal design. The objective of this investigation is to examine the conditions that guarantee convergence to a Karush-Kuhn-Tucker (KKT) point. In this paper, it is shown that the HCA algorithm is a fixed point iterative scheme and the previously reported KKT optimality conditions are corrected. To demonstrate the convergence properties of the HCA algorithm, a simple cantilevered beam example is utilized. Plots of the spectral radius for projections of the design space are used to show regions of guaranteed convergence. (A fixed-point convergence sketch follows this listing.)
- Design and Evaluation of Scalable Concurrent Queues for Many-Core Architectures. Scogland, Thomas R. W.; Feng, Wu-chun (Department of Computer Science, Virginia Polytechnic Institute & State University, 2014-08-06). As core counts increase and as heterogeneity becomes more common in parallel computing, we face the prospect of programming hundreds or even thousands of concurrent threads in a single shared-memory system. At these scales, even highly-efficient concurrent algorithms and data structures can become bottlenecks, unless they are designed from the ground up with throughput as their primary goal. In this paper, we present three contributions: (1) a characterization of queue designs in terms of modern multi- and many-core architectures, (2) the design of a high-throughput concurrent FIFO queue for many-core architectures that avoids the bottlenecks common in modern queue designs, and (3) a thorough evaluation of concurrent queue throughput across CPU, GPU, and co-processor devices. Our evaluation shows that focusing on throughput, rather than progress guarantees, allows our queue to scale to as much as three orders of magnitude (1000X) faster than lock-free and combining queues on GPU platforms, and two times (2X) faster on CPU devices. These results deliver critical insight into the design of data structures for highly concurrent systems: (1) progress guarantees do not guarantee scalability, and (2) allowing an algorithm to block can actually increase throughput.
- Deterministic Global Optimization of Flapping Wing Motion for Micro Air Vehicles. Ghommem, Mehdi; Hajj, Muhammad R.; Watson, Layne T.; Mook, Dean T.; Snyder, Richard D.; Beran, Philip S. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2010-12-01). The kinematics of a flapping plate is optimized by combining the unsteady vortex lattice method with a deterministic global optimization algorithm. The design parameters are the amplitudes, the mean values, the frequencies, and the phase angles of the flapping motion. The results suggest that imposing a delay between the different oscillatory motions and controlling the way in which the wing rotates at the end of each half stroke would enhance lift generation. The use of a general unsteady numerical aerodynamic model (UVLM) and the implementation of a deterministic global optimization algorithm provide guidance and a baseline for future efforts to identify optimal stroke trajectories for micro air vehicles with higher fidelity models.
- Dynamic Data Structures for a Direct Search Algorithm. He, Jian; Watson, Layne T.; Ramakrishnan, Naren; Shaffer, Clifford A.; Verstak, Alex; Jiang, Jing; Bae, Kyung; Tranter, William H. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2001). The DIRECT (DIviding RECTangles) algorithm of Jones, Perttunen, and Stuckman (1993), a variant of Lipschitzian methods for bound constrained global optimization, has proved effective even in higher dimensions. However, the performance of a DIRECT implementation in real applications depends on the characteristics of the objective function, the problem dimension, and the desired solution accuracy. Implementations with static data structures often fail in practice, since it is difficult to predict memory resource requirements in advance. This is especially critical in multidisciplinary engineering design applications, where the DIRECT optimization is just one small component of a much larger computation, and any component failure aborts the entire design process. To make the DIRECT global optimization algorithm efficient and robust on large-scale, multidisciplinary engineering problems, a set of dynamic data structures is proposed here to balance the memory requirements with execution time, while simultaneously adapting to arbitrary problem size. The focus of this paper is on design issues of the dynamic data structures, and related memory management strategies. Numerical computing techniques and modifications of Jones’ original DIRECT algorithm in terms of stopping rules and box selection rules are also explored. Performance studies are done for synthetic test problems with multiple local optima. Results for application to a site-specific system simulator for wireless communications systems (S4W) are also presented to demonstrate the effectiveness of the proposed dynamic data structures for an implementation of DIRECT.
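Illustrative Sketches
The short Python sketches below illustrate, in deliberately simplified form, some of the algorithmic ideas described in the reports above. They are generic illustrations under stated assumptions, not code taken from the reports.

For "An Adaptive Noise Filtering Algorithm for AVIRIS Data," the core idea is to bin bands by the area under the SNR-versus-band curve and to median-filter each bin with its own kernel size. A minimal sketch of that idea follows, assuming a (bands, rows, cols) image cube and a per-band SNR estimate; the spline choice (PCHIP as a Hermite-type spline), the bin edges, and the kernel sizes are illustrative rather than the paper's.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator   # a Hermite-type spline
from scipy.ndimage import median_filter

def adaptive_median_filter(cube, snr, kernel_sizes=(1, 3, 5, 7)):
    """Median-filter each band of `cube` (bands, rows, cols) with a kernel size
    chosen from the area under the SNR-vs-band curve up to that band."""
    bands = np.arange(len(snr))
    spline = PchipInterpolator(bands, snr)                     # smooth SNR(band)
    area = np.array([spline.integrate(0, b) for b in bands])   # cumulative area
    # bands with similar cumulative area land in the same bin
    edges = np.quantile(area, np.linspace(0, 1, len(kernel_sizes) + 1))
    bins = np.clip(np.digitize(area, edges[1:-1]), 0, len(kernel_sizes) - 1)
    out = np.empty(cube.shape, dtype=float)
    for b in bands:
        k = kernel_sizes[bins[b]]
        # a size-1 kernel leaves high-SNR bands untouched
        out[b] = cube[b] if k == 1 else median_filter(cube[b], size=k)
    return out

# toy usage: 20 noisy bands of a 64 x 64 scene, SNR degrading with band number
rng = np.random.default_rng(0)
cube = rng.normal(size=(20, 64, 64))
snr = np.linspace(30.0, 2.0, 20)
filtered = adaptive_median_filter(cube, snr)
print(filtered.shape)
```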
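SHEPPACK blends local nodal functions with local-support weight functions. The sketch below is a basic Shepard-style interpolant with local linear fits and Franke-Little weights; it only illustrates the blending idea and is not a translation of the Fortran 95 package (the radius handling, robustness options, and quadratic and cubic variants are omitted).

```python
import numpy as np

def linear_shepard(x_nodes, f_nodes, x_eval, radius):
    """Shepard-style interpolant: blend local linear fits with
    Franke-Little weights W_k(x) = ((R - d_k)_+ / (R * d_k))**2."""
    n, m = x_nodes.shape
    # local linear nodal functions P_k(x) = f_k + g_k . (x - x_k)
    grads = np.zeros((n, m))
    for k in range(n):
        d = np.linalg.norm(x_nodes - x_nodes[k], axis=1)
        nbr = (d > 0) & (d < radius)
        if nbr.sum() >= m:                           # enough neighbours for a fit
            A = x_nodes[nbr] - x_nodes[k]
            b = f_nodes[nbr] - f_nodes[k]
            w = (radius - d[nbr]) / (radius * d[nbr])
            grads[k], *_ = np.linalg.lstsq(A * w[:, None], b * w, rcond=None)
    out = np.empty(len(x_eval))
    for i, x in enumerate(x_eval):
        d = np.linalg.norm(x_nodes - x, axis=1)
        if np.any(d == 0):                           # exactly on a node
            out[i] = f_nodes[np.argmin(d)]
            continue
        w = (np.clip(radius - d, 0, None) / (radius * d)) ** 2
        if w.sum() == 0:                             # no node in range: global fallback
            w = 1.0 / d ** 2
        p = f_nodes + np.einsum('kj,kj->k', grads, x - x_nodes)
        out[i] = np.dot(w, p) / w.sum()
    return out

# toy usage: interpolate f(x, y) = sin(x) + y from 200 scattered samples
rng = np.random.default_rng(1)
pts = rng.uniform(0, 3, size=(200, 2))
vals = np.sin(pts[:, 0]) + pts[:, 1]
print(linear_shepard(pts, vals, np.array([[1.5, 1.0]]), radius=1.0))
```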
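VTDIRECT95 itself is a Fortran 95 package, but the DIRECT algorithm it implements is easy to experiment with: SciPy 1.9 and later ship a serial implementation as scipy.optimize.direct. The usage sketch below runs it on the Griewank test function; it says nothing about VTDIRECT95's parallel or checkpointing features.

```python
import numpy as np
from scipy.optimize import direct, Bounds

def griewank(x):
    """Griewank test function: many local minima, global minimum 0 at the origin."""
    x = np.asarray(x)
    return 1 + np.sum(x**2) / 4000 - np.prod(np.cos(x / np.sqrt(np.arange(1, x.size + 1))))

# DIRECT needs only box bounds and a budget of function evaluations;
# asymmetric bounds keep the first sample (the box center) off the optimum.
result = direct(griewank, Bounds([-60.0, -60.0], [40.0, 40.0]), maxfun=20000)
print(result.x, result.fun)   # should approach the global minimum at the origin
```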
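The rank-order feature selection report presents two information-theoretic and two order-theoretic selectors. The fragment below is only a generic information-theoretic baseline in that spirit: each instance's features are replaced by their within-instance ranks, and features are then scored by mutual information with the class label. The paper's four algorithms are not reproduced here.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.feature_selection import mutual_info_classif

def select_by_rank_mi(X, y, k=5, random_state=0):
    """Rank-transform each instance's features, then keep the k features
    with the highest mutual information with the label."""
    # within-instance ranks: each row becomes an ordering of its own features
    R = np.apply_along_axis(rankdata, 1, X)
    mi = mutual_info_classif(R, y, discrete_features=False, random_state=random_state)
    keep = np.argsort(mi)[::-1][:k]
    return keep, mi[keep]

# toy usage: label depends on whether feature 0 outranks feature 1 within an instance
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
y = (X[:, 0] > X[:, 1]).astype(int)
keep, scores = select_by_rank_mi(X, y, k=3)
print(keep, scores)
```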
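Storytelling searches for a chain of approximate redescriptions linking two disjoint object sets. The toy A* below chains between sets drawn from a fixed vocabulary of frozensets, requiring consecutive sets to overlap by at least a Jaccard threshold theta and using Jaccard distance to the target as the heuristic; CARTwheels, which generates the candidate moves in the report, is not modeled.

```python
import heapq
import itertools

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def storytelling_chain(start, goal, vocabulary, theta=0.5):
    """A* search for a chain start -> ... -> goal over `vocabulary` (frozensets)
    in which consecutive sets have Jaccard similarity >= theta."""
    start, goal = frozenset(start), frozenset(goal)
    h = lambda s: 1.0 - jaccard(s, goal)                 # distance-to-goal heuristic
    tie = itertools.count()
    frontier = [(h(start), next(tie), 0, start, [start])]
    best = {start: 0}
    while frontier:
        _, _, cost, current, path = heapq.heappop(frontier)
        if jaccard(current, goal) >= theta:
            return path + [goal]                         # final hop reaches the target set
        for nxt in vocabulary:
            if nxt != current and jaccard(current, nxt) >= theta:
                c = cost + 1                             # each redescription step costs 1
                if c < best.get(nxt, float("inf")):
                    best[nxt] = c
                    heapq.heappush(frontier, (c + h(nxt), next(tie), c, nxt, path + [nxt]))
    return None                                          # no chain under this threshold

# toy usage: relate two disjoint sets through overlapping vocabulary sets
vocab = [frozenset(s) for s in ({1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {4, 5, 6}, {5, 6, 7}, {6, 7, 8})]
print(storytelling_chain({1, 2, 3}, {6, 7, 8}, vocab, theta=0.4))
```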
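The workload-subsetting framework reduces the metric space with PCA and then clusters workloads hierarchically, keeping one representative per cluster. A small generic version of that pipeline is sketched below; the counter matrix, the retained-variance fraction, and the cluster count are placeholders rather than the paper's settings.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, fcluster

def subset_workloads(metrics, names, n_subsets=8, variance=0.9):
    """metrics: (workloads, counters) matrix. Returns one representative workload
    name per cluster, chosen as the point closest to its cluster mean."""
    z = StandardScaler().fit_transform(metrics)           # counters live on different scales
    pcs = PCA(n_components=variance).fit_transform(z)     # keep 90% of the variance
    labels = fcluster(linkage(pcs, method="ward"), t=n_subsets, criterion="maxclust")
    reps = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        center = pcs[members].mean(axis=0)
        reps.append(names[members[np.argmin(np.linalg.norm(pcs[members] - center, axis=1))]])
    return reps

# toy usage: 19 workloads described by 12 hypothetical performance counters
rng = np.random.default_rng(3)
metrics = rng.normal(size=(19, 12))
names = [f"app{i}" for i in range(19)]
print(subset_workloads(metrics, names, n_subsets=8))
```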
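A truth table in the sense of "Capturing Truthiness" is a set of k binary properties whose joint occurrence pattern over the objects covers all 2^k combinations. Because that property is downward closed (every subset of a truth table is itself a truth table), a levelwise, Apriori-style enumeration applies; the sketch below does exactly that and omits the report's optimizations and the duality with redescriptions.

```python
import numpy as np

def is_truth_table(D, cols):
    """True if the binary columns `cols` of D realize all 2^k row patterns."""
    rows = {tuple(r) for r in D[:, cols]}
    return len(rows) == 2 ** len(cols)

def mine_truth_tables(D, max_k=3):
    """Levelwise search: a k-set can only be a truth table if every (k-1)-subset is."""
    n_props = D.shape[1]
    found = {1: [(j,) for j in range(n_props) if is_truth_table(D, [j])]}
    for k in range(2, max_k + 1):
        prev = set(found[k - 1])
        candidates = set()
        for small in found[k - 1]:
            for j in range(small[-1] + 1, n_props):
                cand = small + (j,)
                if all(tuple(sorted(set(cand) - {c})) in prev for c in cand):
                    candidates.add(cand)
        found[k] = [c for c in candidates if is_truth_table(D, list(c))]
    return found

# toy usage: 3 independent random properties plus one that copies property 0
rng = np.random.default_rng(4)
D = rng.integers(0, 2, size=(200, 3))
D = np.hstack([D, D[:, :1]])               # column 3 duplicates column 0
print(mine_truth_tables(D, max_k=3))
```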
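The conflation-algorithms case study drives an application generator from a little language describing stemmers. To give a flavor of that approach, the sketch below builds an affix-removal stemmer from a toy rule table written in a made-up one-line-per-rule format; the rules and the format are purely illustrative and far simpler than the paper's generator or Porter-style stemmers.

```python
def make_affix_stemmer(rules):
    """Build an affix-removal stemmer from (suffix, replacement, min_stem_length) rules.
    Rules are tried in order; the first applicable one fires."""
    def stem(word):
        for suffix, repl, min_len in rules:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
                return word[: len(word) - len(suffix)] + repl
        return word
    return stem

# a toy "little language": one rule per line, fields separated by '/'
SPEC = """
sses/ss/1
ies/i/2
ing//3
ed//3
s//2
"""
rules = [line.split("/") for line in SPEC.split() if line]
rules = [(suf, repl, int(n)) for suf, repl, n in rules]
stemmer = make_affix_stemmer(rules)
print([stemmer(w) for w in ["caresses", "ponies", "running", "jumped", "cats"]])
```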
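For readers unfamiliar with Gillespie's direct method, on which the cell-cycle study builds, here is a self-contained version applied to a toy birth-death system; the budding yeast model itself involves many more species and reactions and is not reproduced.

```python
import numpy as np

def gillespie_direct(x0, stoich, propensities, t_end, rng):
    """Gillespie's direct SSA.
    x0: initial counts, stoich: (reactions, species) state-change matrix,
    propensities: function mapping the state to per-reaction rates."""
    t, x = 0.0, np.array(x0, dtype=float)
    times, states = [t], [x.copy()]
    while t < t_end:
        a = propensities(x)
        a0 = a.sum()
        if a0 <= 0:                       # no reaction can fire
            break
        t += rng.exponential(1.0 / a0)    # time to the next reaction
        j = rng.choice(len(a), p=a / a0)  # which reaction fires
        x += stoich[j]
        times.append(t)
        states.append(x.copy())
    return np.array(times), np.array(states)

# toy usage: birth-death process, dX/dt ~ k_birth - k_death * X on average
k_birth, k_death = 10.0, 0.1
stoich = np.array([[+1], [-1]])
props = lambda x: np.array([k_birth, k_death * x[0]])
rng = np.random.default_rng(5)
t, xs = gillespie_direct([0], stoich, props, t_end=100.0, rng=rng)
print(xs[-1])                             # should fluctuate around k_birth / k_death = 100
```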
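Clustering for Data Reduction first divides the data into balanced clusters and then runs affinity propagation inside each cluster. A compact scikit-learn version of that two-stage idea appears below; it uses BisectingKMeans (available in scikit-learn 1.1 and later) for the split and omits the spectral-cut variant from the report.

```python
import numpy as np
from sklearn.cluster import BisectingKMeans, AffinityPropagation

def divide_and_conquer_prototypes(X, n_splits=8, random_state=0):
    """Split X into n_splits coarse clusters, then let affinity propagation pick
    exemplars (prototypes) inside each cluster. Returns prototype row indices."""
    coarse = BisectingKMeans(n_clusters=n_splits, random_state=random_state).fit_predict(X)
    prototypes = []
    for c in range(n_splits):
        idx = np.where(coarse == c)[0]
        ap = AffinityPropagation(random_state=random_state).fit(X[idx])
        # cluster_centers_indices_ are exemplar positions within this split
        prototypes.extend(idx[ap.cluster_centers_indices_])
    return np.sort(np.array(prototypes))

# toy usage: reduce 2000 points to a much smaller exemplar set
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(250, 5)) for c in range(8)])
protos = divide_and_conquer_prototypes(X, n_splits=8)
print(len(protos), "prototypes out of", len(X), "points")
```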
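The HCA convergence analysis treats the update as a fixed point iteration x_{k+1} = G(x_k), whose local convergence is governed by the spectral radius of the Jacobian of G at the fixed point. The generic check below estimates that Jacobian by finite differences for an arbitrary update map; it is not tied to the HCA update rules or to the topology optimization setting.

```python
import numpy as np

def spectral_radius_at_fixed_point(G, x_star, eps=1e-6):
    """Finite-difference Jacobian of the update map G at x_star, and its spectral
    radius. The iteration x_{k+1} = G(x_k) converges locally iff the radius < 1."""
    n = len(x_star)
    J = np.empty((n, n))
    gx = G(x_star)
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (G(x_star + e) - gx) / eps
    return np.max(np.abs(np.linalg.eigvals(J)))

# toy usage: damped update toward the solution of Ax = b, G(x) = x + w(b - Ax)
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
w = 0.2
G = lambda x: x + w * (b - A @ x)
x_star = np.linalg.solve(A, b)            # fixed point of G
rho = spectral_radius_at_fixed_point(G, x_star)
print("spectral radius:", rho, "| converges locally:", rho < 1)
```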
- «
- 1 (current)
- 2
- 3
- »