Browsing by Author "Blagojevic, Filip"
Now showing 1 - 5 of 5
Results Per Page
Sort Options
- Dynamic Multigrain Parallelization on the Cell Broadband EngineBlagojevic, Filip; Nikolopoulos, Dimitrios S.; Stamatakis, Alexandros; Antonopoulos, Christos D. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2006)This paper addresses the problem of orchestrating and scheduling parallelism at multiple levels of granularity on heterogeneous multicore processors. We present policies and mechanisms for adaptive exploitation and scheduling of multiple layers of parallelism on the Cell Broadband Engine. Our policies combine event-driven task scheduling with malleable loop-level parallelism, which is exposed from the runtime system whenever task-level parallelism leaves cores idle. We present a runtime system for scheduling applications with layered parallelism on Cell and investigate its potential with RAxML, a computational biology application which infers large phylogenetic trees, using the Maximum Likelihood (ML) method. Our experiments show that the Cell benefits significantly from dynamic parallelization methods, that selectively exploit the layers of parallelism in the system, in response to workload characteristics. Our runtime environment outperforms naive parallelization and scheduling based on MPI and Linux by up to a factor of 2.6. We are able to execute RAxML on one Cell four times faster than on a dual-processor system with Hyperthreaded Xeon processors, and 5--10\% faster than on a single-processor system with a dual-core, quad-thread IBM Power5 processor.
- Modeling Multigrain Parallelism on Heterogeneous Multi-core ProcessorsBlagojevic, Filip; Feng, Xizhou; Cameron, Kirk W.; Nikolopoulos, Dimitrios S. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2007)Heterogeneous multi-core processors integrate conventional processing cores with computational accelerators. To maximize performance on these systems, programs must exploit multiple dimensions of parallelism simultaneously, including task-level and data-level parallelism. Unfortunately, parallel program designs with multiple dimensions of parallelism today are ad hoc, resulting in performance that depends heavily on the intuition and skill of the programmer. Formal techniques are needed to optimize parallel program designs. We propose a parallel computational model for steering multi-grain parallelization in heterogeneous multi-core processors. Our model accurately predicts the execution time and scalability of a program using multiple conventional processors and accelerators. The model reveals optimal degrees of multi-dimensional, task-level and data-level concurrency in parallel programs. We use the model to derive mappings of two full computational phylogenetics applications on multi-processors featuring the IBM Cell Broadband Engine.
- Prediction-based Power-Performance Adaptation of Multithreaded Scientific CodesCurtis-Maury, Matthew; Blagojevic, Filip; Antonopoulos, Christos D.; Nikolopoulos, Dimitrios S. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2007)Computing is currently at an inflection point, with the degree of on-chip thread-level parallelism doubling every one to two years. The number of cores has become one of the most important architectural parameters that characterize performance and power-efficiency of a modern microprocessor, and a computer system in general. Concurrency lends itself naturally to allowing a program to trade some of its performance for power savings, by regulating the number of active cores. Unfortunately, in several computing domains, users are unwilling to sacrifice performance to save power. Futhermore, the opportunities for saving power via other means, such as voltage and frequency scaling, may be limited in heavily optimized applications. In this paper, we present a prediction model for identifying energy-efficient operating points of concurrency in well-tuned multithreaded scientific applications, and a runtime system which uses live analysis of hardware event rates through the prediction model, to optimize applications dynamically. The runtime system throttles concurrency so that power consumption can be reduced and performance can be set at the knee of the scalability curve of each parallel execution phase. We present a dynamic, phase-aware performance prediction model (DPAPP), which combines multivariate regression techniques with runtime analysis of data collected from hardware event counters, to locate optimal operating points of concurrency. DPAPP is hardware-aware, in the sense that it takes into account the dimensions of parallelism in the architecture, using distinct predictors and hardware events for each dimension. It is also phase-aware. Using DPAPP, we develop a prediction-driven runtime optimization scheme, which drastically reduces the overhead of searching the optimization space for power-performance efficiency, while achieving near-optimal performance and power savings in real parallel applications.
- RAxML-Cell: Parallel Phylogenetic Tree Inference on the Cell Broadband EngineBlagojevic, Filip; Stamatakis, Alexandros; Antonopoulos, Christos D.; Nikolopoulos, Dimitrios S. (Department of Computer Science, Virginia Polytechnic Institute & State University, 2006)Phylogenetic tree reconstruction is one of the grand challenge problems in Bioinformatics. The search for a best-scoring tree with 50 organisms, under a reasonable optimality criterion, creates a topological search space which is as large as the number of atoms in the universe. Computational phylogeny is challenging even for the most powerful supercomputers. It is also an ideal candidate for benchmarking emerging multiprocessor architectures, because it exhibits various levels of fine and coarse-grain parallelism. In this paper, we present the porting, optimization, and evaluation of RAxML on the Cell Broadband Engine. RAxML is a provably efficient, hill climbing algorithm for computing phylogenetic trees based on the Maximum Likelihood (ML) method. The algorithm uses an embarrassingly parallel search method, which also exhibits data-level parallelism and control parallelism in the computation of the likelihood functions. We present the optimization of one of the currently fastest tree search algorithms, on a real Cell blade prototype. We also investigate problems and present solutions pertaining to the optimization of floating point code, control flow, communication, scheduling, and multi-level parallelization on the Cell.
- Scheduling on Asymmetric ArchitecturesBlagojevic, Filip (Virginia Tech, 2008-05-30)We explore runtime mechanisms and policies for scheduling dynamic multi-grain parallelism on heterogeneous multi-core processors. Heterogeneous multi-core processors integrate conventional cores that run legacy codes with specialized cores that serve as computational accelerators. The term multi-grain parallelism refers to the exposure of multiple dimensions of parallelism from within the runtime system, so as to best exploit a parallel architecture with heterogeneous computational capabilities between its cores and execution units. To maximize performance on heterogeneous multi-core processors, programs need to expose multiple dimensions of parallelism simultaneously. Unfortunately, programming with multiple dimensions of parallelism is to date an ad hoc process, relying heavily on the intuition and skill of programmers. Formal techniques are needed to optimize multi-dimensional parallel program designs. We investigate user- and kernel-level schedulers that dynamically "rightsize" the dimensions and degrees of parallelism on the asymmetric parallel platforms. The schedulers address the problem of mapping application-specific concurrency to an architecture with multiple hardware layers of parallelism, without requiring programmer intervention or sophisticated compiler support. Our runtime environment outperforms the native Linux and MPI scheduling environment by up to a factor of 2.7. We also present a model of multi-dimensional parallel computation for steering the parallelization process on heterogeneous multi-core processors. The model predicts with high accuracy the execution time and scalability of a program using conventional processors and accelerators simultaneously. More specifically, the model reveals optimal degrees of multi-dimensional, task-level and data-level concurrency, to maximize performance across cores. We evaluate our runtime policies as well as the performance model we developed, on an IBM Cell BladeCenter, as well as on a cluster composed of Playstation3 nodes, using two realistic bioinformatics applications.