Browsing by Author "Gondhalekar, Atharva"
- Characterization of Sparsity-aware Optimization Paths for Graph Traversal on FPGA
  Gondhalekar, Atharva (Virginia Tech, 2023-05-25)
  Breadth-first search (BFS) is a fundamental building block in many graph-based applications, but it is difficult to optimize for a field-programmable gate array (FPGA) due to its irregular memory-access patterns. Prior work, based on hardware description languages (HDLs) and high-level synthesis (HLS), addresses the memory-access bottleneck of BFS by using techniques such as data alignment and compute-unit replication on FPGAs. The efficacy of such optimizations depends on factors such as the sparsity of target graph datasets. Optimizations intended for sparse graphs may not work as effectively for dense graphs on an FPGA and vice versa. This thesis presents two sets of FPGA optimization strategies for BFS, one for near-hypersparse graphs and the other designed for sparse to moderately dense graphs. For near-hypersparse graphs, a queue-based kernel with maximal use of local memory on FPGA is implemented. For denser graphs, an array-based kernel with compute-unit replication is implemented. Across a diverse collection of graphs, our OpenCL optimization strategies for near-hypersparse graphs deliver a 5.7x to 22.3x speedup over a state-of-the-art OpenCL implementation, when evaluated on an Intel Stratix 10 FPGA. The optimization strategies for sparse to moderately dense graphs deliver a 1.1x to 2.3x speedup over a state-of-the-art OpenCL implementation on the same FPGA. Finally, this work uses graph metrics such as average degree and Gini coefficient to observe the impact of graph properties on the performance of the proposed optimization strategies.
- ComputeCOVID19+: Accelerating COVID-19 Diagnosis and Monitoring via High-Performance Deep Learning on CT Images
  Goel, Garvit; Gondhalekar, Atharva; Qi, Jingyuan; Zhang, Zhicheng; Cao, Guohua; Feng, Wu-chun (ACM, 2021-10-05)
  The COVID-19 pandemic has highlighted the importance of diagnosis and monitoring as early and accurately as possible. However, the reverse-transcription polymerase chain reaction (RT-PCR) test results in two issues: (1) protracted turnaround time from sample collection to testing result and (2) compromised test accuracy, as low as 67%, due to when and how the samples are collected, packaged, and delivered to the lab to conduct the RT-PCR test. Thus, we present ComputeCOVID19+, our computed tomography-based framework to improve the testing speed and accuracy of COVID-19 (plus its variants) via a deep learning-based network for CT image enhancement called DDnet, short for DenseNet and Deconvolution network. To demonstrate its speed and accuracy, we evaluate ComputeCOVID19+ across several sources of computed tomography (CT) images and on many heterogeneous platforms, including multi-core CPU, many-core GPU, and even FPGA. Our results show that ComputeCOVID19+ can significantly shorten the turnaround time from days to minutes and improve the testing accuracy to 91%.
- Edge-Connected Jaccard Similarity for Graph Link Prediction on FPGA
  Sathre, Paul; Gondhalekar, Atharva; Feng, Wu-chun (IEEE, 2022-01-01)
  Graph analysis is a critical task in many fields, such as social networking, epidemiology, bioinformatics, and fraud detection. In particular, understanding and inferring relationships between graph elements lies at the core of many graph-based workloads. Real-world graph workloads and their associated data structures create irregular computational patterns that complicate the realization of high-performance kernels. Given these complications, there does not exist a de facto 'best' architecture, language, or algorithmic approach that simultaneously balances performance, energy efficiency, portability, and productivity. In this paper, we realize different algorithms of edge-connected Jaccard similarity for graph link prediction and characterize their performance across a broad spectrum of graphs on an Intel Stratix 10 FPGA. By utilizing a high-level synthesis (HLS)-driven, high-productivity approach (via the C++-based SYCL language) we rapidly prototype two implementations - a from-scratch edge-centric version and a faithfully-ported commodity GPU implementation - which would have been intractable via a hardware description language. With these implementations, we further consider the benefit and necessity of four HLS-enabled optimizations, both in isolation and in concert - totaling seven distinct synthesized hardware pipelines. Leveraging real-world graphs of up to 516 million edges, we show empirically-measured speedups of up to 9.5x over the initial HLS implementations when all optimizations work in concert.
- On the Characterization of the Performance-Productivity Gap for FPGA
  Gondhalekar, Atharva; Twomey, Thomas; Feng, Wu-chun (IEEE, 2022)
  Today, FPGA vendors provide a C++/C-based programming environment to enhance programmer productivity over using a hardware-description language at the register-transfer level. The common perception is that this enhanced productivity comes at the expense of significantly less performance, e.g., as much as an order of magnitude worse. To characterize this performance-productivity tradeoff, we propose a new composite metric, II, that quantitatively captures the perceived discrepancy between the performance and productivity of any two given FPGA programming languages, e.g., Verilog vs. OpenCL. We then present the implications of our metric via a case study on the design of a Sobel filter (i.e., edge detector) using three different programming models - Verilog, OpenCL, oneAPI - on an Intel Arria 10 GX FPGA accelerator. Relative to performance, our results show that an optimized OpenCL kernel achieves 84% of the performance of an optimized Verilog version of the code on a 7680×4320 (8K) image. Conversely, relative to productivity, OpenCL offers a 6.1x improvement in productivity over Verilog, while oneAPI improves the productivity by an additional factor of 1.25x over OpenCL.
- On the Three P's of Parallel Programming for Heterogeneous Computing: Performance, Productivity, and Portability
  Gondhalekar, Atharva; Feng, Wu-chun (IEEE, 2023-01-01)
  As FPGAs and GPUs continue to make inroads into high-performance computing (HPC), the need for languages and frameworks that offer performance, productivity, and portability across heterogeneous platforms, such as FPGAs and GPUs, continues to grow. OpenCL and SYCL have emerged as frameworks that offer cross-platform functional portability between FPGAs and GPUs. While functional portability across a diverse set of platforms is an important feature of portable frameworks, achieving performance portability often requires vendor- and platform-specific optimizations. Achieving performance portability, therefore, comes at the expense of productivity. This paper presents a quantification of the tradeoffs between performance, portability, and productivity of OpenCL and SYCL. It extends and complements our prior work on quantifying performance-productivity tradeoffs between Verilog and OpenCL for the FPGA. In addition to evaluating the performance-productivity tradeoffs between OpenCL and SYCL, this work quantifies the performance portability (PP) of OpenCL and SYCL as well as their code convergence (CC), i.e., a measure of productivity across different platforms (e.g., FPGA and GPU). Using two applications as case studies (i.e., edge detection using the Sobel filter, and graph link prediction using the Jaccard similarity index), we characterize the tradeoffs between performance, portability, and productivity. Our results show that OpenCL and SYCL offer complementary tradeoffs. While OpenCL delivers better performance portability than SYCL, SYCL offers better code convergence and a 1.6x improvement in source lines of code over OpenCL.
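
For readers unfamiliar with the Jaccard similarity index used as a case study in two of the entries above, the following is a minimal sketch in plain Python. It assumes the textbook definition (size of the intersection of two vertices' neighbor sets divided by the size of their union), computed only for vertex pairs that share an edge. The abstracts do not spell out the exact formulation used in the papers, so the function name `edge_jaccard` and the edge-list input format are illustrative assumptions, not the authors' implementation.

```python
def edge_jaccard(edges):
    """Return {(u, v): score} for each edge of an undirected graph.

    Assumption: score = |N(u) & N(v)| / |N(u) | N(v)|, where N(x) is the
    neighbor set of x. This is the standard Jaccard index, not necessarily
    the exact variant used in the listed papers.
    """
    # Build neighbor sets from the undirected edge list.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Score only the vertex pairs that are connected by an edge.
    scores = {}
    for u, v in edges:
        shared = len(neighbors[u] & neighbors[v])
        total = len(neighbors[u] | neighbors[v])
        scores[(u, v)] = shared / total if total else 0.0
    return scores


if __name__ == "__main__":
    # A triangle (0, 1, 2) with one pendant vertex 3 attached to vertex 2.
    demo = [(0, 1), (0, 2), (1, 2), (2, 3)]
    for edge, score in edge_jaccard(demo).items():
        print(edge, round(score, 3))
```

Each edge reads two neighbor lists whose lengths vary widely across real graphs, which is one way to picture the irregular access patterns the abstracts describe.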