Browsing by Author "Feng, Wu-Chun"
Now showing 1 - 2 of 2
- A 3D Deep Learning Architecture for Denoising Low-Dose CT Scans
  Kasparian, Armen Caspar (Virginia Tech, 2024-04-11)
  This paper introduces 3D-DDnet, a cutting-edge 3D deep learning (DL) framework designed to improve the image quality of low-dose computed tomography (LDCT) scans. Although LDCT scans are advantageous for reducing radiation exposure, they inherently suffer from reduced image quality. Our novel 3D DL architecture addresses this issue by effectively enhancing LDCT images to achieve parity with the quality of standard-dose CT scans. By exploiting the inter-slice correlation present in volumetric CT data, 3D-DDnet surpasses existing denoising benchmarks. It incorporates distributed data parallel (DDP) and transfer learning techniques to significantly accelerate the training process. The DDP approach is particularly tailored for operation across multiple Nvidia A100 GPUs, facilitating the processing of large-scale volumetric data sets that were previously unmanageable due to size constraints. Comparative analyses demonstrate that 3D-DDnet reduces the mean square error (MSE) by 10% over its 2D counterpart, 2D-DDnet. Moreover, by applying transfer learning from pre-trained 2D models, 3D-DDnet effectively 'jump starts' the learning process, cutting training times by half without compromising on model accuracy.
- BLP: Block-Level Pipelining for GPUs
  Feng, Wu-Chun; Cui, Xuewen; Scogland, Thomas; De Supinski, Bronis (ACM, 2024-05-07)
  Programming models like OpenMP offer expressive interfaces to program graphics processing units (GPUs) via directive-based offload. By default, these models copy data to or from the device without overlapping computation, thus impacting performance. Rather than leave the onerous task of manually pipelining and tuning data communication and computation to the end user, we propose an OpenMP extension that supports block-level pipelining and, in turn, present our block-level pipelining (BLP) approach that overlaps data communication and computation in a single kernel. BLP uses persistent thread blocks with cooperative thread groups to process sub-tasks on different streaming multiprocessors and uses GPU flag arrays to enforce task dependencies without CPU involvement. To demonstrate the efficacy of BLP, we evaluate its performance using multiple benchmarks on NVIDIA V100 GPUs. Our experimental results show that BLP achieves 95% to 114% of the performance of hand-tuned kernel-level pipelining. In addition, using BLP with buffer mapping can reduce memory usage to support GPU memory oversubscription. We also show that BLP can reduce memory usage by 75% to 86% for data sets that exceed GPU memory while providing significantly better performance than CUDA Unified Memory (UM) with prefetching optimizations.
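
The 3D-DDnet abstract above rests on exploiting the inter-slice correlation present in volumetric CT data. As a minimal, hypothetical CUDA sketch (not the paper's architecture, which is a full deep network), the kernel below applies a single 3x3x3 filtering pass over a volume stored in z-y-x order; the point is that each output voxel reads the slices directly above and below it, context that a slice-by-slice 2D denoiser never sees. The names conv3d_3x3x3, vol, and w are assumptions for illustration.

```cuda
// Hypothetical illustration only: one 3x3x3 filtering pass over a CT volume,
// showing how a 3D operator reads neighboring slices (z-1, z, z+1) that a
// slice-by-slice 2D filter never sees. This is not the 3D-DDnet model itself.
#include <cuda_runtime.h>

__global__ void conv3d_3x3x3(const float* vol, const float* w, float* out,
                             int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= nx || y >= ny || z >= nz) return;

    float acc = 0.0f;
    for (int dz = -1; dz <= 1; ++dz)          // neighboring slices: inter-slice context
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                int xx = min(max(x + dx, 0), nx - 1);   // clamp at volume borders
                int yy = min(max(y + dy, 0), ny - 1);
                int zz = min(max(z + dz, 0), nz - 1);
                float v = vol[(size_t)zz * ny * nx + (size_t)yy * nx + xx];
                float k = w[(dz + 1) * 9 + (dy + 1) * 3 + (dx + 1)];
                acc += v * k;
            }
    out[(size_t)z * ny * nx + (size_t)y * nx + x] = acc;
}
```

A 3D launch such as dim3 block(8, 8, 4) with a grid covering (nx, ny, nz) would drive this kernel; stacking many such learned 3D operators is the general direction the abstract describes, while DDP and 2D-to-3D transfer learning address the training cost.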
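The BLP abstract above names two mechanisms: persistent thread blocks that stay resident and pull sub-tasks, and GPU flag arrays that gate each sub-task on its inputs without CPU involvement. The sketch below is a minimal, hypothetical consumer side of that idea, assuming some producer (for example, an asynchronous host-to-device copy or an earlier pipeline stage) writes each chunk, issues a memory fence, and then sets that chunk's entry in a ready array; blp_compute, ready, and the placeholder computation are illustrative and not the authors' code or their proposed OpenMP directive syntax.

```cuda
// Hypothetical sketch of block-level pipelining: persistent thread blocks pick
// up chunk-sized sub-tasks and spin on a device-side flag array until each
// chunk's input has been staged, so the dependency is enforced on the GPU
// without a CPU round trip. Not the authors' implementation.
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void blp_compute(const float* in, float* out,
                            volatile int* ready,   // ready[c] != 0 once chunk c is staged
                            int num_chunks, int chunk_elems)
{
    cg::thread_block block = cg::this_thread_block();

    // Persistent blocks: each block strides over chunks instead of exiting.
    for (int c = blockIdx.x; c < num_chunks; c += gridDim.x) {
        // One thread per block waits for the producer to publish the chunk's
        // flag; the producer is assumed to fence its data writes before setting it.
        if (threadIdx.x == 0) {
            while (ready[c] == 0) { /* spin on device-visible flag */ }
        }
        block.sync();              // all threads in the block proceed once the chunk is staged

        // Compute on the chunk; a placeholder scale stands in for the real kernel body.
        for (int i = threadIdx.x; i < chunk_elems; i += blockDim.x) {
            size_t idx = (size_t)c * chunk_elems + i;
            out[idx] = 2.0f * in[idx];
        }
        __threadfence();           // make results visible before any downstream flag is set
    }
}
```

Launching a fixed grid roughly sized to the number of streaming multiprocessors keeps the blocks persistent, which is what allows staging of chunk c+1 to overlap with computation on chunk c inside a single kernel.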