Case Western Reserve University
Changxin Li

PhD Student

Achieving Low Latency Inference on High Resolution Images by Exploiting Sparsity in Vision Transformers

2026

40th IEEE International Parallel & Distributed Processing Symposium (IPDPS)

This paper presents a tile-aware sparse attention scheduling framework for improving the efficiency of structured sparse vision transformers on GPUs. The method represents attention masks as adjacency matrices, applies structure-aware reordering to expose dense computation blocks, and uses offline profiling with Integer Linear Programming (ILP) to select optimal tile shapes under hardware constraints. Integrated into models such as Vision Longformer, RegionViT, and DynamicViT, the framework achieves up to 2.1× end-to-end latency speedup over fixed-tile FlashAttention. The results show that aligning sparse attention computation with both sparsity structure and GPU characteristics can substantially improve inference efficiency.
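As a rough illustration of the scheduling idea (not the paper's implementation), the sketch below reorders a binary attention mask to cluster nonzeros into dense blocks, then exhaustively picks the tile shape minimizing an estimated cost. The paper instead applies structure-aware reordering to the adjacency matrix and solves the tile-shape selection as an ILP over offline-profiled costs; all function names and the toy cost model here are hypothetical.

```python
import numpy as np

def cluster_reorder(mask):
    # Sort rows and columns by their sparsity pattern so identical
    # patterns become adjacent, exposing dense blocks (a simple stand-in
    # for the paper's structure-aware adjacency-matrix reordering).
    rows = sorted(range(mask.shape[0]), key=lambda i: tuple(mask[i]))
    cols = sorted(range(mask.shape[1]), key=lambda j: tuple(mask[:, j]))
    return mask[np.ix_(rows, cols)]

def active_tiles(mask, th, tw):
    # Count tiles of shape (th, tw) containing at least one nonzero;
    # only these tiles must be computed by a block-sparse kernel.
    n, m = mask.shape
    return sum(mask[i:i+th, j:j+tw].any()
               for i in range(0, n, th) for j in range(0, m, tw))

def pick_tile_shape(mask, candidates, tile_cost):
    # Choose the tile shape minimizing (active tiles) x (per-tile cost).
    # The paper formulates this selection as an ILP with costs measured
    # by offline profiling; exhaustive search suffices for a toy case.
    return min(candidates, key=lambda s: active_tiles(mask, *s) * tile_cost[s])
```

For a 4×4 mask with two interleaved dense 2×2 blocks, `cluster_reorder` gathers them so only two 2×2 tiles remain active, and `pick_tile_shape` then prefers 2×2 tiles over scalar or whole-matrix tiles under a cost model where larger tiles are cheaper per element.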

HPC

Optimizing Deployment of Unstructured Group Convolutions for Low Latency Inference

2025

2025 IEEE 32nd International Conference on High Performance Computing, Data, and Analytics (HiPC)

This paper presents an optimization framework for deploying unstructured group convolutions efficiently on GPUs. The method combines Knapsack-based partitioning, Integer Linear Programming (ILP), and matrix reordering strategies to improve load balancing and data reuse for irregular input-output channel connections. Evaluated on ShuffleNet and CondenseNet, the framework achieves up to 1.9× speedup over PyTorch, while reordering-enhanced ILP provides an additional 1.3× improvement. The results highlight the importance of hardware-aware scheduling for accelerating irregular CNN workloads.
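The load-balancing problem the paper attacks with Knapsack-based partitioning and ILP can be illustrated with a much simpler greedy heuristic: assign each group's work (here, the product of its input and output channel counts) to the least-loaded partition, heaviest groups first. This longest-processing-time sketch is hypothetical and only approximates the partitions an ILP would find.

```python
import heapq

def balance_groups(work, num_bins):
    # Longest-processing-time greedy: place each group, heaviest first,
    # into the currently least-loaded bin. A cheap approximation of the
    # Knapsack/ILP partitioning used to balance per-partition GPU work.
    bins = [(0, b, []) for b in range(num_bins)]  # (load, bin id, group ids)
    heapq.heapify(bins)
    for g in sorted(range(len(work)), key=lambda g: -work[g]):
        load, b, members = heapq.heappop(bins)
        heapq.heappush(bins, (load + work[g], b, members + [g]))
    return sorted(bins, key=lambda t: t[1])
```

With per-group work `[7, 5, 4, 3, 2, 1]` and two bins, the heuristic produces two perfectly balanced partitions of load 11; unstructured group convolutions make the real problem harder because partitioning interacts with data reuse, which is why the paper adds matrix reordering on top.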

HPC

Exploring Algorithmic Design Choices for Low Latency CNN Deployment

2024

2024 IEEE 31st International Conference on High Performance Computing, Data, and Analytics (HiPC)

This paper investigates algorithmic design choices for reducing latency in CNN deployment across diverse hardware platforms. Five convolution algorithms are implemented using SYCL and integrated into VGG16, ResNet101, and InceptionV4 by replacing the standard PyTorch Conv2d operator. Their performance is evaluated at both the layer and model level on GPUs against PyTorch and Intel Extension for PyTorch baselines. Results show significant execution-time improvements, demonstrating the effectiveness of algorithm-level optimization for low-latency CNN inference.
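To make the idea of swapping convolution algorithms concrete, here is a NumPy sketch (not the paper's SYCL code) of two classic formulations such a study compares: direct convolution and im2col + GEMM. Both compute the same valid-padding, stride-1 convolution; function names are hypothetical.

```python
import numpy as np

def conv2d_direct(x, w):
    # Direct convolution: loop over output positions and accumulate
    # the elementwise product of each input window with the filter.
    # x: (C_in, H, W), w: (C_out, C_in, K, K).
    cin, H, W = x.shape
    cout, _, K, _ = w.shape
    oh, ow = H - K + 1, W - K + 1
    out = np.zeros((cout, oh, ow))
    for co in range(cout):
        for i in range(oh):
            for j in range(ow):
                out[co, i, j] = np.sum(x[:, i:i+K, j:j+K] * w[co])
    return out

def conv2d_im2col(x, w):
    # im2col + GEMM: unfold input windows into a (C_in*K*K, oh*ow)
    # matrix, then reduce the convolution to one matrix multiply.
    cin, H, W = x.shape
    cout, _, K, _ = w.shape
    oh, ow = H - K + 1, W - K + 1
    cols = np.stack([x[:, i:i+K, j:j+K].ravel()
                     for i in range(oh) for j in range(ow)], axis=1)
    return (w.reshape(cout, -1) @ cols).reshape(cout, oh, ow)
```

The two produce identical results but very different memory-access and parallelization profiles, which is precisely the kind of trade-off that makes the best algorithm choice layer- and hardware-dependent.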

HPC