Achieving Low Latency Inference on High Resolution Images by Exploiting Sparsity in Vision Transformers
2024. This paper presents a tile-aware sparse attention scheduling framework for improving the efficiency of structured sparse vision transformers on GPUs. The method represents attention masks as adjacency matrices, applies structure-aware reordering to expose dense computation blocks, and uses offline profiling with Integer Linear Programming (ILP) to select optimal tile shapes under hardware constraints. Integrated into models such as Vision Longformer, RegionViT, and DynamicViT, the framework achieves up to 2.1× end-to-end latency speedup over fixed-tile FlashAttention. The results show that aligning sparse attention computation with both sparsity structure and GPU characteristics can substantially improve inference efficiency.
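The core idea can be illustrated with a minimal Python sketch: treat the attention mask as a binary matrix, reorder it to cluster nonzeros into fewer dense blocks, and choose the tile shape that minimizes estimated kernel time. Here an exhaustive search over a small candidate set stands in for the paper's ILP, and the per-tile costs are hypothetical values standing in for offline profiling; function names are illustrative, not from the paper.

```python
import numpy as np

def nonempty_tiles(mask, tile):
    """Count tiles of shape `tile` that contain at least one attended entry;
    each such tile must be computed as a dense block by the kernel."""
    th, tw = tile
    n, m = mask.shape
    count = 0
    for i in range(0, n, th):
        for j in range(0, m, tw):
            if mask[i:i + th, j:j + tw].any():
                count += 1
    return count

def reorder_by_degree(mask):
    """Structure-aware reordering (simplified): sort queries/keys by the
    number of attended positions so nonzeros cluster into fewer blocks."""
    order = np.argsort(-mask.sum(axis=1))
    return mask[np.ix_(order, order)]

def pick_tile(mask, tile_costs):
    """Select the tile shape minimizing estimated kernel time.
    `tile_costs` maps tile shape -> cost per nonempty tile (hypothetical
    profiled numbers); brute force stands in for the paper's ILP."""
    return min(tile_costs, key=lambda t: nonempty_tiles(mask, t) * tile_costs[t])
```

For a banded (Longformer-style) mask, smaller tiles track the band tightly and leave fewer wasted zero entries per block, while larger tiles amortize launch overhead better; the profiled cost table is what arbitrates that trade-off.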
Optimizing Deployment of Unstructured Group Convolutions for Low Latency Inference
2024. This paper presents an optimization framework for deploying unstructured group convolutions efficiently on GPUs. The method combines Knapsack-based partitioning, Integer Linear Programming (ILP), and matrix reordering strategies to improve load balancing and data reuse for irregular input-output channel connections. Evaluated on ShuffleNet and CondenseNet, the framework achieves up to 1.9× speedup over PyTorch, while reordering-enhanced ILP provides an additional 1.3× improvement. The results highlight the importance of hardware-aware scheduling for accelerating irregular CNN workloads.
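The load-balancing aspect can be sketched in a few lines: irregular channel groups have uneven costs, and they must be packed onto a fixed number of GPU work units so that no unit dominates the runtime. The greedy longest-processing-time heuristic below is a simple stand-in for the paper's Knapsack/ILP formulation; the bin abstraction and function name are illustrative assumptions.

```python
def partition_groups(group_costs, num_bins):
    """Pack irregular channel groups into `num_bins` work units (e.g. GPU
    thread blocks) to balance load. Greedy longest-processing-time
    packing: assign each group, heaviest first, to the currently
    least-loaded bin. A stand-in for the paper's Knapsack/ILP approach."""
    bins = [[] for _ in range(num_bins)]
    loads = [0.0] * num_bins
    for gid, cost in sorted(enumerate(group_costs), key=lambda x: -x[1]):
        b = loads.index(min(loads))  # least-loaded bin
        bins[b].append(gid)
        loads[b] += cost
    return bins, loads
```

An exact ILP can close the remaining imbalance gap that the greedy heuristic leaves, at the price of offline solve time, which is acceptable because the partition is computed once per model before deployment.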
Exploring Algorithmic Design Choices for Low Latency CNN Deployment
2024. This paper investigates algorithmic design choices for reducing latency in CNN deployment across diverse hardware platforms. Five convolution algorithms are implemented using SYCL and integrated into VGG16, ResNet101, and InceptionV4 by replacing the standard PyTorch Conv2d operator. Their performance is evaluated at both the layer and model level on GPUs against PyTorch and Intel PyTorch Extension baselines. Results show significant execution-time improvements, demonstrating the effectiveness of algorithm-level optimization for low-latency CNN inference.
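To make the "algorithm choice" concrete, here is one of the classic alternatives a Conv2d operator can be lowered to: im2col followed by a single GEMM. This is a plain NumPy illustration of the algorithmic structure, not the paper's SYCL implementation, and it assumes no padding and a single image for brevity.

```python
import numpy as np

def conv2d_im2col(x, w, stride=1):
    """Convolution via im2col + GEMM, one classic algorithm choice that can
    replace a framework Conv2d. x: (C, H, W) input, w: (K, C, R, S) filters.
    Each receptive-field patch is unrolled into a column, so the whole
    convolution becomes one matrix multiply."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    Ho = (H - R) // stride + 1
    Wo = (W - S) // stride + 1
    cols = np.empty((C * R * S, Ho * Wo))
    idx = 0
    for i in range(Ho):
        for j in range(Wo):
            patch = x[:, i * stride:i * stride + R, j * stride:j * stride + S]
            cols[:, idx] = patch.ravel()
            idx += 1
    out = w.reshape(K, -1) @ cols  # the GEMM
    return out.reshape(K, Ho, Wo)
```

The trade-off this exposes is typical of the design space the paper explores: im2col converts convolution into a highly optimized GEMM at the cost of materializing a larger intermediate buffer, whereas direct or Winograd-style algorithms spend less memory but depend more heavily on hand-tuned kernels.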