Achieving Low Latency Inference on High Resolution Images by Exploiting Sparsity in Vision Transformers
2024. This paper presents a tile-aware sparse attention scheduling framework for improving the efficiency of structured sparse vision transformers on GPUs. The method represents attention masks as adjacency matrices, applies structure-aware reordering to expose dense computation blocks, and uses offline profiling with Integer Linear Programming (ILP) to select optimal tile shapes under hardware constraints. Integrated into models such as Vision Longformer, RegionViT, and DynamicViT, the framework achieves up to 2.1× end-to-end latency speedup over fixed-tile FlashAttention. The results show that aligning sparse attention computation with both sparsity structure and GPU characteristics can substantially improve inference efficiency.
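The core idea can be illustrated with a minimal Python sketch: treat the attention mask as a binary matrix, reorder it to cluster nonzeros into fewer dense blocks, and choose the tile shape that minimizes estimated kernel time. Here an exhaustive search over a small candidate set stands in for the paper's ILP, and the per-tile costs are hypothetical values standing in for offline profiling; function names are illustrative, not from the paper.

```python
import numpy as np

def nonempty_tiles(mask, tile):
    """Count tiles of shape `tile` that contain at least one attended entry;
    each such tile must be computed as a dense block by the kernel."""
    th, tw = tile
    n, m = mask.shape
    count = 0
    for i in range(0, n, th):
        for j in range(0, m, tw):
            if mask[i:i + th, j:j + tw].any():
                count += 1
    return count

def reorder_by_degree(mask):
    """Structure-aware reordering (simplified): sort queries/keys by the
    number of attended positions so nonzeros cluster into fewer blocks."""
    order = np.argsort(-mask.sum(axis=1))
    return mask[np.ix_(order, order)]

def pick_tile(mask, tile_costs):
    """Select the tile shape minimizing estimated kernel time.
    `tile_costs` maps tile shape -> cost per nonempty tile (hypothetical
    profiled numbers); brute force stands in for the paper's ILP."""
    return min(tile_costs, key=lambda t: nonempty_tiles(mask, t) * tile_costs[t])
```

For a banded (Longformer-style) mask, smaller tiles track the band tightly and leave fewer wasted zero entries per block, while larger tiles amortize launch overhead better; the profiled cost table is what arbitrates that trade-off.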
Optimizing Deployment of Unstructured Group Convolutions for Low Latency Inference
2024. This paper presents an optimization framework for deploying unstructured group convolutions efficiently on GPUs. The method combines Knapsack-based partitioning, Integer Linear Programming (ILP), and matrix reordering strategies to improve load balancing and data reuse for irregular input-output channel connections. Evaluated on ShuffleNet and CondenseNet, the framework achieves up to 1.9× speedup over PyTorch, while reordering-enhanced ILP provides an additional 1.3× improvement. The results highlight the importance of hardware-aware scheduling for accelerating irregular CNN workloads.
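The load-balancing aspect can be sketched in a few lines: irregular channel groups have uneven costs, and they must be packed onto a fixed number of GPU work units so that no unit dominates the runtime. The greedy longest-processing-time heuristic below is a simple stand-in for the paper's Knapsack/ILP formulation; the bin abstraction and function name are illustrative assumptions.

```python
def partition_groups(group_costs, num_bins):
    """Pack irregular channel groups into `num_bins` work units (e.g. GPU
    thread blocks) to balance load. Greedy longest-processing-time
    packing: assign each group, heaviest first, to the currently
    least-loaded bin. A stand-in for the paper's Knapsack/ILP approach."""
    bins = [[] for _ in range(num_bins)]
    loads = [0.0] * num_bins
    for gid, cost in sorted(enumerate(group_costs), key=lambda x: -x[1]):
        b = loads.index(min(loads))  # least-loaded bin
        bins[b].append(gid)
        loads[b] += cost
    return bins, loads
```

An exact ILP can close the remaining imbalance gap that the greedy heuristic leaves, at the price of offline solve time, which is acceptable because the partition is computed once per model before deployment.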
Exploring Algorithmic Design Choices for Low Latency CNN Deployment
2024. This paper investigates algorithmic design choices for reducing latency in CNN deployment across diverse hardware platforms. Five convolution algorithms are implemented using SYCL and integrated into VGG16, ResNet101, and InceptionV4 by replacing the standard PyTorch Conv2d operator. Their performance is evaluated at both the layer and model level on GPUs against PyTorch and Intel PyTorch Extension baselines. Results show significant execution-time improvements, demonstrating the effectiveness of algorithm-level optimization for low-latency CNN inference.
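To make the "algorithm choice" concrete, here is one of the classic alternatives a Conv2d operator can be lowered to: im2col followed by a single GEMM. This is a plain NumPy illustration of the algorithmic structure, not the paper's SYCL implementation, and it assumes no padding and a single image for brevity.

```python
import numpy as np

def conv2d_im2col(x, w, stride=1):
    """Convolution via im2col + GEMM, one classic algorithm choice that can
    replace a framework Conv2d. x: (C, H, W) input, w: (K, C, R, S) filters.
    Each receptive-field patch is unrolled into a column, so the whole
    convolution becomes one matrix multiply."""
    C, H, W = x.shape
    K, _, R, S = w.shape
    Ho = (H - R) // stride + 1
    Wo = (W - S) // stride + 1
    cols = np.empty((C * R * S, Ho * Wo))
    idx = 0
    for i in range(Ho):
        for j in range(Wo):
            patch = x[:, i * stride:i * stride + R, j * stride:j * stride + S]
            cols[:, idx] = patch.ravel()
            idx += 1
    out = w.reshape(K, -1) @ cols  # the GEMM
    return out.reshape(K, Ho, Wo)
```

The trade-off this exposes is typical of the design space the paper explores: im2col converts convolution into a highly optimized GEMM at the cost of materializing a larger intermediate buffer, whereas direct or Winograd-style algorithms spend less memory but depend more heavily on hand-tuned kernels.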