SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs

Published in Proceedings of the Twentieth European Conference on Computer Systems (EuroSys), 2025

SpInfer exploits low-level sparsity in LLM computation on GPUs to accelerate inference. By co-designing sparse kernels, data layouts, and runtime scheduling, SpInfer skips unnecessary computation and memory traffic while preserving model accuracy, delivering significantly lower latency and higher throughput than dense baselines.
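As a toy illustration of the core idea — storing only nonzero weights so computation and memory traffic scale with the number of nonzeros rather than the full matrix size — here is a minimal CSR-style sparse matrix-vector product in NumPy. This is a sketch for intuition only; SpInfer's actual contribution is co-designed GPU kernels, data layouts, and scheduling, not this Python loop.

```python
import numpy as np

def to_csr(dense):
    """Convert a dense 2-D array to CSR (row pointers, column indices, values)."""
    indptr, indices, values = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0.0:
                indices.append(j)
                values.append(v)
        indptr.append(len(values))
    return np.array(indptr), np.array(indices), np.array(values, dtype=np.float64)

def csr_matvec(indptr, indices, values, x):
    """y = W @ x, touching only the stored nonzeros."""
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        start, end = indptr[i], indptr[i + 1]
        y[i] = values[start:end] @ x[indices[start:end]]
    return y

# A 75%-sparse weight matrix: only 4 of 16 entries are stored and multiplied.
W = np.array([[0., 2., 0., 0.],
              [0., 0., 0., 3.],
              [1., 0., 0., 0.],
              [0., 0., 4., 0.]])
x = np.array([1., 1., 1., 1.])

indptr, indices, values = to_csr(W)
y = csr_matvec(indptr, indices, values, x)
assert np.allclose(y, W @ x)  # matches the dense product with 4x fewer multiplies
```

The dense product performs 16 multiply-accumulates here; the sparse version performs 4, which is the kind of saving a GPU kernel can realize only if the storage format and scheduling keep memory access efficient — the problem SpInfer addresses.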

Recommended citation: **Ruibo Fan**, et al., "SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs," in *Proceedings of the 20th European Conference on Computer Systems (EuroSys)*, 2025.
Download Paper | Code | Download BibTeX