Exploiting Low-Level Sparsity for Efficient Large Language Model Inference on GPUs with SpInfer

Invited submission to ACM Transactions on Computer Systems (TOCS), 2025 (under review)

This journal extension of SpInfer presents a more comprehensive study of exploiting low-level sparsity for efficient LLM inference on GPUs. Compared with the conference version, it extends the methodology, deepens the analysis, and adds experiments, further validating SpInfer's effectiveness and generality.

Recommended citation: **Ruibo Fan**, et al., "Exploiting Low-Level Sparsity for Efficient Large Language Model Inference on GPUs with SpInfer," *ACM Transactions on Computer Systems (TOCS)*, invited, under review.