ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression
Published in Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2026
We propose ZipServ, a hardware-aware lossless compression framework for large language model (LLM) inference. ZipServ co-optimizes compression formats, memory layout, and GPU execution to reduce memory footprint and bandwidth pressure while preserving exact numerical correctness. Our evaluation on state-of-the-art LLMs demonstrates significant speedups and memory savings over existing inference systems, without sacrificing output quality.
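To make the "lossless" guarantee concrete, below is a minimal, hypothetical Python sketch: a generic codec (zlib) stands in for ZipServ's hardware-aware compression format, and the check confirms the round trip over a block of FP16 weights is bit-exact. This is an illustration of lossless compression in general, not the paper's actual implementation.

```python
import zlib
import numpy as np

# Hypothetical stand-in: zlib here plays the role of a lossless codec;
# ZipServ's own formats and GPU kernels are not reproduced in this sketch.
rng = np.random.default_rng(0)
weights = rng.standard_normal(4096).astype(np.float16)  # a block of FP16 weights

raw = weights.tobytes()
compressed = zlib.compress(raw, level=6)

restored = np.frombuffer(zlib.decompress(compressed), dtype=np.float16)

# Lossless means the decompressed bytes are identical bit for bit,
# so inference results are numerically unchanged.
assert np.array_equal(weights.view(np.uint16), restored.view(np.uint16))
print(f"compression ratio: {len(raw) / len(compressed):.2f}x")
```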
Recommended citation: R. Fan et al., "ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression," in ASPLOS 2026. (CCF-A)
Download Paper | Code | Download BibTeX
