ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression
Published in Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2026
We propose ZipServ, a hardware-aware lossless compression framework for large language model (LLM) inference. ZipServ co-optimizes compression formats, memory layout, and GPU execution to reduce memory footprint and bandwidth pressure while preserving exact numerical correctness. Our evaluation on state-of-the-art LLMs demonstrates significant speedups and memory savings over existing inference systems, without sacrificing output quality.
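To make the "lossless" guarantee concrete, below is a minimal, hypothetical Python sketch: a generic codec (zlib) stands in for ZipServ's hardware-aware compression format, and the check confirms the round trip over a block of FP16 weights is bit-exact. This is an illustration of lossless compression in general, not the paper's actual implementation.

```python
import zlib
import numpy as np

# Hypothetical stand-in: zlib here plays the role of a lossless codec;
# ZipServ's own formats and GPU kernels are not reproduced in this sketch.
rng = np.random.default_rng(0)
weights = rng.standard_normal(4096).astype(np.float16)  # a block of FP16 weights

raw = weights.tobytes()
compressed = zlib.compress(raw, level=6)

restored = np.frombuffer(zlib.decompress(compressed), dtype=np.float16)

# Lossless means the decompressed bytes are identical bit for bit,
# so inference results are numerically unchanged.
assert np.array_equal(weights.view(np.uint16), restored.view(np.uint16))
print(f"compression ratio: {len(raw) / len(compressed):.2f}x")
```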
Recommended citation: R. Fan et al., "ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression," in ASPLOS 2026. (CCF-A)
Download Paper | Code | Download BibTeX
