LLM Compression Benchmarking
Published: June 2026
Introduction
The Arc of Modern AI — From Rules to Reasoning to Agency
To understand why compressing a large language model matters, it helps to understand where that model sits in the broader landscape of artificial intelligence.
The first wave of practical AI was discriminative AI — systems trained to draw a boundary between categories. A spam filter deciding inbox or junk. A radiology model flagging tumour or clear. An image classifier labelling cat or dog. These systems are powerful within their lane, but they do one thing: they map an input to a label. They do not generate, explain, or reason beyond that boundary.
The second wave — the one that changed everything — is generative AI, and at its core sits the Large Language Model (LLM). An LLM is not a classifier. It is a probability engine trained on vast corpora of human text to predict, token by token, what comes next. The result is a system that can write, reason, translate, summarise, and converse — not because it was explicitly programmed to do any of those things, but because it learned the statistical structure of language at a scale that produces emergent capability. Models like GPT-4, Claude, Gemini, and the open-source Llama family are LLMs.
The third wave — still forming — is agentic AI: systems where an LLM is not just answering a question but acting in the world. It calls tools, browses the web, writes and executes code, and chains multi-step plans autonomously. The LLM becomes the reasoning engine inside a larger loop. Agentic AI is where the field is heading, and it makes the efficiency of the underlying LLM more critical than ever — because an agent that calls a model dozens of times per task cannot afford a model that takes seconds per token.
The Problem This Project Addresses
In plain terms
The most capable open-source AI models are too large to run on a normal computer. This project explores how to shrink them without meaningfully breaking them — and measures exactly how much is lost in the process.
Technically
State-of-the-art open-weight LLMs are distributed in full FP16 (16-bit floating-point) precision, where each of the model's billions of parameters occupies two bytes of memory. A model with 8 billion parameters therefore requires approximately 16 GB of memory at inference time — before accounting for the KV cache (key-value cache: the memory used to store intermediate attention computations across a conversation), which adds further overhead proportional to context length. Consumer GPU (Graphics Processing Unit) hardware typically provides 8–12 GB of VRAM (Video RAM: the dedicated high-speed memory on a GPU). The arithmetic is unforgiving: the model does not fit.
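The arithmetic above can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the layer count, KV-head count, and head dimension are assumptions taken from the published Llama 3.1 8B configuration (32 layers, 8 KV heads, head dimension 128). Other models will differ.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def kv_cache_gb(context_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV cache size: one key and one value vector per layer, per KV head, per token.

    The default architecture values are assumptions (Llama 3.1 8B, FP16 cache).
    """
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 1e9


if __name__ == "__main__":
    weights = weight_memory_gb(8e9)   # 8 billion parameters at 2 bytes each
    cache = kv_cache_gb(8192)         # cache for an 8192-token context
    print(f"FP16 weights: {weights:.1f} GB")
    print(f"KV cache at 8192 tokens: {cache:.2f} GB")
    print(f"Fits in 12 GB of VRAM: {weights + cache <= 12}")
```

Under these assumptions the weights alone account for 16 GB and an 8192-token cache adds roughly 1 GB more, so the total exceeds a 12 GB card before any activations or runtime overhead are counted.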
This is not a niche problem. It is the central barrier between the open-source AI ecosystem and the billions of people who do not have access to cloud compute budgets or data-centre hardware. If an LLM cannot run locally, it cannot be private, it cannot be offline, and it cannot be free.
I undertook this project to answer a question I consider practically important: how far can an open-source LLM be compressed before its usefulness meaningfully degrades — and can that threshold be measured rigorously on consumer hardware?
What Is LLM Compression?
LLM compression is the field of techniques that reduce the memory footprint — and often increase the inference speed — of a trained language model, without retraining it from scratch. Three principal strategies exist in current research and deployment:
1. Quantization
Replaces high-precision floating-point weights with lower-precision values, typically integers (e.g. FP16 to INT4), drastically reducing memory. The rounding introduces quantization error, which methods such as GPTQ, AWQ, and the mixed-precision K-quants used in GGUF files are designed to limit.
2. Pruning
Removes weights, neurons, or entire layers that contribute least to the model's outputs. Structured pruning removes whole components, which shrinks the model on any hardware. Unstructured pruning removes individual weights, but the resulting sparse matrices only run faster on kernels or hardware that support sparsity.
3. Distillation
Trains a smaller student model to mimic the output distribution and confidence of a larger teacher model. It produces a compact model that punches above its weight but requires significant compute and a full training run.
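To make the quantization strategy concrete, here is a minimal sketch of symmetric round-to-nearest quantization applied to one small block of weights. It is deliberately the simplest possible scheme and is not how GPTQ, AWQ, or the llama.cpp K-quants work internally. The function names and the example values are hypothetical.

```python
def quantize_block(weights, bits=4):
    """Symmetric absmax quantization of one block of floats to signed integers.

    Returns the integer codes and the single floating-point scale shared
    by the whole block.
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit codes
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0    # avoid dividing by zero
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_block(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    block = [0.12, -0.53, 0.07, 0.91, -0.33, 0.0, 0.48, -0.79]  # made-up values
    codes, scale = quantize_block(block, bits=4)
    restored = dequantize_block(codes, scale)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print("codes:", codes)
    print(f"scale: {scale:.4f}, worst-case error: {worst:.4f}")
```

The worst-case error per weight is bounded by half the scale, which is why schemes that use smaller blocks or smarter scale selection lose less quality at the same bit width.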
Why Quantization — and Why GGUF?
I chose quantization for three primary reasons:
- No retraining required: Pruning and distillation demand compute resources not available on consumer hardware. Quantization is applied post-training, in minutes.
- Mature, free tooling: The llama.cpp library provides a complete, well-documented pipeline to convert from HuggingFace format to GGUF, quantize, and run inference locally without a paid account.
- Well-characterised tradeoff: Recent empirical work shows that 4-bit K-quants retain approximately 96–98% of FP16 quality while reducing model size by 3× or more.
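The convert-then-quantize pipeline mentioned above can be sketched as a small helper that assembles the command lines. This is an assumption-laden illustration: the directory layout, output file names, and the location of the llama-quantize binary (here the default CMake build path) are hypothetical and depend on how llama.cpp was cloned and built. The helper only builds the commands and does not run them.

```python
from pathlib import Path


def build_gguf_commands(model_dir, out_dir,
                        levels=("Q8_0", "Q5_K_M", "Q4_K_M"),
                        llama_cpp_dir="llama.cpp"):
    """Return argv lists: one HF-to-GGUF conversion, then one quantize call per level.

    All paths are placeholders; adjust them to the local checkout and build.
    """
    out = Path(out_dir)
    root = Path(llama_cpp_dir)
    f16_path = out / "model-f16.gguf"

    # Step 1: convert the HuggingFace checkpoint to an FP16 GGUF file.
    commands = [[
        "python", str(root / "convert_hf_to_gguf.py"), str(model_dir),
        "--outfile", str(f16_path), "--outtype", "f16",
    ]]

    # Step 2: quantize the FP16 GGUF once per target level.
    quantize_bin = root / "build" / "bin" / "llama-quantize"  # assumed build location
    for level in levels:
        commands.append([
            str(quantize_bin), str(f16_path),
            str(out / f"model-{level}.gguf"), level,
        ])
    return commands


if __name__ == "__main__":
    for cmd in build_gguf_commands("Llama-3.1-8B-Instruct", "gguf"):
        print(" ".join(cmd))
```

Each returned list can be passed to subprocess.run(cmd, check=True) once the paths have been confirmed against the local installation.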
GGUF Quantization Levels Selected
- Q8_0: 8-bit quantization (about 8.5 bits per weight once the per-block scales are counted). Near-lossless. Largest of the three compressed variants.
- Q5_K_M: 5.7-bit mixed precision. The practical sweet spot for GPUs in the 8–12 GB VRAM range.
- Q4_K_M: 4.9-bit mixed precision. Maximum compression while remaining within acceptable quality bounds.
The FP16 original serves as the uncompressed baseline against which all three are measured.
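A rough size estimate for each variant follows from multiplying the parameter count by the effective bits per weight. The sketch below assumes 8 billion parameters and uses the 5.7 and 4.9 figures quoted above. The 8.5 bits per weight for Q8_0 is an assumption based on its storage layout (32 eight-bit weights plus one FP16 scale per block). Real GGUF files also contain metadata and some tensors kept at other precisions, so measured sizes will differ somewhat.

```python
# Effective bits per weight; the Q8_0 value is an assumption (see lead-in).
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.9}


def estimated_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


def compression_table(n_params=8e9):
    """Return (variant, size in GB, compression ratio versus FP16) for each level."""
    baseline = estimated_size_gb(n_params, BITS_PER_WEIGHT["FP16"])
    rows = []
    for name, bpw in BITS_PER_WEIGHT.items():
        size = estimated_size_gb(n_params, bpw)
        rows.append((name, size, baseline / size))
    return rows


if __name__ == "__main__":
    for name, size, ratio in compression_table():
        print(f"{name:8s} {size:5.1f} GB  {ratio:4.2f}x smaller than FP16")
```

These are estimates for planning which variants fit a given GPU, not substitutes for the measured file sizes and VRAM figures the benchmark records.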
Methods
Overview
This project is structured as a controlled local experiment. I run the same model at four precision levels — FP16, Q8_0, Q5_K_M, and Q4_K_M — on identical hardware under identical conditions, and measure three dimensions: inference speed (tokens per second), memory consumption (VRAM in GB), and output quality (manually scored on a fixed prompt set). The results are compiled into a structured data file and visualised as static charts embedded in a portfolio page.
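One way the speed measurement and the structured results record could look is sketched below. This is not the project's actual harness: the generate callable, the record fields, and the stand-in generator are all hypothetical, and the stand-in exists only so the timing code can run without a model.

```python
import json
import time


def measure_tokens_per_second(generate, prompt, max_tokens=128):
    """Time one generation call; `generate` must return the list of produced tokens."""
    start = time.perf_counter()
    tokens = generate(prompt, max_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed if elapsed > 0 else float("inf")


def make_record(variant, tokens_per_second, vram_gb, quality_scores):
    """Bundle the three measured dimensions for one precision level."""
    return {
        "variant": variant,
        "tokens_per_second": round(tokens_per_second, 2),
        "vram_gb": vram_gb,
        "mean_quality": sum(quality_scores) / len(quality_scores),
    }


def stand_in_generate(prompt, max_tokens):
    """Placeholder for a real model call: sleeps briefly, returns dummy tokens."""
    time.sleep(0.01)
    return ["tok"] * max_tokens


if __name__ == "__main__":
    tps = measure_tokens_per_second(stand_in_generate, "Hello", max_tokens=16)
    # Only the timing path is exercised here; this is not a benchmark result.
    print(json.dumps({"stand_in_tokens_per_second": round(tps, 1)}))
```

In a real run, the callable would wrap the local inference call for each of the four variants, and one record per variant would be written to the results file that feeds the charts.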
No cloud compute is used at any stage. No external APIs are called during inference. The entire pipeline — from model download to chart generation — runs on a single local machine.
Implementation Methodology
- Phase 0, Environment Preparation: Configuring WSL2 and CUDA
- Phase 1, Baseline: Llama-3.1 8B Instruct
- Phase 2, Quantization: Q8_0, Q5_K_M, Q4_K_M
- Phase 3, Benchmarking: Speed, Memory, Quality
- Phase 4, Chart Generation: Visualising tradeoffs
- Phase 5, Portfolio Integration: SvelteKit Portfolio