LLM Compression Benchmarking

Published: June 2026



Introduction

The Arc of Modern AI — From Rules to Reasoning to Agency

To understand why compressing a large language model matters, it helps to understand where that model sits in the broader landscape of artificial intelligence.

The first wave of practical AI was discriminative AI — systems trained to draw a boundary between categories. A spam filter deciding inbox or junk. A radiology model flagging tumour or clear. An image classifier labelling cat or dog. These systems are powerful within their lane, but they do one thing: they map an input to a label. They do not generate, explain, or reason beyond that boundary.

The second wave — the one that changed everything — is generative AI, and at its core sits the Large Language Model (LLM). An LLM is not a classifier. It is a probability engine trained on vast corpora of human text to predict, token by token, what comes next. The result is a system that can write, reason, translate, summarise, and converse — not because it was explicitly programmed to do any of those things, but because it learned the statistical structure of language at a scale that produces emergent capability. Models like GPT-4, Claude, Gemini, and the open-source Llama family are LLMs.

The third wave — still forming — is agentic AI: systems where an LLM is not just answering a question but acting in the world. It calls tools, browses the web, writes and executes code, and chains multi-step plans autonomously. The LLM becomes the reasoning engine inside a larger loop. Agentic AI is where the field is heading, and it makes the efficiency of the underlying LLM more critical than ever — because an agent that calls a model dozens of times per task cannot afford a model that takes seconds per token.


The Problem This Project Addresses

In plain terms

The most capable open-source AI models are too large to run on a normal computer. This project explores how to shrink them without meaningfully breaking them — and measures exactly how much is lost in the process.

Technically

State-of-the-art open-weight LLMs are distributed in full FP16 (16-bit floating-point) precision, where each of the model's billions of parameters occupies two bytes of memory. A model with 8 billion parameters therefore requires approximately 16 GB of memory for its weights alone at inference time. The KV cache (key-value cache: the memory that holds the attention keys and values of every token already processed, so they need not be recomputed) adds further overhead proportional to context length. Consumer GPU (Graphics Processing Unit) hardware typically provides 8–12 GB of VRAM (Video RAM: the dedicated high-speed memory on a GPU). The arithmetic is unforgiving: the model does not fit.
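The arithmetic above can be sketched in a few lines. This is a rough estimate only: the layer count, number of KV heads, head dimension, and context length below are assumed values for a typical 8-billion-parameter architecture, not figures taken from this project.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def kv_cache_gb(n_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: float = 2.0) -> float:
    """KV cache size: one key and one value vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens / 1e9


# FP16 weights of an 8-billion-parameter model: two bytes per parameter.
weights = weight_memory_gb(8e9)

# Assumed architecture and context length, for illustration only.
cache = kv_cache_gb(n_tokens=8192, n_layers=32, n_kv_heads=8, head_dim=128)

print(f"weights: {weights:.1f} GB, KV cache at 8192 tokens: {cache:.2f} GB")
```

Even before the cache is counted, the weights alone exceed the 8–12 GB of VRAM on a typical consumer GPU.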

This is not a niche problem. It is the central barrier between the open-source AI ecosystem and the billions of people who do not have access to cloud compute budgets or data-centre hardware. If an LLM cannot run locally, it cannot be private, it cannot be offline, and it cannot be free.

I undertook this project to answer a question I consider practically important: how far can an open-source LLM be compressed before its usefulness meaningfully degrades — and can that threshold be measured rigorously on consumer hardware?


What Is LLM Compression?

LLM compression is the field of techniques that reduce the memory footprint — and often increase the inference speed — of a trained language model, without retraining it from scratch. Three principal strategies exist in current research and deployment:

1. Quantization


Replaces high-precision floating-point weights with lower-precision values, usually integers that share a scale factor per block (e.g. FP16 to INT4), drastically reducing memory. This introduces quantization error, which methods such as GPTQ, AWQ, and the mixed-precision K-quants stored in llama.cpp's GGUF files are designed to minimise.

In plain terms: Like converting a high-resolution photograph to a smaller file size by reducing the number of colours. Done carefully, the result is hard to tell apart from the original.
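To make the idea of quantization error concrete, here is a minimal sketch of symmetric absmax quantization on a single block of weights. It is a simplified stand-in, not the GPTQ, AWQ, or K-quant algorithms, and the weight values are made up for illustration.

```python
def quantize_block(values, bits):
    """Symmetric absmax quantization of one block to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits, 7 for 4 bits
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floating-point weights from the integer codes."""
    return [scale * c for c in codes]


def max_abs_error(values, bits):
    """Largest round-trip error for one block at the given bit width."""
    scale, codes = quantize_block(values, bits)
    restored = dequantize_block(scale, codes)
    return max(abs(a - b) for a, b in zip(values, restored))


block = [0.12, -0.53, 0.98, -0.07, 0.31, -0.88, 0.44, 0.02]  # made-up weights
err8 = max_abs_error(block, bits=8)
err4 = max_abs_error(block, bits=4)
print(f"8-bit max error: {err8:.4f}, 4-bit max error: {err4:.4f}")
```

The round-trip error is bounded by half the scale, so halving the bit width from 8 to 4 makes the worst-case error roughly eighteen times larger for the same block (127 levels versus 7).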

2. Pruning


Removes weights, neurons, or entire layers that contribute least to the model's outputs. Structured pruning removes whole components, while unstructured pruning removes individual weights, which only yields speed or memory gains on hardware and kernels that can exploit sparsity.

In plain terms: Like editing a book by removing sentences that add little to the story. The book gets shorter, but cutting the wrong sentences ruins the plot.

3. Distillation


Trains a smaller student model to mimic the output distribution and confidence of a larger teacher model. It produces a compact model that punches above its weight but requires significant compute and a full training run.

In plain terms: Like having an expert tutor a student intensively, so the student can perform nearly as well as the expert despite knowing less overall.

Why Quantization — and Why GGUF?

I chose quantization for three primary reasons:

  • No retraining required: Distillation requires a full training run, and pruning usually needs fine-tuning to recover lost quality; both demand compute that consumer hardware rarely provides. Quantization is applied post-training, in minutes.
  • Mature, free tooling: The llama.cpp library provides a complete, well-documented pipeline to convert from HuggingFace format to GGUF, quantize, and run inference locally without a paid account.
  • Well-characterised tradeoff: Recent empirical work shows that 4-bit K-quants retain approximately 96–98% of FP16 quality while reducing model size by 3× or more.
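As a hedged sketch of the conversion pipeline mentioned above, the helper below only builds the command lines; it does not run them. The script `convert_hf_to_gguf.py` and the `llama-quantize` binary are the names used by recent llama.cpp releases, while the directory layout, output file names, and the helper function itself are assumptions made for illustration.

```python
from pathlib import Path


def conversion_commands(hf_model_dir, out_dir, quant_types, llama_cpp_dir="llama.cpp"):
    """Build (but do not run) the commands for HF -> FP16 GGUF -> quantized GGUF."""
    out_dir = Path(out_dir)
    llama_cpp_dir = Path(llama_cpp_dir)
    f16_path = out_dir / "model-f16.gguf"   # hypothetical output name

    # Step 1: convert the HuggingFace checkpoint to a single FP16 GGUF file.
    commands = [[
        "python", str(llama_cpp_dir / "convert_hf_to_gguf.py"), str(hf_model_dir),
        "--outfile", str(f16_path), "--outtype", "f16",
    ]]

    # Step 2: quantize that FP16 file once per target type.
    for qtype in quant_types:
        commands.append([
            str(llama_cpp_dir / "build" / "bin" / "llama-quantize"),
            str(f16_path), str(out_dir / f"model-{qtype}.gguf"), qtype,
        ])
    return commands


# Hypothetical paths; print the commands rather than executing them.
for cmd in conversion_commands("models/hf", "models/gguf", ["Q8_0", "Q5_K_M", "Q4_K_M"]):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths match the local checkout. Quantizing every variant from the same FP16 file keeps the comparison fair, since no variant inherits error from another.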

GGUF Quantization Levels Selected

  • Q8_0: 8-bit quantization. Near-lossless. Largest of the three compressed variants.
  • Q5_K_M: About 5.7 bits per weight, mixed precision. The practical sweet spot for GPUs in the 8–12 GB VRAM range.
  • Q4_K_M: About 4.9 bits per weight, mixed precision. Maximum compression while remaining within acceptable quality bounds.

The FP16 original serves as the uncompressed baseline against which all three are measured.
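The bit widths above translate directly into approximate file sizes. The sketch below applies them to the 8-billion-parameter example used earlier; it is an estimate under that assumption, and real GGUF files also carry per-block scale factors and metadata, so measured sizes will differ somewhat.

```python
# Nominal bits per weight, taken from the list above.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.0, "Q5_K_M": 5.7, "Q4_K_M": 4.9}


def estimated_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 8e9  # assumed model size, matching the earlier 8-billion-parameter example
sizes = {name: estimated_size_gb(n_params, bpw) for name, bpw in BITS_PER_WEIGHT.items()}

for name, size in sizes.items():
    ratio = sizes["FP16"] / size
    print(f"{name:7s} ~{size:4.1f} GB  ({ratio:.2f}x smaller than FP16)")
```

On these nominal figures, Q4_K_M comes out a little over three times smaller than FP16, which is consistent with the 3× reduction cited above.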


Methods

Overview

This project is structured as a controlled local experiment. I run the same model at four precision levels — FP16, Q8_0, Q5_K_M, and Q4_K_M — on identical hardware under identical conditions, and measure three dimensions: inference speed (tokens per second), memory consumption (VRAM in GB), and output quality (manually scored on a fixed prompt set). The results are compiled into a structured data file and visualised as static charts embedded in a portfolio page.
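One way the three measurements could be recorded is sketched below. The field names, the `generate` callback, and the CSV layout are all hypothetical choices made for illustration; they are not the project's actual data format, and the stand-in generator only sleeps rather than running a model.

```python
import csv
import io
import time
from dataclasses import asdict, dataclass, fields


@dataclass
class BenchmarkRow:
    variant: str              # "FP16", "Q8_0", "Q5_K_M" or "Q4_K_M"
    tokens_per_second: float  # inference speed
    vram_gb: float            # memory consumption
    quality_score: float      # mean manual score over the fixed prompt set


def tokens_per_second(generate, prompt):
    """Time one call to generate(prompt), which must return the number of tokens produced."""
    start = time.perf_counter()
    n_tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed


def rows_to_csv(rows):
    """Serialise benchmark rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(BenchmarkRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


def fake_generate(prompt):
    """Stand-in for a real model call: pretends to emit five tokens."""
    time.sleep(0.01)
    return 5


print(f"stand-in throughput: {tokens_per_second(fake_generate, 'hello'):.0f} tokens/s")
```

Keeping one row per precision level, with the same columns for each, makes the later charting step a matter of reading a single file.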

No cloud compute is used at any stage. No external APIs are called during inference. The entire pipeline — from model download to chart generation — runs on a single local machine.


Implementation Methodology

Phase 0: Environment Preparation

Before any model is touched, the compute environment must be correctly configured. This phase establishes the Linux build environment inside Windows using WSL2 (Windows Subsystem for Linux 2: a lightweight virtual machine that runs a real Linux kernel inside Windows, with near-native performance and GPU access provided through the Windows driver). It then installs the CUDA (Compute Unified Device Architecture: NVIDIA's parallel computing platform that allows software to use the GPU for general computation) toolkit using the WSL-specific package. That package critically excludes the Linux GPU driver, avoiding conflict with the Windows display driver. Finally, llama.cpp is compiled from source with CUDA acceleration enabled.

The Python environment is configured separately with the libraries needed for model download (huggingface_hub) and chart generation (matplotlib, pandas).

Key risk managed in this phase: Installing the wrong CUDA package (the standard Linux Ubuntu package rather than the WSL-Ubuntu package) would attempt to install a Linux GPU driver that conflicts with the Windows driver already managing the GPU. The WSL-Ubuntu package is driver-free by design. This is verified by running nvidia-smi inside WSL2 after installation: if the GPU is listed, the Windows driver is correctly exposed to WSL2. Running nvcc --version separately confirms that the toolkit itself is installed.
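The checks described in this phase could be automated with a short script along these lines. It is a minimal sketch: the helper names are hypothetical, and a successful `nvidia-smi` exit code is treated as a proxy for "the GPU is visible", which is an assumption rather than a full validation of the CUDA toolchain.

```python
import importlib.util
import shutil
import subprocess


def gpu_visible():
    """Return True if nvidia-smi is on PATH and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run([exe], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def missing_packages(names=("huggingface_hub", "matplotlib", "pandas")):
    """Return the listed packages that cannot be found in this Python environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("GPU visible:", gpu_visible())
    print("Missing Python packages:", missing_packages() or "none")
```

Running this inside WSL2 before downloading any model gives an early, cheap signal that the driver and Python dependencies are in place.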




© Dr. Balaji Ramanathan
