Running powerful language models on your own hardware
Local LLM deployment has democratized access to capable AI. What once required expensive API calls to OpenAI or Anthropic can now run on consumer hardware. This guide covers the tools, techniques, and trade-offs involved in running language models locally—from lightweight laptops to beefy workstations.
The economics are compelling: after the upfront hardware investment, the marginal cost of local inference is little more than electricity. No per-token charges, no rate limits, no data leaving your machine. For high-volume applications, personal use, or privacy-sensitive workloads, local deployment can make economic sense.
Quantization reduces model memory footprint by using lower-precision number formats. Instead of 32-bit floats (4 bytes per parameter), quantization can use 8-bit integers (1 byte) or even 4-bit (0.5 bytes).
| Format | Bits/Param | 7B Model Size | 70B Model Size | Quality Loss |
|---|---|---|---|---|
| FP16 (baseline) | 16 bits | 14 GB | 140 GB | None |
| INT8 | 8 bits | 7 GB | 70 GB | ~2% |
| Q4_K | ~4.5 bits | ~4 GB | ~40 GB | ~5% |
| Q5_K | ~5.5 bits | ~4.9 GB | ~48 GB | ~2% |
| Q8_0 | 8 bits | ~7 GB | ~70 GB | ~1% |
Q4_K and Q5_K are GGUF "k-quant" formats from llama.cpp: weights are quantized block-wise with higher-precision scales, and the _S/_M/_L suffixes control which tensors are kept at higher bit widths. The K refers to this k-quant scheme, not to tuning for a specific model architecture.
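To make the size arithmetic concrete, here is a tiny sketch (plain Python, no dependencies) that reproduces the table's weights-only sizes from a parameter count and an approximate effective bits-per-weight figure; the bit widths are rough averages, not exact on-disk values.

```python
# Weights-only size estimate: parameters x bits per weight / 8 -> bytes
FORMATS = {
    "FP16": 16.0,
    "INT8": 8.0,
    "Q4_K": 4.5,  # approximate effective bits; varies by tensor
    "Q5_K": 5.5,
    "Q8_0": 8.0,  # plus a small per-block scale overhead
}

def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8  # the 1e9 params and 1e9 bytes cancel

for name, bits in FORMATS.items():
    print(f"{name:>4}: 7B ≈ {model_size_gb(7, bits):4.1f} GB, "
          f"70B ≈ {model_size_gb(70, bits):5.1f} GB")
```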
Beyond the model weights, budget memory for the KV cache and activations during inference:
VRAM breakdown for a 7B model at Q4_K:
Model weights: 3.8 GB (Q4_K)
KV cache (full context): 1.5 GB (scales with context length and batch size)
Activations: ~0.5 GB
Framework overhead: ~0.5 GB
─────────────────────────────
Total: ~6.3 GB
For comparison:
FP16 would need: ~14 GB (weights alone)
With KV cache: ~16+ GB
Ollama is the easiest way to get started with local LLMs. It handles model downloading, quantization selection, and serving with a single command:
# Install (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Download and run a model
ollama run llama3.1:8b
# Or with specific quantization
ollama run llama3.1:70b-instruct-q4_K_M
Ollama provides an OpenAI-compatible API, making it easy to switch between local and cloud models:
# OpenAI SDK usage with Ollama
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed",  # Ollama ignores the key, but the SDK requires one
)

response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
Ollama supports a wide model library including Llama 3.1, Mistral, Phi-3, Gemma 2, and many specialized models. Pulling a model without a tag downloads a default mid-range quantization (typically a 4-bit variant); request a specific tag if you want something else.
LM Studio provides a polished GUI alongside solid local serving: a built-in chat interface, model discovery and download (GGUF builds from Hugging Face), and a local server with an OpenAI-compatible API. It is ideal for users who want a GUI experience while still having API access for development.
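Because LM Studio's local server also speaks the OpenAI wire format, the same SDK code shown for Ollama works with only the base URL changed. A minimal sketch, assuming the server is enabled in the app and listening on its usual default of localhost:1234 (the Server tab shows the actual port and the loaded model's identifier):

```python
from openai import OpenAI

# Point the OpenAI SDK at LM Studio's local server instead of Ollama
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",  # LM Studio shows the exact model id in its Server tab
    messages=[{"role": "user", "content": "Summarize quantization in one sentence."}],
)
print(response.choices[0].message.content)
```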
llama.cpp is the foundational technology that powers much of local LLM inference. It's a C/C++ implementation optimized for efficient inference on CPU and GPU:
# Build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && mkdir build && cd build && cmake .. && make

# Convert a Hugging Face model to GGUF
# (script name in recent releases; older checkouts ship convert.py instead)
python3 ../convert_hf_to_gguf.py /path/to/llama/model --outfile model.gguf

# Quantize to Q4_K_M (older builds name these binaries quantize and main)
./bin/llama-quantize model.gguf model-q4_k.gguf Q4_K_M

# Run inference
./bin/llama-cli -m model-q4_k.gguf -n 512 -p "Hello, how are you?"
llama.cpp provides the most control and is what powers Ollama and LM Studio under the hood. For production deployments, llama.cpp with proper batching and parallelization can match or exceed more user-friendly tools.
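If you would rather drive llama.cpp from Python than shell out to the CLI, the llama-cpp-python bindings wrap the same engine. A rough sketch, with the model path and generation settings as placeholders:

```python
# pip install llama-cpp-python  (build with GPU support via CMAKE_ARGS if needed)
from llama_cpp import Llama

llm = Llama(
    model_path="model-q4_k.gguf",  # GGUF file produced by the quantize step above
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU; use 0 for CPU-only
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```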
| Setup | Tokens/sec | Context Length | Best For |
|---|---|---|---|
| Llama 3.1 8B (Q4, RTX 3080) | ~45 tok/s | 8K | Quick tasks, coding |
| Llama 3.1 70B (Q4, RTX 4090 x2) | ~25 tok/s | 8K | High quality, complex reasoning |
| Mistral 7B (Q4, MacBook M3 Pro) | ~35 tok/s | 8K | Portability, Apple ecosystem |
| Llama 3.1 8B (CPU only, 32GB RAM) | ~8 tok/s | 4K | Low budget, no GPU |
Token generation speed depends heavily on GPU (or lack thereof), model size, quantization level, and batch configuration.
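Published numbers rarely match your exact setup, so it is worth measuring. The sketch below times one generation against any OpenAI-compatible local endpoint (Ollama's shown) and assumes the server fills in the usage field; it also folds prompt processing into the average, so treat the result as a rough figure.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Write a 200-word summary of quantization."}],
    max_tokens=300,
)
elapsed = time.perf_counter() - start

completion_tokens = resp.usage.completion_tokens  # reported by the server
print(f"{completion_tokens} tokens in {elapsed:.1f}s "
      f"≈ {completion_tokens / elapsed:.1f} tok/s (includes prompt processing)")
```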
Estimating VRAM requirements before purchase:
Formula:
VRAM_needed ≈ (model_parameters × quantization_bits) / 8
            + 2 × n_layers × n_kv_heads × head_dim × bytes_per_element × context_length
For a 70B model at Q4_K_M (~4.7 bits effective):
Weights: 70B × 4.7 / 8 ≈ 41 GB
KV cache (8K ctx, FP16; Llama 3.1 70B has 80 layers, 8 KV heads, head dim 128):
  2 × 80 × 8 × 128 × 2 × 8192 ≈ 2.7 GB
Minimum VRAM: ~44 GB before activations and framework overhead
Recommended: 2 × 24 GB (48 GB total) or a single 48-80 GB card for headroom
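The same estimate in code, as a rough sketch; the layer count, KV-head count, and head dimension below are Llama 3.1 70B's published values and should be swapped for whatever model you are sizing:

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float,
                     n_layers: int, n_kv_heads: int, head_dim: int,
                     ctx_len: int, kv_bytes: int = 2) -> float:
    """Rough VRAM estimate: quantized weights + FP16 KV cache (batch size 1)."""
    weights = params_b * 1e9 * bits_per_weight / 8
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * ctx_len  # K and V
    return (weights + kv_cache) / 1e9

# Llama 3.1 70B at Q4_K_M (~4.7 effective bits), 8K context
print(f"{estimate_vram_gb(70, 4.7, n_layers=80, n_kv_heads=8, head_dim=128, ctx_len=8192):.1f} GB")
# -> roughly 44 GB before activations and framework overhead
```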
| Model | Q4_K_M VRAM | Minimum GPU | Recommended GPU |
|---|---|---|---|
| 1B (Llama 3.2 1B) | ~700 MB | Integrated GPU | GTX 1060 |
| 7-8B (Mistral 7B, Llama 3.1 8B) | ~4-5 GB | GTX 1080, RTX 3060 | RTX 3080, RTX 4070 |
| 13B | ~8 GB | RTX 3080, RTX 4070 | RTX 3090, RTX 4080 |
| 34B (Yi-34B, Code Llama 34B) | ~20 GB | RTX 4090, A6000 | A100 40GB |
| 70B (Llama 3.1 70B) | ~40 GB | 2 × 24 GB (RTX 3090/4090) | A100 80GB |
When does local make economic sense?
Scenario: 1 million tokens/day (~500K input + ~500K output)
API costs (OpenAI list prices at the time of writing):
GPT-4o-mini ($0.15/1M input, $0.60/1M output):
  500K × $0.15/1M + 500K × $0.60/1M ≈ $0.38/day ≈ $11/month
GPT-4o ($2.50/1M input, $10.00/1M output):
  500K × $2.50/1M + 500K × $10.00/1M ≈ $6.25/day ≈ $188/month
Hardware amortization (3-year, RTX 4080 at $1,200):
RTX 4080 runs Llama 3.1 8B at ~40 tok/s
1M tokens / 40 tok/s = 25,000 seconds ≈ 7 hours/day of generation
Electricity: 210 hours/month × 0.4 kW × $0.12/kWh ≈ $10/month
Amortized hardware: $1,200 / 36 months ≈ $33/month
Total: ~$43/month
Break-even vs. GPT-4o-class pricing: roughly 200K tokens/day. Against GPT-4o-mini, per-token cost alone rarely pays for the hardware; there the case rests on privacy, rate limits, and the ability to run larger local models.
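A small calculator makes it easy to rerun this comparison with your own electricity rate, hardware price, and whatever the current API list prices are (the prices hard-coded below are the assumptions used above, not guaranteed to stay current):

```python
def local_monthly_cost(tokens_per_day: float, tok_per_s: float,
                       hw_price: float, hw_life_months: int = 36,
                       gpu_kw: float = 0.4, kwh_price: float = 0.12) -> float:
    hours_per_day = tokens_per_day / tok_per_s / 3600
    electricity = hours_per_day * 30 * gpu_kw * kwh_price
    return hw_price / hw_life_months + electricity

def api_monthly_cost(tokens_per_day: float, in_price_per_m: float,
                     out_price_per_m: float, output_share: float = 0.5) -> float:
    daily = (tokens_per_day * (1 - output_share) * in_price_per_m
             + tokens_per_day * output_share * out_price_per_m) / 1e6
    return daily * 30

tokens = 1_000_000
print(f"Local (RTX 4080, 40 tok/s): ${local_monthly_cost(tokens, 40, 1200):.0f}/month")
print(f"GPT-4o-mini:                ${api_monthly_cost(tokens, 0.15, 0.60):.2f}/month")
print(f"GPT-4o:                     ${api_monthly_cost(tokens, 2.50, 10.00):.0f}/month")
```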
At higher volumes or with larger models, local becomes increasingly cost-effective. The math improves further if you already have suitable GPU hardware.
For development and production serving, configure your local server properly:
# Ollama server configuration
# Environment variables
OLLAMA_HOST=0.0.0.0:11434
OLLAMA_NUM_PARALLEL=4
OLLAMA_MAX_LOADED_MODELS=2
OLLAMA_GPU_OVERHEAD=0
# Common server flags for llama.cpp (the binary is named llama-server in recent builds)
./server -m model-q4_k.gguf \
  -c 4096 \
  -tb 32 \
  -ngl 99 \
  -t 8 \
  --mlock
# -c      context window
# -tb     threads used for prompt/batch processing
# -ngl    number of layers offloaded to GPU (99 = all)
# -t      threads used for generation
# --mlock lock the model in memory to avoid swapping
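Once the server is up, you can sanity-check it from Python. This sketch uses the server's native /completion endpoint (llama.cpp also exposes an OpenAI-compatible /v1/chat/completions route) and assumes the default port of 8080:

```python
import json
import urllib.request

# llama.cpp's server listens on port 8080 by default (override with --port)
payload = {"prompt": "Hello, how are you?", "n_predict": 64}
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["content"])
```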
Local LLM deployment has matured significantly. Tools like Ollama and LM Studio make it accessible to anyone, while llama.cpp provides the foundation for sophisticated production deployments. The choice between local and API depends on volume, privacy requirements, and willingness to manage infrastructure.
For most users, starting with Ollama is the right move—easy installation, good defaults, and OpenAI-compatible API for seamless integration. As needs grow, exploring quantization levels, hardware upgrades, or direct llama.cpp deployment becomes worthwhile.