When to fine-tune, when to use RAG, and how LoRA/QLoRA make it practical
Fine-tuning has become an essential technique for customizing large language models to specific domains, tasks, or behaviors. While prompt engineering and retrieval-augmented generation can accomplish much, certain use cases fundamentally require modifying the model's weights themselves.
This guide covers the key decisions around when fine-tuning is appropriate, the technical approaches that make it computationally tractable, and practical guidance for dataset preparation and training.
One of the most common questions in applied AI is: should I fine-tune or use RAG? The answer depends on what you're trying to achieve. As a rule of thumb, RAG is the better fit for injecting fresh or proprietary knowledge the model doesn't have, while fine-tuning is the better fit for changing style, output format, and task-specific behavior; the two are complementary and often combined.
Naive fine-tuning updates all model parameters, which is prohibitively expensive for large models. A full fine-tune of GPT-3 (175B parameters) with the Adam optimizer needs roughly 12 bytes per parameter for optimizer state alone (two fp32 statistics plus a master copy of the weights), on the order of 2 TB of GPU memory before considering the model weights and gradients themselves.
| Model Size | Full Fine-Tune GPU Memory | LoRA GPU Memory | Memory Reduction |
|---|---|---|---|
| 7B parameters | ~140 GB | ~8 GB | ~94% |
| 13B parameters | ~260 GB | ~16 GB | ~94% |
| 70B parameters | ~1400 GB | ~48 GB | ~97% |
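The rough arithmetic behind figures like these can be sketched in a few lines. The bytes-per-parameter assumptions below (bf16 weights and gradients plus fp32 Adam state for full fine-tuning; a frozen bf16 base plus a tiny trained slice for LoRA) are simplifications that ignore activations and framework overhead, so they land in the same ballpark as the table rather than matching it exactly:

```python
def full_finetune_gb(params_b: float) -> float:
    """Approximate GPU memory for full fine-tuning with Adam.

    Assumes bf16 weights (2 B) + bf16 gradients (2 B) +
    fp32 optimizer statistics and master weights (12 B) per parameter.
    """
    bytes_per_param = 2 + 2 + 12
    return params_b * 1e9 * bytes_per_param / 1e9

def lora_gb(params_b: float, trainable_frac: float = 0.005) -> float:
    """Approximate GPU memory for LoRA: frozen bf16 base + small trained slice."""
    base = params_b * 1e9 * 2                        # frozen bf16 weights
    adapters = params_b * 1e9 * trainable_frac * 16  # adapters + grads + optimizer
    return (base + adapters) / 1e9

print(f"7B full fine-tune: ~{full_finetune_gb(7):.0f} GB")  # ~112 GB
print(f"7B LoRA:           ~{lora_gb(7):.0f} GB")           # ~15 GB
```

Published LoRA memory numbers vary widely because they assume different base-model precisions (an 8-bit or 4-bit base brings the LoRA figure down further) and different activation footprints.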
Beyond memory, full fine-tuning risks catastrophic forgetting: updating all weights can cause the model to lose capabilities it previously had. Fine-tune on customer support data, for example, and the model may lose some of its code-writing ability.
LoRA (Hu et al., 2021) is the most widely used parameter-efficient fine-tuning technique. The key insight: the weight updates during fine-tuning have a low intrinsic rank, so they can be well-approximated by the product of two much smaller matrices.
Instead of updating the full weight matrix W, LoRA keeps W frozen and adds a trainable low-rank update, scaled by α/r:

Original: y = Wx
LoRA: y = Wx + (α/r)·BAx

Where:
W: frozen pre-trained weight (d × d)
A: trainable matrix (r × d), initialized with a random Gaussian
B: trainable matrix (d × r), initialized to zeros
r: rank, typically 4-64; α: a scaling hyperparameter

Because B starts at zero, the update BAx is zero at initialization, so training begins from exactly the pre-trained model's behavior.
The rank r is typically much smaller than d. For a 4096 × 4096 weight matrix with rank 8, you go from 16M parameters to just 65,536—a 99.6% reduction.
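The parameter arithmetic is easy to check directly; a minimal sketch using the dimensions from the text:

```python
d, r = 4096, 8

full_params = d * d          # updating the full weight matrix
lora_params = r * d + d * r  # A (r x d) plus B (d x r)

reduction = 1 - lora_params / full_params
print(full_params)         # 16777216 (~16.8M)
print(lora_params)         # 65536
print(f"{reduction:.1%}")  # 99.6%
```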
LoRA can be applied to different attention matrices; the original paper focused on the query (Q) and value (V) projections. The key hyperparameters:
| Hyperparameter | Typical Values | Effect |
|---|---|---|
| Rank (r) | 4, 8, 16, 32, 64 | Higher = more capacity, more memory, risk of overfitting |
| Alpha (α) | r or 2×r | Scales the LoRA contribution; larger = stronger adaptation |
| Dropout | 0, 0.05, 0.1 | Regularization; helps with small datasets |
| Target modules | q_proj, v_proj, etc. | Which attention layers to adapt |
QLoRA (Dettmers et al., 2023) combines quantization with LoRA to enable fine-tuning of 65B+ models on a single 48GB GPU. The technique has democratized access to large model customization.
QLoRA works by:
- Quantizing the frozen base model to 4-bit NormalFloat (NF4), a data type matched to normally distributed weights
- Double quantization, which quantizes the quantization constants themselves for additional savings
- Paged optimizers, which move optimizer state to CPU memory during memory spikes
- Backpropagating gradients through the frozen 4-bit weights into 16-bit LoRA adapters
Memory breakdown for QLoRA 65B fine-tune:
- Base model (4-bit): ~35 GB
- LoRA adapters: ~0.4 GB
- Optimizer states (LoRA parameters only, paged to CPU when needed): ~2 GB
- Activations: ~16 GB
- Total GPU memory: ~48 GB
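The base-model line in this breakdown is simple arithmetic. A sketch, assuming NF4's default block size of 64 and a 32-bit quantization constant per block (double quantization shrinks the constants further, which is why the text's figure is a bit lower):

```python
def model_size_4bit_gb(params_b: float, block_size: int = 64) -> float:
    """Approximate size of a 4-bit (NF4) quantized model.

    Each weight takes 4 bits; each block of `block_size` weights also
    stores a quantization constant (~32 bits before double quantization).
    """
    params = params_b * 1e9
    weight_bits = params * 4
    constant_bits = (params / block_size) * 32
    return (weight_bits + constant_bits) / 8 / 1e9

print(f"65B at 4-bit: ~{model_size_4bit_gb(65):.1f} GB")  # ~36.6 GB
```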
The paper demonstrated that QLoRA matches full 16-bit fine-tuning quality on most benchmarks while using a small fraction of the memory. On the Vicuna benchmark, QLoRA-tuned Guanaco models achieved 99.3% of ChatGPT's performance while being trained on a single 48 GB GPU.
AdaLoRA dynamically adjusts the rank of different layers based on their importance, allocating more parameters to critical attention heads. LoRA+ introduces different learning rates for the A and B matrices, improving convergence.
Adapter-based methods such as bottleneck adapters and Compacter insert small learnable modules (including hypercomplex layers) between existing weights, achieving similar parameter efficiency with different trade-offs.
Instead of modifying attention weights, prefix tuning prepends trainable continuous vectors to the keys and values at each attention layer. This achieves competitive performance but is less memory-efficient than LoRA.
The quality of your fine-tuning data is often more important than which technique you use. Guidelines:
LoRA typically needs 1,000-10,000 high-quality examples. Larger datasets can improve generalization but also increase overfitting risk if quality is uneven. For specialized tasks, 1,000 carefully curated examples often outperform 100,000 automatically generated ones.
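A minimal curation pass along these lines, using exact-duplicate removal plus length bounds (the field names match the instruction format below; the thresholds are illustrative, not canonical):

```python
import hashlib

def curate(examples, min_chars=20, max_chars=8000):
    """Drop exact duplicates and implausibly short or long examples."""
    seen, kept = set(), []
    for ex in examples:
        text = ex.get("instruction", "") + ex.get("input", "") + ex.get("output", "")
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:
            continue  # exact duplicate
        if not (min_chars <= len(text) <= max_chars):
            continue  # too short or too long to be useful
        seen.add(digest)
        kept.append(ex)
    return kept
```

In practice, near-duplicate detection (e.g. MinHash) and manual review of a random sample matter more than any automated filter.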
Instruction format (recommended for most use cases):

```json
{
  "instruction": "Summarize the following customer feedback",
  "input": "The product arrived damaged and customer service...",
  "output": "Negative review citing damaged goods and poor support."
}
```
Chat format (for conversational models):

```json
{
  "messages": [
    {"role": "user", "content": "Summarize this: ..."},
    {"role": "assistant", "content": "Key points: ..."}
  ]
}
```
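The two formats are interconvertible. A sketch of mapping an instruction-format record into chat messages (the field names follow the examples above):

```python
def to_chat(record):
    """Convert an instruction-format example into the chat-messages format."""
    user_content = record["instruction"]
    if record.get("input"):
        user_content += "\n\n" + record["input"]
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": record["output"]},
        ]
    }

example = {
    "instruction": "Summarize the following customer feedback",
    "input": "The product arrived damaged and customer service...",
    "output": "Negative review citing damaged goods and poor support.",
}
chat = to_chat(example)
```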
Understanding training costs helps with planning and budgeting:
| Configuration | GPU Hours | Cloud Cost (~$3/GPU-hr) | Quality vs Base |
|---|---|---|---|
| Llama 7B + LoRA (r=16) | ~4 hours (A100) | ~$12 | +15-25% on domain tasks |
| Llama 13B + LoRA (r=16) | ~8 hours (A100) | ~$24 | +18-28% on domain tasks |
| Llama 70B + QLoRA (r=64) | ~48 hours (A100) | ~$144 | +20-35% on domain tasks |
These are approximate figures for typical training runs. Actual costs depend on dataset size, number of epochs, and hyperparameters.
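For planning, a back-of-envelope estimate is often enough: GPU hours scale roughly linearly with dataset tokens and epochs. The throughput default below is an assumption for a single A100 running LoRA training, not a benchmark:

```python
def training_cost(dataset_tokens: int, epochs: int,
                  tokens_per_gpu_hour: float = 1e7,
                  usd_per_gpu_hour: float = 3.0):
    """Rough single-GPU cost estimate for a LoRA fine-tune."""
    gpu_hours = dataset_tokens * epochs / tokens_per_gpu_hour
    return gpu_hours, gpu_hours * usd_per_gpu_hour

# e.g. 5,000 examples x ~1,000 tokens each x 3 epochs
hours, dollars = training_cost(5_000 * 1_000, epochs=3)
print(hours, dollars)  # 1.5 GPU-hours, $4.50
```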
A typical LoRA/QLoRA training setup using the Hugging Face PEFT library:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model quantized to 4-bit (QLoRA-style NF4)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8b",
    quantization_config=bnb_config,
)

# Prepare for k-bit training (gradient checkpointing, norm casting)
model = prepare_model_for_kbit_training(model)

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints the trainable parameter count: a fraction of a percent of the 8B total
```
After training, LoRA adapters can be merged with the base model for deployment:
```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model in full precision (merging requires unquantized weights)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8b", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base_model, "./lora-checkpoint")
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-llama-3-8b")
```
The merged model behaves identically to the fine-tuned version but with standard inference—no special handling for LoRA parameters required.
Fine-tuning has become accessible to organizations of all sizes thanks to LoRA and QLoRA. The key decisions—rank selection, target layers, learning rates—are now well-understood, and the open-source tooling has matured significantly.
The most important factors remain the fundamentals: data quality, task clarity, and realistic expectations. Fine-tuning won't make a general model into an expert in a narrow domain without sufficient domain-specific examples. But for changing output styles, adapting to specific formats, and improving performance on domain vocabulary, well-executed fine-tuning delivers substantial improvements.