Fine-Tuning LLM Guide

When to fine-tune, when to use RAG, and how LoRA/QLoRA make it practical

Published: January 2026 | Reading Time: 15 minutes | Category: AI & Machine Learning


Fine-tuning has become an essential technique for customizing large language models to specific domains, tasks, or behaviors. While prompt engineering and retrieval-augmented generation can accomplish much, certain use cases fundamentally require modifying the model's weights themselves.

This guide covers the key decisions around when fine-tuning is appropriate, the technical approaches that make it computationally tractable, and practical guidance for dataset preparation and training.

When to Fine-Tune vs. When to Use RAG

One of the most common questions in applied AI is: should I fine-tune or use RAG? The answer depends on what you're trying to achieve.

Fine-Tuning is Right When:

- You need consistent output style, tone, or formatting across responses
- The task depends on domain vocabulary or behavior the base model lacks
- Prompting alone can't reach the reliability you need, even with few-shot examples
- Long few-shot prompts are driving up latency and cost

RAG is Right When:

- Answers depend on current or frequently changing information
- Responses must be grounded in, and attributable to, specific source documents
- The knowledge base is large and updated independently of the model
- You need to add or remove knowledge without retraining

The False Dichotomy: Many production systems use both. A fine-tuned model can provide better reasoning and formatting while RAG provides current, source-specific information. This combination often outperforms either approach alone.

The Problem with Full Fine-Tuning

Naive fine-tuning updates all model parameters, which is prohibitively expensive for large models. A full fine-tune of GPT-3 (175B parameters) requires roughly 3,500 GB of GPU memory once weights, gradients, and optimizer states are all accounted for.

Model Size       Full Fine-Tune GPU Memory   LoRA GPU Memory   Memory Reduction
7B parameters    ~140 GB                     ~8 GB             ~94%
13B parameters   ~260 GB                     ~16 GB            ~94%
70B parameters   ~1400 GB                    ~48 GB            ~97%
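These figures follow from a simple per-parameter budget: full fine-tuning with Adam in mixed precision is commonly estimated at 16-20 bytes per parameter (fp16 weights and gradients plus fp32 master weights and Adam moment buffers), and the full fine-tune column above corresponds to the 20-byte end of that range. A quick sanity check:

BYTES_PER_PARAM = 20  # fp16 weights + fp16 grads + fp32 optimizer state (upper-end estimate)

def full_finetune_gb(params_billions):
    """Rough GPU memory (GB) for full fine-tuning with Adam in mixed precision."""
    # params_billions * 1e9 params * 20 bytes / 1e9 bytes-per-GB
    return params_billions * BYTES_PER_PARAM

for size in (7, 13, 70, 175):
    print(f"{size}B params -> ~{full_finetune_gb(size):.0f} GB")
# 7B -> ~140 GB, 13B -> ~260 GB, 70B -> ~1400 GB, 175B -> ~3500 GB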

Beyond memory, full fine-tuning risks catastrophic forgetting: updating all weights can cause the model to lose capabilities it previously had. If you fine-tune on customer support data, the model might lose its code-writing abilities.

LoRA: Low-Rank Adaptation

LoRA (Hu et al., 2021) is the most widely used parameter-efficient fine-tuning technique. The key insight: the weight updates needed for fine-tuning have low intrinsic rank, so they can be well approximated by the product of two much smaller matrices.

Instead of updating the full weight matrix W, LoRA introduces trainable matrices A and B while keeping W frozen:

Original: y = Wx
LoRA:     y = Wx + (α/r)·BAx

Where:
  W: frozen pre-trained weight (d × d)
  A: trainable matrix (r × d), initialized with a random Gaussian
  B: trainable matrix (d × r), initialized with zeros (so BA = 0 at the start of training)
  r: rank, typically 4-64
  α: scaling factor for the update (see the hyperparameter table below)

The rank r is typically much smaller than d. For a 4096 × 4096 weight matrix with rank 8, you go from 16M parameters to just 65,536—a 99.6% reduction.
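As a concrete illustration, here is a minimal PyTorch sketch of the idea, a simplified stand-in rather than the actual PEFT implementation:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer W plus a trainable low-rank update BA."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # W stays frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # random Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))         # zero init, so BA = 0 at start
        self.scale = alpha / r                                # α/r scaling

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65536, vs ~16.8M in the full 4096 × 4096 matrix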

Where to Apply LoRA

LoRA can be applied to different attention matrices. The original paper focused on the Query (Q) and Value (V) projections; later work, including QLoRA, found that adapting all linear layers (the K and output projections plus the MLP projections) is often needed to match full fine-tuning quality, at a modest increase in trainable parameters.

LoRA Hyperparameters

Hyperparameter    Typical Values           Effect
Rank (r)          4, 8, 16, 32, 64         Higher = more capacity, more memory, greater overfitting risk
Alpha (α)         r or 2×r                 Scales the LoRA contribution; larger = stronger adaptation
Dropout           0, 0.05, 0.1             Regularization; helps with small datasets
Target modules    q_proj, v_proj, etc.     Which projection matrices to adapt

QLoRA: Quantized LoRA for Even Lower Memory

QLoRA (Dettmers et al., 2023) combines quantization with LoRA to enable fine-tuning of 65B+ models on a single 48GB GPU. The technique has democratized access to large model customization.

QLoRA works by:

- Quantizing the frozen base model to 4-bit NormalFloat (NF4), a data type designed for normally distributed weights
- Double quantization: quantizing the quantization constants themselves to save additional memory
- Using paged optimizers that spill optimizer state to CPU RAM during memory spikes
- Backpropagating through the frozen 4-bit weights into LoRA adapters kept in higher precision

Approximate memory breakdown for a QLoRA 65B fine-tune:
  - Base model (4-bit): ~35 GB
  - LoRA adapters: ~0.4 GB
  - Optimizer states (adapters only; paged to CPU under memory pressure)
  - Activations (reduced via gradient checkpointing): ~16 GB
  - Total GPU memory: ~48 GB
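In practice, the 4-bit loading step looks like the following with Hugging Face transformers and bitsandbytes (a minimal sketch; the model identifier reuses the one from the pipeline example later in this guide):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization, as described above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,        # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16  # precision used for dequantized compute
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8b",
    quantization_config=bnb_config,
    device_map="auto",
)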

QLoRA vs LoRA Performance

The paper demonstrated that QLoRA matches full 16-bit fine-tuning quality on most benchmarks at a fraction of the memory cost. On the Vicuna benchmark, QLoRA-tuned Guanaco models achieved 99.3% of ChatGPT's performance while being trained in about a day on a single GPU.

Other PEFT Techniques

AdaLoRA and LoRA+

AdaLoRA dynamically adjusts the rank of different layers during training, allocating more of the parameter budget to the weight matrices that matter most. LoRA+ introduces different learning rates for the A and B matrices, improving convergence.

QAdapter and Compacter

These techniques insert small trainable adapter modules between existing layers (Compacter parameterizes them with compact hypercomplex multiplications), achieving similar parameter efficiency with different trade-offs.

Prefix Tuning

Instead of modifying attention weights, prefix tuning prepends trainable continuous vectors (virtual tokens) to the keys and values at each attention layer. This achieves competitive performance but is less memory-efficient than LoRA.

Dataset Preparation

The quality of your fine-tuning data is often more important than which technique you use. The guidelines below cover dataset size, format, and quality filtering.

Dataset Size

LoRA typically needs 1,000-10,000 high-quality examples. More data can improve coverage, but quality dominates quantity: for specialized tasks, 1,000 carefully curated examples often outperform 100,000 automatically generated ones.

Data Format

Instruction format (recommended for most use cases):
{
  "instruction": "Summarize the following customer feedback",
  "input": "The product arrived damaged and customer service...",
  "output": "Negative review citing damaged goods and poor support."
}

Chat format (for conversational models):
{
  "messages": [
    {"role": "user", "content": "Summarize this: ..."},
    {"role": "assistant", "content": "Key points: ..."}
  ]
}
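Before training, each record is rendered into a single text sequence using a prompt template. A sketch using the widely known Alpaca-style layout (illustrative only; match whatever template or chat format your base model expects):

def render(example: dict) -> str:
    """Render an instruction record into a single training string."""
    # Alpaca-style template; swap in your base model's chat template if it has one
    parts = [f"### Instruction:\n{example['instruction']}"]
    if example.get("input"):
        parts.append(f"### Input:\n{example['input']}")
    parts.append(f"### Response:\n{example['output']}")
    return "\n\n".join(parts)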

Quality Filtering
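Common filters include exact-duplicate removal, length bounds, and format validation; near-duplicate detection and manual spot-checks catch further issues. A minimal sketch of the first two, assuming a JSONL file of instruction records (field names are illustrative):

import json

def load_filtered(path, min_len=20, max_len=4000):
    """Load instruction records, dropping duplicates and out-of-range lengths."""
    seen, kept = set(), []
    with open(path) as f:
        for line in f:
            ex = json.loads(line)
            text = ex.get("instruction", "") + ex.get("output", "")
            if not (min_len <= len(text) <= max_len):
                continue  # drop trivially short or runaway examples
            if text in seen:
                continue  # exact-duplicate removal
            seen.add(text)
            kept.append(ex)
    return kept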

Training Compute Costs

Understanding training costs helps with planning and budgeting:

Configuration              GPU Hours           Cloud Cost (~$3/GPU-hr)   Quality vs Base
Llama 7B + LoRA (r=16)     ~4 hours (A100)     ~$12                      +15-25% on domain tasks
Llama 13B + LoRA (r=16)    ~8 hours (A100)     ~$24                      +18-28% on domain tasks
Llama 70B + QLoRA (r=64)   ~48 hours (A100)    ~$144                     +20-35% on domain tasks

These are approximate figures for typical training runs. Actual costs depend on dataset size, number of epochs, and hyperparameters.

Practical Training Pipeline

A typical LoRA/QLoRA training setup using the Hugging Face PEFT library:

import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load quantized base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8b",
    load_in_4bit=True,
    torch_dtype=torch.bfloat16
)

# Prepare for k-bit training (casts norm layers, enables input gradients)
model = prepare_model_for_kbit_training(model)

# Configure LoRA
lora_config = LoraConfig(
    r=16,                      # rank of the update matrices
    lora_alpha=32,             # 2×r scaling, per the table above
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA; only the adapter matrices are trainable
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints the trainable parameter count, a small fraction of the total
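To actually run training on top of this configuration, a minimal loop with transformers.Trainer might look like the following (train_dataset and tokenizer are assumed to be prepared elsewhere; the hyperparameters are illustrative starting points, not recommendations):

from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

args = TrainingArguments(
    output_dir="./lora-checkpoint",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size 16
    learning_rate=2e-4,              # common starting point for LoRA
    num_train_epochs=3,
    bf16=True,
    logging_steps=10,
)

trainer = Trainer(
    model=model,                     # the PEFT model configured above
    args=args,
    train_dataset=train_dataset,     # tokenized dataset, prepared separately
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()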

Merging and Deployment

After training, LoRA adapters can be merged with the base model for deployment:

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model in full precision (merging requires unquantized weights)
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8b")
model = PeftModel.from_pretrained(base_model, "./lora-checkpoint")
merged_model = model.merge_and_unload()   # folds BA into W
merged_model.save_pretrained("./merged-llama-3-8b")

The merged model behaves identically to the fine-tuned version but with standard inference—no special handling for LoRA parameters required.

Evaluation Best Practices

Evaluate on a held-out set that mirrors production inputs rather than relying on training loss alone. Compare the fine-tuned model directly against the base model on the target task, and spot-check general capabilities to catch catastrophic forgetting. For style and formatting tasks, human review usually catches issues that automatic metrics miss.
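One simple quantitative check is held-out loss (and perplexity) against the base model, sketched below under the assumption of a dataloader yielding tokenized batches; eval_loader is hypothetical:

import torch

@torch.no_grad()
def held_out_loss(model, dataloader, device="cuda"):
    """Mean causal-LM loss on a held-out set; exp(loss) gives perplexity."""
    model.eval()
    total, batches = 0.0, 0
    for batch in dataloader:
        input_ids = batch["input_ids"].to(device)
        out = model(input_ids=input_ids, labels=input_ids)  # HF shifts labels internally
        total += out.loss.item()
        batches += 1
    return total / batches

# Run the same check on the base model and the fine-tuned model:
# loss = held_out_loss(model, eval_loader)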

Conclusion

Fine-tuning has become accessible to organizations of all sizes thanks to LoRA and QLoRA. The key decisions—rank selection, target layers, learning rates—are now well-understood, and the open-source tooling has matured significantly.

The most important factors remain the fundamentals: data quality, task clarity, and realistic expectations. Fine-tuning won't make a general model into an expert in a narrow domain without sufficient domain-specific examples. But for changing output styles, adapting to specific formats, and improving performance on domain vocabulary, well-executed fine-tuning delivers substantial improvements.