When to fine-tune, when to use RAG, and how LoRA/QLoRA make it practical
Fine-tuning has become an essential technique for customizing large language models to specific domains, tasks, or behaviors. While prompt engineering and retrieval-augmented generation can accomplish much, certain use cases fundamentally require modifying the model's weights themselves.
This guide covers the key decisions around when fine-tuning is appropriate, the technical approaches that make it computationally tractable, and practical guidance for dataset preparation and training.
One of the most common questions in applied AI is: should I fine-tune or use RAG? The answer depends on what you're trying to achieve. As a rule of thumb, RAG is the better fit for injecting fresh or proprietary knowledge the model doesn't have, while fine-tuning is the better fit for changing style, output format, and task-specific behavior; the two are complementary and often combined.
Naive fine-tuning updates all model parameters, which is prohibitively expensive for large models. A full fine-tune of GPT-3 (175B parameters) with the Adam optimizer needs roughly 12 bytes per parameter for optimizer state alone (two fp32 statistics plus a master copy of the weights), on the order of 2 TB of GPU memory before considering the model weights and gradients themselves.
| Model Size | Full Fine-Tune GPU Memory | LoRA GPU Memory | Memory Reduction |
|---|---|---|---|
| 7B parameters | ~140 GB | ~8 GB | ~94% |
| 13B parameters | ~260 GB | ~16 GB | ~94% |
| 70B parameters | ~1400 GB | ~48 GB | ~97% |
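The rough arithmetic behind figures like these can be sketched in a few lines. The bytes-per-parameter assumptions below (bf16 weights and gradients plus fp32 Adam state for full fine-tuning; a frozen bf16 base plus a tiny trained slice for LoRA) are simplifications that ignore activations and framework overhead, so they land in the same ballpark as the table rather than matching it exactly:

```python
def full_finetune_gb(params_b: float) -> float:
    """Approximate GPU memory for full fine-tuning with Adam.

    Assumes bf16 weights (2 B) + bf16 gradients (2 B) +
    fp32 optimizer statistics and master weights (12 B) per parameter.
    """
    bytes_per_param = 2 + 2 + 12
    return params_b * 1e9 * bytes_per_param / 1e9

def lora_gb(params_b: float, trainable_frac: float = 0.005) -> float:
    """Approximate GPU memory for LoRA: frozen bf16 base + small trained slice."""
    base = params_b * 1e9 * 2                        # frozen bf16 weights
    adapters = params_b * 1e9 * trainable_frac * 16  # adapters + grads + optimizer
    return (base + adapters) / 1e9

print(f"7B full fine-tune: ~{full_finetune_gb(7):.0f} GB")  # ~112 GB
print(f"7B LoRA:           ~{lora_gb(7):.0f} GB")           # ~15 GB
```

Published LoRA memory numbers vary widely because they assume different base-model precisions (an 8-bit or 4-bit base brings the LoRA figure down further) and different activation footprints.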
Beyond memory, full fine-tuning risks catastrophic forgetting: updating all weights can cause the model to lose capabilities it previously had. Fine-tune on customer support data, for example, and the model may lose some of its code-writing ability.
LoRA (Hu et al., 2021) is the most widely used parameter-efficient fine-tuning technique. The key insight: the weight updates during fine-tuning have a low intrinsic rank, so they can be well-approximated by the product of two much smaller matrices.
Instead of updating the full weight matrix W, LoRA keeps W frozen and adds a trainable low-rank update, scaled by α/r:

Original: y = Wx
LoRA: y = Wx + (α/r)·BAx

Where:
W: frozen pre-trained weight (d × d)
A: trainable matrix (r × d), initialized with a random Gaussian
B: trainable matrix (d × r), initialized to zeros
r: rank, typically 4-64; α: a scaling hyperparameter

Because B starts at zero, the update BAx is zero at initialization, so training begins from exactly the pre-trained model's behavior.
The rank r is typically much smaller than d. For a 4096 × 4096 weight matrix with rank 8, you go from 16M parameters to just 65,536—a 99.6% reduction.
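The parameter arithmetic is easy to check directly; a minimal sketch using the dimensions from the text:

```python
d, r = 4096, 8

full_params = d * d          # updating the full weight matrix
lora_params = r * d + d * r  # A (r x d) plus B (d x r)

reduction = 1 - lora_params / full_params
print(full_params)         # 16777216 (~16.8M)
print(lora_params)         # 65536
print(f"{reduction:.1%}")  # 99.6%
```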
LoRA can be applied to different attention matrices; the original paper focused on the query (Q) and value (V) projections. The key hyperparameters:
| Hyperparameter | Typical Values | Effect |
|---|---|---|
| Rank (r) | 4, 8, 16, 32, 64 | Higher = more capacity, more memory, risk of overfitting |
| Alpha (α) | r or 2×r | Scales the LoRA contribution; larger = stronger adaptation |
| Dropout | 0, 0.05, 0.1 | Regularization; helps with small datasets |
| Target modules | q_proj, v_proj, etc. | Which attention layers to adapt |
QLoRA (Dettmers et al., 2023) combines quantization with LoRA to enable fine-tuning of 65B+ models on a single 48GB GPU. The technique has democratized access to large model customization.
QLoRA works by:
- Quantizing the frozen base model to 4-bit NormalFloat (NF4), a data type matched to normally distributed weights
- Double quantization, which quantizes the quantization constants themselves for additional savings
- Paged optimizers, which move optimizer state to CPU memory during memory spikes
- Backpropagating gradients through the frozen 4-bit weights into 16-bit LoRA adapters
Memory breakdown for QLoRA 65B fine-tune:
- Base model (4-bit): ~35 GB
- LoRA adapters: ~0.4 GB
- Optimizer states (LoRA parameters only, paged to CPU when needed): ~2 GB
- Activations: ~16 GB
- Total GPU memory: ~48 GB
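The base-model line in this breakdown is simple arithmetic. A sketch, assuming NF4's default block size of 64 and a 32-bit quantization constant per block (double quantization shrinks the constants further, which is why the text's figure is a bit lower):

```python
def model_size_4bit_gb(params_b: float, block_size: int = 64) -> float:
    """Approximate size of a 4-bit (NF4) quantized model.

    Each weight takes 4 bits; each block of `block_size` weights also
    stores a quantization constant (~32 bits before double quantization).
    """
    params = params_b * 1e9
    weight_bits = params * 4
    constant_bits = (params / block_size) * 32
    return (weight_bits + constant_bits) / 8 / 1e9

print(f"65B at 4-bit: ~{model_size_4bit_gb(65):.1f} GB")  # ~36.6 GB
```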
The paper demonstrated that QLoRA matches full 16-bit fine-tuning quality on most benchmarks while using a small fraction of the memory. On the Vicuna benchmark, QLoRA-tuned Guanaco models achieved 99.3% of ChatGPT's performance while being trained on a single 48 GB GPU.
AdaLoRA dynamically adjusts the rank of different layers based on their importance, allocating more parameters to critical attention heads. LoRA+ introduces different learning rates for the A and B matrices, improving convergence.
Adapter-based methods such as bottleneck adapters and Compacter insert small learnable modules (including hypercomplex layers) between existing weights, achieving similar parameter efficiency with different trade-offs.
Instead of modifying attention weights, prefix tuning prepends trainable continuous vectors to the keys and values at each attention layer. This achieves competitive performance but is less memory-efficient than LoRA.
The quality of your fine-tuning data is often more important than which technique you use. Guidelines:
LoRA typically needs 1,000-10,000 high-quality examples. Larger datasets can improve generalization but also increase overfitting risk if quality is uneven. For specialized tasks, 1,000 carefully curated examples often outperform 100,000 automatically generated ones.
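A minimal curation pass along these lines, using exact-duplicate removal plus length bounds (the field names match the instruction format below; the thresholds are illustrative, not canonical):

```python
import hashlib

def curate(examples, min_chars=20, max_chars=8000):
    """Drop exact duplicates and implausibly short or long examples."""
    seen, kept = set(), []
    for ex in examples:
        text = ex.get("instruction", "") + ex.get("input", "") + ex.get("output", "")
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:
            continue  # exact duplicate
        if not (min_chars <= len(text) <= max_chars):
            continue  # too short or too long to be useful
        seen.add(digest)
        kept.append(ex)
    return kept
```

In practice, near-duplicate detection (e.g. MinHash) and manual review of a random sample matter more than any automated filter.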
Instruction format (recommended for most use cases):

```json
{
  "instruction": "Summarize the following customer feedback",
  "input": "The product arrived damaged and customer service...",
  "output": "Negative review citing damaged goods and poor support."
}
```
Chat format (for conversational models):

```json
{
  "messages": [
    {"role": "user", "content": "Summarize this: ..."},
    {"role": "assistant", "content": "Key points: ..."}
  ]
}
```
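The two formats are interconvertible. A sketch of mapping an instruction-format record into chat messages (the field names follow the examples above):

```python
def to_chat(record):
    """Convert an instruction-format example into the chat-messages format."""
    user_content = record["instruction"]
    if record.get("input"):
        user_content += "\n\n" + record["input"]
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": record["output"]},
        ]
    }

example = {
    "instruction": "Summarize the following customer feedback",
    "input": "The product arrived damaged and customer service...",
    "output": "Negative review citing damaged goods and poor support.",
}
chat = to_chat(example)
```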
Understanding training costs helps with planning and budgeting:
| Configuration | GPU Hours | Cloud Cost (~$3/GPU-hr) | Quality vs Base |
|---|---|---|---|
| Llama 7B + LoRA (r=16) | ~4 hours (A100) | ~$12 | +15-25% on domain tasks |
| Llama 13B + LoRA (r=16) | ~8 hours (A100) | ~$24 | +18-28% on domain tasks |
| Llama 70B + QLoRA (r=64) | ~48 hours (A100) | ~$144 | +20-35% on domain tasks |
These are approximate figures for typical training runs. Actual costs depend on dataset size, number of epochs, and hyperparameters.
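For planning, a back-of-envelope estimate is often enough: GPU hours scale roughly linearly with dataset tokens and epochs. The throughput default below is an assumption for a single A100 running LoRA training, not a benchmark:

```python
def training_cost(dataset_tokens: int, epochs: int,
                  tokens_per_gpu_hour: float = 1e7,
                  usd_per_gpu_hour: float = 3.0):
    """Rough single-GPU cost estimate for a LoRA fine-tune."""
    gpu_hours = dataset_tokens * epochs / tokens_per_gpu_hour
    return gpu_hours, gpu_hours * usd_per_gpu_hour

# e.g. 5,000 examples x ~1,000 tokens each x 3 epochs
hours, dollars = training_cost(5_000 * 1_000, epochs=3)
print(hours, dollars)  # 1.5 GPU-hours, $4.50
```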
A typical LoRA/QLoRA training setup using the Hugging Face PEFT library:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model quantized to 4-bit (QLoRA-style NF4)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8b",
    quantization_config=bnb_config,
)

# Prepare for k-bit training (gradient checkpointing, norm casting)
model = prepare_model_for_kbit_training(model)

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints the trainable parameter count: a fraction of a percent of the 8B total
```
After training, LoRA adapters can be merged with the base model for deployment:
```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model in full precision (merging requires unquantized weights)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8b", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base_model, "./lora-checkpoint")
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-llama-3-8b")
```
The merged model behaves identically to the fine-tuned version but with standard inference—no special handling for LoRA parameters required.
Fine-tuning has become accessible to organizations of all sizes thanks to LoRA and QLoRA. The key decisions—rank selection, target layers, learning rates—are now well-understood, and the open-source tooling has matured significantly.
The most important factors remain the fundamentals: data quality, task clarity, and realistic expectations. Fine-tuning won't make a general model into an expert in a narrow domain without sufficient domain-specific examples. But for changing output styles, adapting to specific formats, and improving performance on domain vocabulary, well-executed fine-tuning delivers substantial improvements.