The Hallucination Problem in Specialized Domains
General-purpose LLMs like GPT-4 know a lot about the world but almost nothing about your specific water network topology, your equipment maintenance history, or your regulatory compliance requirements. Ask them domain-specific questions and they will often produce confident, plausible-sounding answers that are wrong.
Two Approaches: RAG vs. Fine-Tuning
RAG (Retrieval-Augmented Generation): Better for frequently changing knowledge. Cheaper. Start here.
Fine-Tuning: Better for consistent behavior, tone, and domain-specific reasoning patterns. Required when you need the model to think in your domain, not just reference it.
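To make the RAG side concrete, here is a minimal sketch of the retrieve-then-prompt pattern. It is illustrative only: the documents, the function names (`tokenize`, `retrieve`, `build_prompt`), and the word-overlap scoring are all assumptions made for this example. A real system would more likely use embedding search over your own document store.

```python
import re

# Hypothetical domain notes; in practice these come from your own documents.
DOCUMENTS = [
    "Pump station P-12 was last serviced in March and its impeller was replaced.",
    "Valve V-7 isolates the northern district main from the reservoir feed.",
    "Chlorine residual must be sampled daily at every distribution zone outlet.",
]

def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=1):
    """Rank documents by how many tokens they share with the question."""
    question_tokens = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(question, documents, top_k=1):
    """Put the retrieved passages in front of the question for the LLM."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, top_k))
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

question = "When was pump station P-12 last serviced?"
prompt = build_prompt(question, DOCUMENTS)
print(prompt)
```

Because the knowledge lives in `DOCUMENTS` rather than in the model weights, updating it means editing the document store, not retraining. That is why RAG suits frequently changing knowledge.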
Fine-Tuning with QLoRA
For most organizations, QLoRA (Quantized Low-Rank Adaptation) is the practical path:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model with 4-bit quantized weights; compute runs in float16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
)

# Attach trainable low-rank adapters to the attention query and value projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
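A rough back-of-the-envelope calculation shows why this configuration is cheap to train. LoRA adds two small matrices per adapted weight, for r * (d_in + d_out) extra parameters. The dimensions below are assumptions based on the published Mistral-7B configuration (hidden size 4096, 32 layers, and a 1024-wide value projection because of grouped-query attention). Check them against the model you actually load.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_in x d_out weight matrix."""
    return r * (d_in + d_out)

# Assumed Mistral-7B dimensions; verify against the loaded model's config.
R = 16             # LoRA rank, matching r=16 in the config above
HIDDEN = 4096      # hidden size (input and output width of q_proj)
KV_DIM = 1024      # output width of v_proj under grouped-query attention
NUM_LAYERS = 32    # number of transformer layers

per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, KV_DIM, R)
total_trainable = per_layer * NUM_LAYERS
print(total_trainable)  # 6815744
```

Under these assumptions the adapters hold about 6.8 million trainable parameters, roughly 0.1% of the base model's approximately 7 billion. The frozen base weights stay in 4-bit form.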
Dataset Quality > Dataset Size
For domain fine-tuning, 1,000 high-quality Q&A pairs written by domain experts will often outperform 100,000 scraped examples, because noisy examples teach the model noisy behavior. Garbage in, garbage out.
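As one illustration of what enforcing quality can look like, the sketch below drops empty questions, near-duplicate questions, and answers too short to be useful. The `clean_pairs` function, the field names, and the 20-character threshold are hypothetical choices for this example. They do not replace expert review of the remaining pairs.

```python
def clean_pairs(pairs, min_answer_chars=20):
    """Keep one copy of each question, and only pairs with a usable answer."""
    seen_questions = set()
    kept = []
    for pair in pairs:
        # Collapse repeated whitespace so trivially different copies match.
        question = " ".join(pair.get("question", "").split())
        answer = " ".join(pair.get("answer", "").split())
        if not question or len(answer) < min_answer_chars:
            continue
        key = question.lower()
        if key in seen_questions:
            continue
        seen_questions.add(key)
        kept.append({"question": question, "answer": answer})
    return kept

# Hypothetical raw data: one good pair, one duplicate, one stub, one orphan.
raw_pairs = [
    {"question": "What is the service interval for pump P-12?",
     "answer": "Every six months, per the maintenance schedule."},
    {"question": "what is the service interval for pump  P-12?",
     "answer": "Every six months according to the schedule."},
    {"question": "Which valve isolates the northern main?",
     "answer": "See above."},
    {"question": "",
     "answer": "An answer with no question attached to it."},
]

cleaned = clean_pairs(raw_pairs)
print(len(cleaned))  # 1
```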
Evaluation Metrics That Matter
- Domain accuracy on held-out test set
- Hallucination rate (factual consistency)
- Response latency at P95
- Human evaluation from domain experts
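The first three metrics can be computed from a simple log of test results, as in the sketch below. The record fields (`correct`, `hallucinated`, `latency_s`) and the sample values are assumptions for illustration. The correctness and hallucination labels would in practice come from domain experts or an automated consistency check. P95 here uses the nearest-rank method.

```python
import math

def p95(values):
    """95th percentile by the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

def summarize(results):
    """Aggregate per-question results from a held-out test set."""
    n = len(results)
    return {
        "domain_accuracy": sum(r["correct"] for r in results) / n,
        "hallucination_rate": sum(r["hallucinated"] for r in results) / n,
        "latency_p95_s": p95([r["latency_s"] for r in results]),
    }

# Hypothetical results for four held-out questions.
results = [
    {"correct": True, "hallucinated": False, "latency_s": 0.8},
    {"correct": True, "hallucinated": False, "latency_s": 1.1},
    {"correct": False, "hallucinated": True, "latency_s": 0.9},
    {"correct": True, "hallucinated": False, "latency_s": 2.4},
]

summary = summarize(results)
print(summary)
```

Tracking these numbers before and after fine-tuning, on the same held-out set, shows whether the change helped.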