You have a GPU. Maybe it’s an RTX 3090 that mostly renders game frames, or a 4090 that’s been sitting underutilized since your last Stable Diffusion phase. Meanwhile, you’re paying OpenAI or Anthropic real money every month for a model that doesn’t know your domain, your style, or your internal terminology.
LoRA fine-tuning is the bridge. It lets you inject domain knowledge into a capable base model for a fraction of the compute cost of full fine-tuning. The problem is that most tutorials show you how to run a single training job manually. They don’t show you how to turn it into a pipeline — something that watches for new data, retrains automatically, and hot-swaps the adapter without human intervention.
That’s what this article is about.
By the end, you’ll have a working system that:
- Watches a directory for new training data
- Validates and preprocesses it into the right format
- Kicks off a LoRA training run on your local GPU
- Merges and quantizes the result
- Deploys it to a local Ollama endpoint automatically
- Logs everything so you can audit what changed and when
The reference implementation lives at https://github.com/huggingface/peft. We’ll be using PEFT heavily.
Prerequisites
- Linux host (or WSL2 with CUDA passthrough, though native Linux is strongly preferred)
- NVIDIA GPU with at least 12GB VRAM (16GB+ recommended for 7B models)
- CUDA 12.x installed, and nvidia-smi returns something useful
- Python 3.11+, pip, and virtualenv
- Docker and Docker Compose (for the serving layer)
- Ollama installed and running (ollama serve)
- Git
Hardware reality check: you can fine-tune a Mistral 7B with 4-bit quantized base weights on 12GB VRAM. A Llama 3 8B fits the same envelope. For 13B models, 24GB is the practical floor. Below 12GB, look at 3B models or expect gradient checkpointing to slow things down significantly.
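As a rough sanity check on those numbers, the weight footprint alone can be estimated from parameter count and bits per weight. This is a back-of-the-envelope sketch: the 7.25 billion parameter figure for a 7B-class model and the helper name are assumptions, and it ignores activations, gradients, optimizer state, and quantization overhead, which is where the rest of the VRAM goes.

```python
def weight_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


if __name__ == "__main__":
    n_params = 7.25e9  # assumed size of a "7B" model
    for label, bits in [("bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{label:>5}: {weight_gib(n_params, bits):5.1f} GiB")
```

The gap between the 4-bit figure and a 12GB card is the budget left for activations and training state, which is why sequence length and batch size matter so much at this size.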
Architecture Overview
The pipeline has four stages that chain together automatically:
[Data Watcher] → [Preprocessor] → [Trainer] → [Deployer]
Each stage is a standalone Python script. A small orchestrator daemon wires them together. The whole thing runs as a systemd service so it survives reboots and failures.
Stage 1: Dataset Preparation
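LoRA itself is format-agnostic, but instruction tuning needs every sample rendered in one consistent chat format. The common choices are the ChatML template, the Alpaca format, or the base model's own native template. We'll target ChatML: Qwen uses it natively, and models that ship with a different template (Mistral, Llama 3) can be trained onto it as long as the ChatML special tokens are added to the tokenizer.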
Create your project structure first:
mkdir -p ~/lora-pipeline/{data/{raw,processed},adapters,logs,scripts}
cd ~/lora-pipeline
python3 -m venv .venv && source .venv/bin/activate
pip install transformers peft datasets accelerate bitsandbytes tqdm
Your raw data goes into data/raw/ as JSONL files. Each line is a JSON object with at minimum a prompt and response field, or already-formatted messages arrays. The preprocessor handles both.
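For reference, the two accepted record shapes look like this. The sample text and the role fields in the messages variant are illustrative assumptions, not a schema the pipeline enforces, and extract_pair is a hypothetical helper that mirrors what the preprocessor does with each line.

```python
import json

flat_record = {
    "prompt": "What does the nightly reindex job do?",
    "response": "It rebuilds the search index from the primary database.",
}
messages_record = {
    "messages": [
        {"role": "user", "content": "What does the nightly reindex job do?"},
        {"role": "assistant", "content": "It rebuilds the search index from the primary database."},
    ]
}

# One JSON object per line is what makes the file JSONL
jsonl_text = "\n".join(json.dumps(r) for r in (flat_record, messages_record))


def extract_pair(obj: dict) -> tuple[str, str]:
    """Return (prompt, response) from either record shape."""
    if "messages" in obj:
        first, second = obj["messages"][:2]
        return first["content"].strip(), second["content"].strip()
    return obj["prompt"].strip(), obj["response"].strip()


pairs = [extract_pair(json.loads(line)) for line in jsonl_text.splitlines()]
```

Both lines reduce to the same (prompt, response) pair, which is why the two formats can be mixed freely in one file.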
scripts/preprocess.py:
#!/usr/bin/env python3
"""
Preprocessor: converts raw JSONL → ChatML-formatted dataset for training.
Validates structure, filters short/malformed samples, deduplicates.
"""
import json
import hashlib
import argparse
from pathlib import Path
from datasets import Dataset

CHATML_TEMPLATE = "<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n{response}<|im_end|>"
MIN_CHARS = 50    # discard samples shorter than this
MAX_CHARS = 4096  # discard samples longer than this (adjust for your context window)


def hash_sample(text: str) -> str:
    return hashlib.md5(text.encode()).hexdigest()


def load_raw(path: Path) -> list[dict]:
    samples = []
    seen = set()
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as e:
                print(f"  [WARN] line {lineno}: {e}")
                continue
            # Support both flat {prompt, response} and messages array format
            if "messages" in obj:
                turns = obj["messages"]
                if len(turns) < 2:
                    continue
                prompt = turns[0].get("content", "").strip()
                response = turns[1].get("content", "").strip()
            else:
                prompt = obj.get("prompt", "").strip()
                response = obj.get("response", "").strip()
            # Drop samples with a missing side
            if not prompt or not response:
                continue
            # Measure the content, not the templated text: the ChatML wrapper alone
            # is 60 characters, so templated text would always pass MIN_CHARS
            content_len = len(prompt) + len(response)
            if content_len < MIN_CHARS or content_len > MAX_CHARS:
                continue
            text = CHATML_TEMPLATE.format(prompt=prompt, response=response)
            h = hash_sample(text)
            if h in seen:
                continue
            seen.add(h)
            samples.append({"text": text})
    return samples


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-dir", required=True)
    parser.add_argument("--output-path", required=True)
    args = parser.parse_args()

    all_samples = []
    for raw_file in Path(args.input_dir).glob("*.jsonl"):
        print(f"Processing {raw_file.name}...")
        samples = load_raw(raw_file)
        print(f"  → {len(samples)} valid samples")
        all_samples.extend(samples)

    if not all_samples:
        raise ValueError("No valid samples found. Check your input files.")

    dataset = Dataset.from_list(all_samples)
    dataset = dataset.train_test_split(test_size=0.05, seed=42)
    dataset.save_to_disk(args.output_path)
    print(f"\nDataset saved: {len(all_samples)} total samples → {args.output_path}")


if __name__ == "__main__":
    main()
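Gotcha #1: ChatML tokens (<|im_start|>, <|im_end|>) must be in the tokenizer’s vocabulary. For Mistral, they aren’t by default, so you need to add them. The trainer script below handles this; if you skip that step, the markers get split into ordinary sub-word pieces, the model has a much harder time learning turn boundaries, and generation tends to run on past the end of the answer. One caveat: the embedding rows for newly added tokens start untrained, and LoRA on the attention projections alone does not update them. If the model still fails to stop at <|im_end|>, listing embed_tokens and lm_head under modules_to_save in LoraConfig trains those layers too, at a noticeable VRAM cost.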
Stage 2: LoRA Training Loop
The training script is the core of the pipeline. It loads the base model in 4-bit quantized form (QLoRA), attaches LoRA adapters to the attention layers, trains, and saves only the adapter weights — not the full 15GB model.
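To see why the adapter is so small, count its parameters: a rank-r adapter on a linear layer with d_in inputs and d_out outputs adds two matrices with r × (d_in + d_out) weights in total. The sketch below assumes Mistral 7B's published dimensions (32 layers, hidden size 4096, key/value projection width 1024) together with the rank and target modules used in this article; the helper name is illustrative.

```python
def lora_params(r: int, d_in: int, d_out: int) -> int:
    """Weights in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


HIDDEN = 4096  # assumed hidden size
KV_DIM = 1024  # assumed k/v projection width (8 KV heads x 128)
LAYERS = 32    # assumed number of transformer layers
RANK = 16

per_layer = (
    lora_params(RANK, HIDDEN, HIDDEN)    # q_proj
    + lora_params(RANK, HIDDEN, KV_DIM)  # k_proj
    + lora_params(RANK, HIDDEN, KV_DIM)  # v_proj
    + lora_params(RANK, HIDDEN, HIDDEN)  # o_proj
)
total = per_layer * LAYERS
adapter_mib_16bit = total * 2 / 2**20
print(f"{total:,} trainable parameters, about {adapter_mib_16bit:.0f} MiB at 16 bits each")
```

Under these assumptions the adapter comes to roughly 13.6 million weights, a fraction of a percent of the base model.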
scripts/train.py:
#!/usr/bin/env python3
"""
QLoRA trainer: loads base model in 4-bit, trains LoRA adapters, saves to adapters/.
"""
import os
import sys
import json
import argparse
from datetime import datetime
from pathlib import Path

import torch
from datasets import load_from_disk
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
)
from peft import LoraConfig, get_peft_model, TaskType

# ──────────────────────────────────────────────
# Defaults — override via CLI or config.json
# ──────────────────────────────────────────────
DEFAULTS = {
    "base_model": "mistralai/Mistral-7B-Instruct-v0.3",
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "learning_rate": 2e-4,
    "num_epochs": 3,
    "batch_size": 2,
    "gradient_accumulation_steps": 8,
    "max_seq_length": 2048,
    "warmup_ratio": 0.03,
}


def load_config(config_path: str) -> dict:
    if config_path and Path(config_path).exists():
        with open(config_path) as f:
            return {**DEFAULTS, **json.load(f)}
    return DEFAULTS.copy()


def add_special_tokens_if_missing(tokenizer, model):
    special_tokens = ["<|im_start|>", "<|im_end|>"]
    existing = set(tokenizer.get_vocab().keys())
    missing = [t for t in special_tokens if t not in existing]
    if missing:
        tokenizer.add_special_tokens({"additional_special_tokens": missing})
        model.resize_token_embeddings(len(tokenizer))
        print(f"Added special tokens: {missing}")


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset-path", required=True)
    parser.add_argument("--output-dir", required=True)
    parser.add_argument("--config", default="")
    parser.add_argument("--run-name", default=datetime.now().strftime("%Y%m%d_%H%M%S"))
    args = parser.parse_args()

    cfg = load_config(args.config)
    adapter_path = Path(args.output_dir) / args.run_name
    adapter_path.mkdir(parents=True, exist_ok=True)
    print(f"Run: {args.run_name}")
    print(f"Base model: {cfg['base_model']}")

    # ── Load model in 4-bit (QLoRA) ──
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(cfg["base_model"], use_fast=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(
        cfg["base_model"],
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
    )
    model.config.use_cache = False  # required for gradient checkpointing
    # Add tokens before enabling input grads: resizing can replace the embedding
    # module, and enable_input_require_grads() hooks the module present when called
    add_special_tokens_if_missing(tokenizer, model)
    model.enable_input_require_grads()

    # ── Attach LoRA adapters ──
    lora_config = LoraConfig(
        r=cfg["lora_r"],
        lora_alpha=cfg["lora_alpha"],
        target_modules=cfg["target_modules"],
        lora_dropout=cfg["lora_dropout"],
        bias="none",
        task_type=TaskType.CAUSAL_LM,
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    # ── Load and tokenize dataset ──
    dataset = load_from_disk(args.dataset_path)

    def tokenize(examples):
        return tokenizer(
            examples["text"],
            truncation=True,
            max_length=cfg["max_seq_length"],
            padding=False,
        )

    tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

    # ── Training arguments ──
    training_args = TrainingArguments(
        output_dir=str(adapter_path / "checkpoints"),
        num_train_epochs=cfg["num_epochs"],
        per_device_train_batch_size=cfg["batch_size"],
        gradient_accumulation_steps=cfg["gradient_accumulation_steps"],
        gradient_checkpointing=True,
        learning_rate=cfg["learning_rate"],
        warmup_ratio=cfg["warmup_ratio"],
        lr_scheduler_type="cosine",
        optim="paged_adamw_8bit",
        fp16=False,
        bf16=True,
        logging_steps=10,
        save_strategy="epoch",
        # Called evaluation_strategy in transformers releases before 4.41
        eval_strategy="epoch",
        load_best_model_at_end=True,
        report_to="none",  # set to "wandb" if you want tracking
        run_name=args.run_name,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["test"],
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()

    model.save_pretrained(str(adapter_path / "adapter"))
    tokenizer.save_pretrained(str(adapter_path / "adapter"))

    # Write a manifest so the deployer knows what we built
    manifest = {
        "run_name": args.run_name,
        "base_model": cfg["base_model"],
        "adapter_path": str(adapter_path / "adapter"),
        "trained_at": datetime.now().isoformat(),
        "train_samples": len(tokenized["train"]),
    }
    with open(adapter_path / "manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)

    print(f"\nAdapter saved to: {adapter_path / 'adapter'}")
    print(f"Manifest: {adapter_path / 'manifest.json'}")


if __name__ == "__main__":
    main()
Gotcha #2: gradient_checkpointing=True with PEFT requires model.enable_input_require_grads() — without that call, the gradients on the first layer’s input simply don’t exist, and training silently produces garbage. This bit me for an afternoon.
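Gotcha #3: paged_adamw_8bit is the optimizer that makes QLoRA practical on 12GB cards. It keeps optimizer state in 8-bit and can page it out to CPU RAM during memory spikes, where standard AdamW is more likely to OOM. If you’re on 24GB and want faster training steps, adamw_torch_fused is a better choice.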
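The batch and warmup defaults are easier to reason about as step counts. A small sketch, assuming a single GPU, the values from DEFAULTS, and a hypothetical 5,000-sample dataset with the 5% evaluation split taken out; the Trainer's own rounding may differ by a step or so.

```python
import math


def schedule(n_train: int, batch_size: int, grad_accum: int,
             epochs: int, warmup_ratio: float) -> dict:
    effective_batch = batch_size * grad_accum  # samples per optimizer step
    steps_per_epoch = math.ceil(n_train / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
    }


plan = schedule(n_train=4750, batch_size=2, grad_accum=8, epochs=3, warmup_ratio=0.03)
print(plan)
```

If you lower batch_size to fit a smaller card, raising gradient_accumulation_steps by the same factor keeps the effective batch, and therefore the learning-rate behavior, roughly unchanged.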
Stage 3: Automated Deployment with Ollama
Once the adapter is trained, it has to reach Ollama in a form Ollama can serve.
Ollama’s support for loading PEFT adapters directly is limited (its Modelfile ADAPTER instruction only covers certain architectures), so this pipeline takes the more robust route: merge the adapter into the base model, convert the merged weights to GGUF, quantize, and register the result. GGUF is also the practical choice on local hardware because it allows per-layer offloading.
scripts/deploy.py:
#!/usr/bin/env python3
"""
Deployer: merges LoRA adapter into base model, converts to GGUF, registers with Ollama.
Requires llama.cpp's convert script on PATH as 'convert_hf_to_gguf.py'.
"""
import sys
import json
import shutil
import subprocess
import argparse
from pathlib import Path
from datetime import datetime

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel


def merge_adapter(manifest: dict, merged_dir: Path):
    print("Merging adapter into base model...")
    base_model = manifest["base_model"]
    adapter_path = manifest["adapter_path"]
    tokenizer = AutoTokenizer.from_pretrained(adapter_path)
    base = AutoModelForCausalLM.from_pretrained(
        base_model,
        torch_dtype=torch.bfloat16,
        device_map="cpu",  # merge on CPU to avoid VRAM pressure
    )
    # train.py may have added ChatML tokens; grow the embeddings to match the
    # saved tokenizer so the merged model lines up with its vocabulary
    if len(tokenizer) > base.get_input_embeddings().num_embeddings:
        base.resize_token_embeddings(len(tokenizer))
    model = PeftModel.from_pretrained(base, adapter_path)
    model = model.merge_and_unload()
    model.save_pretrained(str(merged_dir), safe_serialization=True)
    tokenizer.save_pretrained(str(merged_dir))
    print(f"Merged model saved to {merged_dir}")


def convert_to_gguf(merged_dir: Path, gguf_path: Path, quant: str = "q4_k_m"):
    print(f"Converting to GGUF ({quant})...")
    # python3 does not search PATH for script arguments, so resolve the
    # converter explicitly (the file must be executable for which() to find it)
    convert_script = shutil.which("convert_hf_to_gguf.py")
    if convert_script is None:
        raise FileNotFoundError("convert_hf_to_gguf.py not found on PATH")
    # Step 1: HF → fp16 GGUF, run with the same interpreter as this script
    fp16_gguf = gguf_path.parent / "model-fp16.gguf"
    subprocess.run(
        [sys.executable, convert_script, str(merged_dir),
         "--outfile", str(fp16_gguf), "--outtype", "f16"],
        check=True,
    )
    # Step 2: Quantize with llama-quantize
    subprocess.run(
        ["llama-quantize", str(fp16_gguf), str(gguf_path), quant],
        check=True,
    )
    fp16_gguf.unlink()  # clean up the large intermediate file
    print(f"GGUF saved to {gguf_path}")
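

def register_with_ollama(gguf_path: Path, model_name: str):
    # Plain (non f-string) literals keep the Go-template braces intact.
    # The TEMPLATE block mirrors the ChatML layout used for the training samples.
    modelfile_content = "\n".join([
        f"FROM {gguf_path.resolve()}",
        'TEMPLATE """<|im_start|>user',
        "{{ .Prompt }}<|im_end|>",
        "<|im_start|>assistant",
        '"""',
        'PARAMETER stop "<|im_end|>"',
        "PARAMETER temperature 0.7",
        "PARAMETER top_p 0.9",
        "",
    ])
    modelfile_path = gguf_path.parent / "Modelfile"
    modelfile_path.write_text(modelfile_content)
    print(f"Registering {model_name} with Ollama...")
    subprocess.run(
        ["ollama", "create", model_name, "-f", str(modelfile_path)],
        check=True,
    )
    print(f"Model {model_name} is live. Test it: ollama run {model_name}")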


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--manifest", required=True)
    parser.add_argument("--work-dir", required=True)
    parser.add_argument("--model-name", default="finetuned-local")
    parser.add_argument("--quant", default="q4_k_m")
    args = parser.parse_args()

    with open(args.manifest) as f:
        manifest = json.load(f)

    work_dir = Path(args.work_dir) / manifest["run_name"]
    work_dir.mkdir(parents=True, exist_ok=True)
    merged_dir = work_dir / "merged"
    gguf_path = work_dir / "model.gguf"

    merge_adapter(manifest, merged_dir)
    convert_to_gguf(merged_dir, gguf_path, args.quant)
    register_with_ollama(gguf_path, args.model_name)

    # Clean up the large merged weights — GGUF is what we serve
    shutil.rmtree(merged_dir)
    print(f"\nDeployment complete: {datetime.now().isoformat()}")


if __name__ == "__main__":
    main()
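Gotcha #4: The merge happens on CPU because it needs the base model in full 16-bit precision, roughly 14.5GB of weights for a 7B model. That does not fit on a 12GB card at all and leaves little headroom on 16GB. Merging on CPU is slow (15-30 minutes for a 7B model) and needs that much free system RAM, but it works reliably. This is why the base model load sets device_map="cpu".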
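Gotcha #5: llama-quantize comes from building llama.cpp from source, and convert_hf_to_gguf.py ships in the same repository. Most distributions don’t package either. Build llama.cpp once, add its build/bin/ to your PATH, and link the converter into a directory that is also on your PATH. The converter needs its Python dependencies (at minimum the gguf package) in the pipeline’s virtualenv:
git clone https://github.com/ggerganov/llama.cpp ~/llama.cpp
# Current llama.cpp uses -DGGML_CUDA=ON; older checkouts used -DLLAMA_CUDA=ON
cd ~/llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j$(nproc)
mkdir -p ~/bin && ln -s ~/llama.cpp/convert_hf_to_gguf.py ~/bin/
echo 'export PATH="$HOME/bin:$HOME/llama.cpp/build/bin:$PATH"' >> ~/.bashrc
~/lora-pipeline/.venv/bin/pip install gguf sentencepiece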
Stage 4: The Orchestrator
The three scripts above are building blocks. The orchestrator watches for new data and chains them together.
scripts/orchestrator.py:
#!/usr/bin/env python3
"""
Pipeline orchestrator: waits for a trigger file, then runs the full pipeline
over the .jsonl files in data/raw/. Uses a lockfile to prevent concurrent runs.
"""
import os
import time
import fcntl
import subprocess
import logging
from pathlib import Path
from datetime import datetime

# Resolve from this file's location so paths do not depend on the working directory
BASE_DIR = Path(__file__).resolve().parent.parent
RAW_DIR = BASE_DIR / "data/raw"
PROCESSED_DIR = BASE_DIR / "data/processed"
ADAPTERS_DIR = BASE_DIR / "adapters"
TRIGGER_FILE = BASE_DIR / ".pipeline_trigger"
LOCK_FILE = BASE_DIR / ".pipeline.lock"
VENV_PYTHON = BASE_DIR / ".venv/bin/python3"
MODEL_NAME = os.getenv("PIPELINE_MODEL_NAME", "finetuned-local")

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    handlers=[
        logging.FileHandler(BASE_DIR / "logs/orchestrator.log"),
        logging.StreamHandler(),
    ],
)
log = logging.getLogger(__name__)


def run(cmd: list, **kwargs):
    log.info("Running: %s", " ".join(str(c) for c in cmd))
    # Script paths below are relative to the project root, so run from there
    kwargs.setdefault("cwd", BASE_DIR)
    result = subprocess.run(cmd, **kwargs)
    if result.returncode != 0:
        raise RuntimeError(f"Command failed with exit code {result.returncode}")
    return result


def pipeline():
    run_name = datetime.now().strftime("%Y%m%d_%H%M%S")
    processed_path = PROCESSED_DIR / run_name
    log.info("=== Pipeline run: %s ===", run_name)

    # Stage 1: Preprocess
    run([VENV_PYTHON, "scripts/preprocess.py",
         "--input-dir", str(RAW_DIR),
         "--output-path", str(processed_path)])

    # Stage 2: Train
    manifest_path = ADAPTERS_DIR / run_name / "manifest.json"
    run([VENV_PYTHON, "scripts/train.py",
         "--dataset-path", str(processed_path),
         "--output-dir", str(ADAPTERS_DIR),
         "--run-name", run_name])

    # Stage 3: Deploy
    run([VENV_PYTHON, "scripts/deploy.py",
         "--manifest", str(manifest_path),
         "--work-dir", str(ADAPTERS_DIR / "deploy"),
         "--model-name", MODEL_NAME])

    # Mark the processed dataset as consumed by this run
    processed_path.rename(PROCESSED_DIR / f"{run_name}.done")
    log.info("=== Pipeline complete: %s ===", run_name)


def watch():
    log.info("Watching %s for new data...", RAW_DIR)
    while True:
        if TRIGGER_FILE.exists():
            lock_fd = open(LOCK_FILE, "w")
            try:
                fcntl.flock(lock_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                TRIGGER_FILE.unlink()
                try:
                    pipeline()
                except Exception as e:
                    log.error("Pipeline failed: %s", e, exc_info=True)
                finally:
                    fcntl.flock(lock_fd, fcntl.LOCK_UN)
            except BlockingIOError:
                log.info("Another run already in progress, skipping trigger.")
            finally:
                lock_fd.close()
        time.sleep(30)


if __name__ == "__main__":
    watch()
To trigger a run manually, from a cron job, or from a file watcher:
touch ~/lora-pipeline/.pipeline_trigger
You can also set up inotifywait to trigger automatically on new JSONL files:
# Add to a separate tmux pane or systemd service
while inotifywait -e close_write ~/lora-pipeline/data/raw/; do
    touch ~/lora-pipeline/.pipeline_trigger
done
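If inotify-tools isn't available, a polling loop in Python can do the same job. This is a minimal sketch, not part of the pipeline scripts above: the function names are hypothetical, and it treats any change in a file's size or modification time as new data.

```python
import time
from pathlib import Path


def snapshot(raw_dir: Path) -> dict[str, tuple[int, int]]:
    """Map each .jsonl file name to (mtime_ns, size)."""
    result = {}
    for p in raw_dir.glob("*.jsonl"):
        st = p.stat()
        result[p.name] = (st.st_mtime_ns, st.st_size)
    return result


def poll_once(raw_dir: Path, trigger: Path, previous: dict) -> dict:
    """Touch the trigger file if any file was added or modified since `previous`."""
    current = snapshot(raw_dir)
    changed = any(previous.get(name) != meta for name, meta in current.items())
    if changed:
        trigger.touch()
    return current


def watch_forever(raw_dir: Path, trigger: Path, interval: float = 10.0) -> None:
    state = snapshot(raw_dir)  # files already present do not trigger a run
    while True:
        time.sleep(interval)
        state = poll_once(raw_dir, trigger, state)
```

Unlike close_write, polling can fire while a file is still being written. Writing to a temporary name without the .jsonl suffix and renaming it into data/raw/ once complete avoids picking up a half-written file.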
Systemd Service
Wire everything up as a persistent service:
/etc/systemd/system/lora-pipeline.service:
[Unit]
Description=LoRA Fine-Tuning Pipeline Orchestrator
After=network.target ollama.service

[Service]
Type=simple
User=your-username
WorkingDirectory=/home/your-username/lora-pipeline
ExecStart=/home/your-username/lora-pipeline/.venv/bin/python3 scripts/orchestrator.py
Restart=on-failure
RestartSec=30
Environment=PIPELINE_MODEL_NAME=my-domain-model
Environment=PYTHONUNBUFFERED=1
# systemd does not read ~/.bashrc, so the llama.cpp tools and ollama must be on PATH here
Environment=PATH=/home/your-username/bin:/home/your-username/llama.cpp/build/bin:/usr/local/bin:/usr/bin:/bin

[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now lora-pipeline
journalctl -u lora-pipeline -f
Production-Ready Additions
Eval before deploy: Don’t ship blindly. Write a small eval script that scores the new adapter on a held-out eval set (perplexity or a handful of gold-standard prompt/completion pairs). Gate the deploy step on passing a minimum score threshold. One bad data batch will cause silent regressions otherwise.
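One way to express that gate, assuming you already have the mean evaluation loss (cross-entropy in nats per token, as the Hugging Face Trainer reports it) for both the currently deployed adapter and the candidate, measured on the same held-out set. The 2% tolerance and the function names are placeholders to tune, not recommendations.

```python
import math


def perplexity(mean_loss: float) -> float:
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(mean_loss)


def should_deploy(candidate_loss: float, baseline_loss: float,
                  tolerance: float = 0.02) -> bool:
    """Allow deployment only if the candidate's perplexity is at most
    `tolerance` (relative) worse than the baseline's."""
    return perplexity(candidate_loss) <= perplexity(baseline_loss) * (1 + tolerance)
```

In the orchestrator this would sit between the train and deploy stages, with a failed gate logged and the previous model left in place.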
Training config versioning: Store a config.json alongside each run in the adapters directory. When something breaks, you’ll want to know the exact hyperparameters used, not the ones currently in your script.
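A sketch of what that could look like: write the resolved configuration (defaults merged with overrides) next to the adapter, along with a short hash so two runs can be compared at a glance. The file layout and function name are assumptions; train.py as shown does not do this.

```python
import hashlib
import json
from pathlib import Path


def snapshot_config(cfg: dict, run_dir: Path) -> str:
    """Write cfg to run_dir/config.json and return a short content hash."""
    run_dir.mkdir(parents=True, exist_ok=True)
    # sort_keys makes the serialization, and therefore the hash, deterministic
    payload = json.dumps(cfg, indent=2, sort_keys=True)
    (run_dir / "config.json").write_text(payload)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]
```

Recording the returned hash in the run's manifest makes it easy to spot when two adapters were trained with identical settings.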
VRAM headroom monitoring: Add a pre-flight check before training starts. If nvidia-smi shows less than 10GB free, abort with a log entry. Shared VRAM (game running in another session, another inference process) will cause cryptic OOM kills mid-training.
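A pre-flight check along these lines would do it. The sketch shells out to nvidia-smi with its CSV query flags and parses the free memory per GPU in MiB; the 10 GiB threshold mirrors the figure above, and the function names are illustrative.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"]


def parse_free_mib(output: str) -> list[int]:
    """One integer (MiB free) per GPU, one GPU per output line."""
    return [int(line) for line in output.splitlines() if line.strip()]


def enough_vram(output: str, min_free_mib: int = 10 * 1024, gpu_index: int = 0) -> bool:
    """True if the chosen GPU reports at least min_free_mib free."""
    return parse_free_mib(output)[gpu_index] >= min_free_mib


def preflight() -> bool:
    """Query nvidia-smi and apply the threshold to GPU 0."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return enough_vram(result.stdout)
```

Keeping the parsing separate from the subprocess call means the threshold logic can be tested on a machine without a GPU.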
Adapter versioning: Don’t overwrite the Ollama model on every run. Use timestamped tags (my-model:20260520) and keep my-model:latest pointing at the current best. Ollama has no symlinks, so "pointing" means copying or re-creating the tag, for example with ollama cp my-model:20260520 my-model:latest. This lets you roll back in 10 seconds with ollama create my-model:latest -f Modelfile-prev.
Data quality gate: Before preprocessing, count samples, check average response length, and flag if variance is too high or too low. A batch of 50-character responses training against a model that normally produces 500-character responses will ruin your loss curve without any obvious error.
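A minimal version of that check using only the standard library. The thresholds are placeholders and the function name is hypothetical; what counts as too short or too uniform depends on your data.

```python
import statistics


def quality_report(responses: list[str], min_samples: int = 100,
                   min_mean_chars: float = 200.0,
                   cv_range: tuple[float, float] = (0.1, 3.0)) -> dict:
    """Summarize response lengths and list any threshold violations."""
    lengths = [len(r) for r in responses]
    mean = statistics.fmean(lengths) if lengths else 0.0
    stdev = statistics.pstdev(lengths) if lengths else 0.0
    cv = stdev / mean if mean else 0.0  # coefficient of variation
    problems = []
    if len(lengths) < min_samples:
        problems.append("too few samples")
    if mean < min_mean_chars:
        problems.append("responses too short on average")
    if not cv_range[0] <= cv <= cv_range[1]:
        problems.append("length variance outside expected range")
    return {"count": len(lengths), "mean_chars": mean, "cv": cv, "problems": problems}
```

A non-empty problems list would abort the run before preprocessing, with the report written to the log so the offending batch can be inspected.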
Realistic Timelines
On an RTX 4090 with a 7B model and a dataset of ~5,000 samples, expect:
- Preprocessing: 2-3 minutes
- Training (3 epochs): 45-90 minutes
- Merge + GGUF conversion: 20-30 minutes
- Total wall clock: under 2 hours per cycle
On a 3090 with a 7B model: add ~40% to training time. On a 3080 (10GB), you’re looking at a 3B model or significant context window reduction.
Final Thoughts
This setup has a real superpower: feedback loops. Your internal tool generates text, users flag bad outputs, flagged outputs become training samples, the pipeline retrains overnight, and the model gets better without manual intervention. That’s the actual value here — not one-shot fine-tuning, but a continuously improving specialized model that you control entirely.
The weak points to watch are data quality and eval rigor. Anyone can build a pipeline that retrains. The discipline is building one that only deploys improvements. Add your eval gate before this sees production traffic.