Stop Paying Per Token: Build a Continuous LoRA Fine-Tuning Pipeline on Your Local GPU

You have a GPU. Maybe it’s an RTX 3090 that mostly renders game frames, or a 4090 that’s been sitting underutilized since your last Stable Diffusion phase. Meanwhile, you’re paying OpenAI or Anthropic real money every month for a model that doesn’t know your domain, your style, or your internal terminology.

LoRA fine-tuning is the bridge. It lets you inject domain knowledge into a capable base model for a fraction of the compute cost of full fine-tuning. The problem is that most tutorials show you how to run a single training job manually. They don’t show you how to turn it into a pipeline — something that watches for new data, retrains automatically, and hot-swaps the adapter without human intervention.

That’s what this article is about.

By the end, you’ll have a working system that:

Watches a directory for new training data
Validates and preprocesses it into the right format
Kicks off a LoRA training run on your local GPU
Merges and quantizes the result
Deploys it to a local Ollama endpoint automatically
Logs everything so you can audit what changed and when

The adapter tooling comes from Hugging Face PEFT (https://github.com/huggingface/peft), which we’ll be using heavily.

Prerequisites

Linux host (or WSL2 with CUDA passthrough, though native Linux is strongly preferred)
NVIDIA GPU with at least 12GB VRAM (16GB+ recommended for 7B models)
CUDA 12.x installed, nvidia-smi returns something useful
Python 3.11+, pip, and virtualenv
Docker and Docker Compose (optional; only needed if you containerize the serving layer)
Ollama installed and running (ollama serve)
Git

Hardware reality check: you can fine-tune a Mistral 7B with 4-bit quantized base weights on 12GB VRAM. A Llama 3 8B fits the same envelope. For 13B models, 24GB is the practical floor. Below 12GB, look at 3B models or expect gradient checkpointing to slow things down significantly.

Architecture Overview

The pipeline has four stages that chain together automatically:

[Data Watcher] → [Preprocessor] → [Trainer] → [Deployer]

The preprocessor, trainer, and deployer are standalone Python scripts. A small orchestrator daemon does the watching and wires them together. The whole thing runs as a systemd service so it survives reboots and failures.

Stage 1: Dataset Preparation

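Supervised LoRA fine-tuning works best when every sample is rendered in one consistent chat format. Common choices are ChatML and the Alpaca format, and many base models ship their own template: Mistral uses [INST] tags, Llama 3 uses header tokens, and Qwen uses ChatML natively. We’ll target ChatML because it is simple and widely supported. For models that don’t use it natively, that means adding its marker tokens to the tokenizer, which Gotcha #1 covers.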

Create your project structure first:

mkdir -p ~/lora-pipeline/{data/{raw,processed},adapters,logs,scripts}
cd ~/lora-pipeline
python3 -m venv .venv && source .venv/bin/activate
pip install transformers peft datasets accelerate bitsandbytes tqdm

Your raw data goes into data/raw/ as JSONL files. Each line is a JSON object with at minimum a prompt and response field, or already-formatted messages arrays. The preprocessor handles both.
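For concreteness, the two accepted record shapes might look like the sketch below. The sample text and the file name are made up for illustration, and the role field follows the common chat-messages convention, which is an assumption about your data rather than something the preprocessor enforces.

```python
import json
from pathlib import Path

# Two hypothetical records: one flat, one in messages form.
records = [
    {
        "prompt": "What does error E42 mean?",
        "response": "E42 means the ingest queue is full. Retry after a short backoff.",
    },
    {
        "messages": [
            {"role": "user", "content": "Summarize ticket 1187."},
            {"role": "assistant", "content": "The customer reports slow exports on large projects."},
        ]
    },
]


def write_jsonl(path: Path, rows: list[dict]) -> None:
    """Write one JSON object per line, which is the layout load_raw() reads."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Calling write_jsonl(Path("data/raw/example.jsonl"), records) produces a two-line file the preprocessor can pick up.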

scripts/preprocess.py:

#!/usr/bin/env python3
"""
Preprocessor: converts raw JSONL → ChatML-formatted dataset for training.
Validates structure, filters short/malformed samples, deduplicates.
"""
import json
import hashlib
import argparse
from pathlib import Path
from datasets import Dataset

CHATML_TEMPLATE = "<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n{response}<|im_end|>"
MIN_CHARS = 50  # discard samples shorter than this
MAX_CHARS = 4096  # discard samples longer than this (adjust for your context window)


def hash_sample(text: str) -> str:
    return hashlib.md5(text.encode()).hexdigest()


def load_raw(path: Path) -> list[dict]:
    samples = []
    seen = set()
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as e:
                print(f"  [WARN] line {lineno}: {e}")
                continue

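            # Support both flat {prompt, response} and messages array format
            if "messages" in obj:
                turns = obj["messages"]
                # Take the first user and first assistant turn so a leading
                # system message is not mistaken for the prompt
                prompt = next((t.get("content", "") for t in turns if t.get("role") == "user"), "").strip()
                response = next((t.get("content", "") for t in turns if t.get("role") == "assistant"), "").strip()
            else:
                prompt = obj.get("prompt", "").strip()
                response = obj.get("response", "").strip()

            if not prompt or not response:
                continue

            text = CHATML_TEMPLATE.format(prompt=prompt, response=response)

            # The template markup alone is 60 characters, so apply the minimum
            # to the content rather than to the rendered text
            if len(prompt) + len(response) < MIN_CHARS or len(text) > MAX_CHARS:
                continue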

            h = hash_sample(text)
            if h in seen:
                continue
            seen.add(h)
            samples.append({"text": text})

    return samples


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-dir", required=True)
    parser.add_argument("--output-path", required=True)
    args = parser.parse_args()

    all_samples = []
    for raw_file in Path(args.input_dir).glob("*.jsonl"):
        print(f"Processing {raw_file.name}...")
        samples = load_raw(raw_file)
        print(f"  → {len(samples)} valid samples")
        all_samples.extend(samples)

    if not all_samples:
        raise ValueError("No valid samples found. Check your input files.")

    dataset = Dataset.from_list(all_samples)
    dataset = dataset.train_test_split(test_size=0.05, seed=42)
    dataset.save_to_disk(args.output_path)
    print(f"\nDataset saved: {len(all_samples)} total samples → {args.output_path}")


if __name__ == "__main__":
    main()

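Gotcha #1: ChatML tokens (<|im_start|>, <|im_end|>) must be in the tokenizer’s vocabulary. For Mistral, they aren’t by default, so you need to add them. The trainer script below adds them and resizes the embedding matrix, but the new embedding rows start out untrained and stay frozen under an attention-only LoRA config. If the model never learns proper turn boundaries and generates forever, either add "embed_tokens" and "lm_head" to modules_to_save in LoraConfig (at a noticeable VRAM cost) or switch to the base model’s native chat template.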
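A quick pre-flight check for this gotcha can be sketched as a pure function over a vocabulary mapping. The helper name missing_chatml_tokens is hypothetical; it only assumes a dict of token string to id, which is what a Hugging Face tokenizer's get_vocab() returns.

```python
CHATML_TOKENS = ("<|im_start|>", "<|im_end|>")


def missing_chatml_tokens(vocab: dict[str, int]) -> list[str]:
    """Return the ChatML markers that have no dedicated entry in the vocabulary."""
    return [tok for tok in CHATML_TOKENS if tok not in vocab]
```

With a loaded tokenizer, missing_chatml_tokens(tokenizer.get_vocab()) returning a non-empty list means the markers would be split into several ordinary tokens unless they are added first.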

Stage 2: LoRA Training Loop

The training script is the core of the pipeline. It loads the base model in 4-bit quantized form (QLoRA), attaches LoRA adapters to the attention layers, trains, and saves only the adapter weights — not the full 15GB model.
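To see why the adapter is so much smaller than the base model, here is a rough parameter count. It is a sketch that assumes Mistral 7B's published dimensions (hidden size 4096, 32 layers, grouped-query key/value projections of width 1024) and the rank and target modules used as defaults in this article; the helper name is hypothetical.

```python
def lora_param_count(r: int, shapes: dict[str, tuple[int, int]], num_layers: int) -> int:
    """Each adapted (d_in, d_out) matrix adds two low-rank factors: r * (d_in + d_out) weights."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


# Assumed attention projection shapes for Mistral 7B (input width, output width).
MISTRAL_7B_ATTENTION = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}

trainable = lora_param_count(r=16, shapes=MISTRAL_7B_ATTENTION, num_layers=32)
print(f"{trainable:,} trainable parameters")  # 13,631,488
```

That is well under 1% of the base model's weights, which is why the saved adapter is measured in megabytes. The call to model.print_trainable_parameters() in the trainer reports the real figure for whatever model you load.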

scripts/train.py:

#!/usr/bin/env python3
"""
QLoRA trainer: loads base model in 4-bit, trains LoRA adapters, saves to adapters/.
"""
import os
import sys
import json
import argparse
from datetime import datetime
from pathlib import Path

import torch
from datasets import load_from_disk
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
)
from peft import LoraConfig, get_peft_model, TaskType

# ──────────────────────────────────────────────
# Defaults — override via CLI or config.json
# ──────────────────────────────────────────────
DEFAULTS = {
    "base_model": "mistralai/Mistral-7B-Instruct-v0.3",
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "learning_rate": 2e-4,
    "num_epochs": 3,
    "batch_size": 2,
    "gradient_accumulation_steps": 8,
    "max_seq_length": 2048,
    "warmup_ratio": 0.03,
}


def load_config(config_path: str) -> dict:
    if config_path and Path(config_path).exists():
        with open(config_path) as f:
            return {**DEFAULTS, **json.load(f)}
    return DEFAULTS.copy()


def add_special_tokens_if_missing(tokenizer, model):
    special_tokens = ["<|im_start|>", "<|im_end|>"]
    existing = set(tokenizer.get_vocab().keys())
    missing = [t for t in special_tokens if t not in existing]
    if missing:
        tokenizer.add_special_tokens({"additional_special_tokens": missing})
        model.resize_token_embeddings(len(tokenizer))
        print(f"Added special tokens: {missing}")


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset-path", required=True)
    parser.add_argument("--output-dir", required=True)
    parser.add_argument("--config", default="")
    parser.add_argument("--run-name", default=datetime.now().strftime("%Y%m%d_%H%M%S"))
    args = parser.parse_args()

    cfg = load_config(args.config)
    adapter_path = Path(args.output_dir) / args.run_name
    adapter_path.mkdir(parents=True, exist_ok=True)

    print(f"Run: {args.run_name}")
    print(f"Base model: {cfg['base_model']}")

    # ── Load model in 4-bit (QLoRA) ──
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(cfg["base_model"], use_fast=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(
        cfg["base_model"],
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
    )
    model.config.use_cache = False  # required for gradient checkpointing
    model.enable_input_require_grads()

    add_special_tokens_if_missing(tokenizer, model)

    # ── Attach LoRA adapters ──
    lora_config = LoraConfig(
        r=cfg["lora_r"],
        lora_alpha=cfg["lora_alpha"],
        target_modules=cfg["target_modules"],
        lora_dropout=cfg["lora_dropout"],
        bias="none",
        task_type=TaskType.CAUSAL_LM,
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    # ── Load and tokenize dataset ──
    dataset = load_from_disk(args.dataset_path)

    def tokenize(examples):
        return tokenizer(
            examples["text"],
            truncation=True,
            max_length=cfg["max_seq_length"],
            padding=False,
        )

    tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

    # ── Training arguments ──
    training_args = TrainingArguments(
        output_dir=str(adapter_path / "checkpoints"),
        num_train_epochs=cfg["num_epochs"],
        per_device_train_batch_size=cfg["batch_size"],
        gradient_accumulation_steps=cfg["gradient_accumulation_steps"],
        gradient_checkpointing=True,
        learning_rate=cfg["learning_rate"],
        warmup_ratio=cfg["warmup_ratio"],
        lr_scheduler_type="cosine",
        optim="paged_adamw_8bit",
        fp16=False,
        bf16=True,
        logging_steps=10,
        save_strategy="epoch",
        eval_strategy="epoch",  # named evaluation_strategy in transformers < 4.41
        load_best_model_at_end=True,
        report_to="none",  # set to "wandb" if you want tracking
        run_name=args.run_name,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["test"],
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )

    trainer.train()
    model.save_pretrained(str(adapter_path / "adapter"))
    tokenizer.save_pretrained(str(adapter_path / "adapter"))

    # Write a manifest so the deployer knows what we built
    manifest = {
        "run_name": args.run_name,
        "base_model": cfg["base_model"],
        "adapter_path": str(adapter_path / "adapter"),
        "trained_at": datetime.now().isoformat(),
        "train_samples": len(tokenized["train"]),
    }
    with open(adapter_path / "manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)

    print(f"\nAdapter saved to: {adapter_path / 'adapter'}")
    print(f"Manifest: {adapter_path / 'manifest.json'}")


if __name__ == "__main__":
    main()

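Gotcha #2: gradient_checkpointing=True with PEFT requires model.enable_input_require_grads(). Without that call, the inputs to the first checkpointed layer don’t require gradients, so nothing flows back to the adapters. Depending on your library versions, this surfaces either as a "does not require grad" error or as a run that never learns anything. This bit me for an afternoon.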

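Gotcha #3: paged_adamw_8bit is the optimizer that makes QLoRA practical on 12GB cards. The LoRA optimizer state itself is small, but paging lets it spill to CPU RAM during memory spikes (long sequences, checkpoint recomputation) instead of ending the run with an OOM. Standard AdamW can OOM in those moments. If you’re on 24GB, adamw_torch_fused is a better choice.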
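The default batch settings trade per-step memory for gradient accumulation. As a back-of-the-envelope sketch (the helper is hypothetical, and the Trainer's exact step count can differ slightly between versions), the effective batch size and number of optimizer steps work out like this:

```python
import math


def optimizer_steps(train_samples: int, batch_size: int, grad_accum: int, epochs: int) -> tuple[int, int]:
    """Return (effective_batch_size, approximate_total_optimizer_steps) for one GPU."""
    effective = batch_size * grad_accum
    steps_per_epoch = math.ceil(train_samples / effective)
    return effective, steps_per_epoch * epochs


# 5,000 samples with the 5% eval split held out leaves 4,750 for training.
effective, total = optimizer_steps(train_samples=4750, batch_size=2, grad_accum=8, epochs=3)
print(effective, total)  # 16 891
```

If you lower batch_size to save VRAM, raising gradient_accumulation_steps by the same factor keeps the effective batch size, and therefore the learning-rate behavior, roughly unchanged.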

Stage 3: Automated Deployment with Ollama

Once the adapter is trained, you need to get it in front of Ollama. There are two options: merge it into the base model and serve the merged weights, or keep it as a standalone adapter if your serving layer can load one.

This pipeline merges the adapter and converts the result to GGUF. Ollama’s Modelfile does have an ADAPTER instruction, but it covers a limited set of architectures and adapter formats, so a merged GGUF is the more predictable route. GGUF is also practical on local hardware because it allows per-layer offloading between GPU and CPU.

scripts/deploy.py:

#!/usr/bin/env python3
"""
Deployer: merges LoRA adapter into base model, converts to GGUF, registers with Ollama.
Requires llama.cpp's convert script on PATH as 'convert_hf_to_gguf.py'.
"""
import json
import shutil
import subprocess
import argparse
from pathlib import Path
from datetime import datetime

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel


def merge_adapter(manifest: dict, merged_dir: Path):
    print("Merging adapter into base model...")
    base_model = manifest["base_model"]
    adapter_path = manifest["adapter_path"]

    tokenizer = AutoTokenizer.from_pretrained(adapter_path)

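    base = AutoModelForCausalLM.from_pretrained(
        base_model,
        torch_dtype=torch.bfloat16,
        device_map="cpu",  # merge on CPU to avoid VRAM pressure
    )
    # Training may have added ChatML tokens; match the embedding size before loading the adapter
    if len(tokenizer) > base.get_input_embeddings().weight.shape[0]:
        base.resize_token_embeddings(len(tokenizer))
    model = PeftModel.from_pretrained(base, adapter_path)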
    model = model.merge_and_unload()

    model.save_pretrained(str(merged_dir), safe_serialization=True)
    tokenizer.save_pretrained(str(merged_dir))
    print(f"Merged model saved to {merged_dir}")


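def convert_to_gguf(merged_dir: Path, gguf_path: Path, quant: str = "q4_k_m"):
    print(f"Converting to GGUF ({quant})...")
    # python3 does not search PATH for script names, so resolve the full path first
    convert_script = shutil.which("convert_hf_to_gguf.py")
    if convert_script is None:
        raise FileNotFoundError("convert_hf_to_gguf.py not found on PATH (it must be executable)")
    # Step 1: HF → fp16 GGUF
    fp16_gguf = gguf_path.parent / "model-fp16.gguf"
    subprocess.run(
        ["python3", convert_script, str(merged_dir), "--outfile", str(fp16_gguf)],
        check=True,
    )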
    # Step 2: Quantize with llama-quantize
    subprocess.run(
        ["llama-quantize", str(fp16_gguf), str(gguf_path), quant],
        check=True,
    )
    fp16_gguf.unlink()  # clean up the large intermediate file
    print(f"GGUF saved to {gguf_path}")


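def register_with_ollama(gguf_path: Path, model_name: str):
    # Serve with the same ChatML layout used in training. Without an explicit
    # TEMPLATE, Ollama picks a prompt format on its own, which may not match.
    template = "<|im_start|>user\n{{ .Prompt }}<|im_end|>\n<|im_start|>assistant\n"
    modelfile_content = (
        f"FROM {gguf_path}\n"
        f'TEMPLATE """{template}"""\n'
        'PARAMETER stop "<|im_end|>"\n'
        "PARAMETER temperature 0.7\n"
        "PARAMETER top_p 0.9\n"
    )
    modelfile_path = gguf_path.parent / "Modelfile"
    modelfile_path.write_text(modelfile_content)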

    print(f"Registering {model_name} with Ollama...")
    subprocess.run(
        ["ollama", "create", model_name, "-f", str(modelfile_path)],
        check=True,
    )
    print(f"Model {model_name} is live. Test it: ollama run {model_name}")


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--manifest", required=True)
    parser.add_argument("--work-dir", required=True)
    parser.add_argument("--model-name", default="finetuned-local")
    parser.add_argument("--quant", default="q4_k_m")
    args = parser.parse_args()

    with open(args.manifest) as f:
        manifest = json.load(f)

    work_dir = Path(args.work_dir) / manifest["run_name"]
    work_dir.mkdir(parents=True, exist_ok=True)

    merged_dir = work_dir / "merged"
    gguf_path = work_dir / "model.gguf"

    merge_adapter(manifest, merged_dir)
    convert_to_gguf(merged_dir, gguf_path, args.quant)
    register_with_ollama(gguf_path, args.model_name)

    # Clean up the large merged weights — GGUF is what we serve
    shutil.rmtree(merged_dir)
    print(f"\nDeployment complete: {datetime.now().isoformat()}")


if __name__ == "__main__":
    main()

Gotcha #4: The merge happens on CPU (device_map="cpu" in merge_adapter) because the unquantized bf16 base model needs roughly 15GB for a 7B model on its own. That won’t fit on a 12GB card, and it leaves little headroom on a 24GB card if Ollama is already holding a model in VRAM. It’s slow (15-30 minutes for a 7B model), but it works reliably.

Gotcha #5: llama-quantize and convert_hf_to_gguf.py come from building llama.cpp from source. Most distributions don’t package these. Build llama.cpp once and add its build/bin/ to your PATH:

git clone https://github.com/ggerganov/llama.cpp ~/llama.cpp
cd ~/llama.cpp && cmake -B build -DGGML_CUDA=ON && cmake --build build -j$(nproc)
mkdir -p ~/bin && ln -s ~/llama.cpp/convert_hf_to_gguf.py ~/bin/
echo 'export PATH="$HOME/llama.cpp/build/bin:$HOME/bin:$PATH"' >> ~/.bashrc
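Since a missing tool only shows up after a long training run, it can help to check for the binaries up front. The following is a minimal sketch using the standard library; the function name and the tool list are assumptions based on what deploy.py invokes.

```python
import shutil

# Executables the deploy step shells out to (taken from deploy.py above).
REQUIRED_TOOLS = ("llama-quantize", "convert_hf_to_gguf.py", "ollama")


def missing_tools(tools=REQUIRED_TOOLS) -> list[str]:
    """Return the required executables that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Calling missing_tools() at the start of a pipeline run and aborting on a non-empty result turns a late deploy failure into an early, cheap one. Note that shutil.which only finds files that are marked executable.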

Stage 4: The Orchestrator

The three scripts above are building blocks. The orchestrator watches for new data and chains them together.

scripts/orchestrator.py:

#!/usr/bin/env python3
"""
Pipeline orchestrator: watches data/raw/ for new .jsonl files and runs the full pipeline.
Uses a lockfile to prevent concurrent runs.
"""
import os
import time
import fcntl
import subprocess
import logging
from pathlib import Path
from datetime import datetime

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    handlers=[
        logging.FileHandler("logs/orchestrator.log"),
        logging.StreamHandler(),
    ],
)
log = logging.getLogger(__name__)

BASE_DIR = Path(__file__).resolve().parent.parent
RAW_DIR = BASE_DIR / "data/raw"
PROCESSED_DIR = BASE_DIR / "data/processed"
ADAPTERS_DIR = BASE_DIR / "adapters"
TRIGGER_FILE = BASE_DIR / ".pipeline_trigger"
LOCK_FILE = BASE_DIR / ".pipeline.lock"
VENV_PYTHON = BASE_DIR / ".venv/bin/python3"
MODEL_NAME = os.getenv("PIPELINE_MODEL_NAME", "finetuned-local")


def run(cmd: list, **kwargs):
    log.info("Running: %s", " ".join(str(c) for c in cmd))
    # Stage scripts are referenced relative to the project root, so run them from there
    kwargs.setdefault("cwd", BASE_DIR)
    result = subprocess.run(cmd, **kwargs)
    if result.returncode != 0:
        raise RuntimeError(f"Command failed with exit code {result.returncode}")
    return result


def pipeline():
    run_name = datetime.now().strftime("%Y%m%d_%H%M%S")
    processed_path = PROCESSED_DIR / run_name

    log.info("=== Pipeline run: %s ===", run_name)

    # Stage 1: Preprocess
    run([VENV_PYTHON, "scripts/preprocess.py",
         "--input-dir", str(RAW_DIR),
         "--output-path", str(processed_path)])

    # Stage 2: Train
    manifest_path = ADAPTERS_DIR / run_name / "manifest.json"
    run([VENV_PYTHON, "scripts/train.py",
         "--dataset-path", str(processed_path),
         "--output-dir", str(ADAPTERS_DIR),
         "--run-name", run_name])

    # Stage 3: Deploy
    run([VENV_PYTHON, "scripts/deploy.py",
         "--manifest", str(manifest_path),
         "--work-dir", str(ADAPTERS_DIR / "deploy"),
         "--model-name", MODEL_NAME])

    # Mark this run's processed dataset as done (raw files stay put, so every run retrains on all of data/raw)
    processed_path.rename(PROCESSED_DIR / f"{run_name}.done")
    log.info("=== Pipeline complete: %s ===", run_name)


def watch():
    log.info("Watching %s for new data...", RAW_DIR)
    while True:
        if TRIGGER_FILE.exists():
            lock_fd = open(LOCK_FILE, "w")
            try:
                fcntl.flock(lock_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                TRIGGER_FILE.unlink()
                try:
                    pipeline()
                except Exception as e:
                    log.error("Pipeline failed: %s", e, exc_info=True)
                finally:
                    fcntl.flock(lock_fd, fcntl.LOCK_UN)
            except BlockingIOError:
                log.info("Another run already in progress, skipping trigger.")
            finally:
                lock_fd.close()
        time.sleep(30)


if __name__ == "__main__":
    watch()

To trigger a run manually or from a cron job or a file watcher:

touch ~/lora-pipeline/.pipeline_trigger

You can also set up inotifywait to trigger automatically on new JSONL files:

# Add to a separate tmux pane or systemd service
while inotifywait -e close_write -e moved_to ~/lora-pipeline/data/raw/; do
    touch ~/lora-pipeline/.pipeline_trigger
done

Systemd Service

Wire everything up as a persistent service:

/etc/systemd/system/lora-pipeline.service:

[Unit]
Description=LoRA Fine-Tuning Pipeline Orchestrator
After=network.target ollama.service

[Service]
Type=simple
User=your-username
WorkingDirectory=/home/your-username/lora-pipeline
ExecStart=/home/your-username/lora-pipeline/.venv/bin/python3 scripts/orchestrator.py
Restart=on-failure
RestartSec=30
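Environment=PIPELINE_MODEL_NAME=my-domain-model
Environment=PYTHONUNBUFFERED=1
# systemd does not read ~/.bashrc, so the llama.cpp tools must be put on PATH here
Environment=PATH=/home/your-username/lora-pipeline/.venv/bin:/home/your-username/llama.cpp/build/bin:/home/your-username/bin:/usr/local/bin:/usr/bin:/bin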

[Install]
WantedBy=multi-user.target

sudo systemctl enable --now lora-pipeline
journalctl -u lora-pipeline -f

Production-Ready Additions

Eval before deploy: Don’t ship blindly. Write a small eval script that scores the new adapter on a held-out eval set (perplexity or a handful of gold-standard prompt/completion pairs). Gate the deploy step on passing a minimum score threshold. One bad data batch will cause silent regressions otherwise.
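A gate of this kind can be very small. The sketch below compares eval loss against a stored baseline; the function names and the 2% tolerance are assumptions for illustration, and both losses must come from the same fixed held-out set for the comparison to mean anything.

```python
import math


def perplexity(eval_loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(eval_loss)


def should_deploy(new_loss: float, baseline_loss: float, max_regression: float = 0.02) -> bool:
    """Allow deployment only if the new loss is at most baseline * (1 + max_regression)."""
    return new_loss <= baseline_loss * (1.0 + max_regression)
```

With the Hugging Face Trainer, trainer.evaluate() returns a dict containing "eval_loss", which could be written into the manifest and read by the deployer before it merges anything.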

Training config versioning: Store a config.json alongside each run in the adapters directory. When something breaks, you’ll want to know the exact hyperparameters used, not the ones currently in your script.
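One way to do this, sketched under the assumption that the resolved config is a plain JSON-serializable dict like cfg in train.py, is to write it next to the manifest and log a short hash of its contents. The helper name is hypothetical.

```python
import hashlib
import json
from pathlib import Path


def snapshot_config(cfg: dict, run_dir: Path) -> str:
    """Write cfg to run_dir/config.json and return a short content hash for the logs."""
    run_dir.mkdir(parents=True, exist_ok=True)
    # sort_keys makes the hash independent of dict ordering
    payload = json.dumps(cfg, indent=2, sort_keys=True)
    (run_dir / "config.json").write_text(payload, encoding="utf-8")
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

Two runs with the same hash used identical hyperparameters, which makes it quick to rule the config in or out when a new adapter regresses.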

VRAM headroom monitoring: Add a pre-flight check before training starts. If nvidia-smi shows less than 10GB free, abort with a log entry. Shared VRAM (game running in another session, another inference process) will cause cryptic OOM kills mid-training.
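A pre-flight check along these lines could look like the sketch below. It relies on nvidia-smi's CSV query mode (--query-gpu=memory.free with --format=csv,noheader,nounits), which reports free memory in MiB, one line per GPU. The function names and the choice to look only at the first GPU are assumptions.

```python
import subprocess


def parse_free_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one free-memory value (MiB) per GPU from nvidia-smi's CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def free_vram_mib() -> list[int]:
    """Query free memory per GPU; raises if nvidia-smi is missing or fails."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_free_mib(result.stdout)


def has_headroom(free_mib: list[int], required_mib: int = 10 * 1024) -> bool:
    """True if the first GPU has at least required_mib free."""
    return bool(free_mib) and free_mib[0] >= required_mib
```

In the orchestrator, a failed has_headroom(free_vram_mib()) check would log the shortfall and skip the run instead of starting a training job that is likely to be killed halfway through.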

Adapter versioning: Don’t overwrite the Ollama model on every run. Create each build under a timestamped tag (my-model:20260520) and copy the current best to my-model:latest with ollama cp. Ollama has no symlinks, but copying a tag is cheap because the underlying blobs are shared. Rolling back then takes seconds: ollama cp my-model:<previous-tag> my-model:latest.
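Promotion and rollback can both be expressed as a tag copy. This sketch builds the ollama cp command for a given tag; the helper names are hypothetical, and it assumes ollama cp replaces an existing destination tag, which is worth verifying on your Ollama version.

```python
import subprocess


def promote_command(model: str, tag: str) -> list[str]:
    """Build the command that copies model:tag over model:latest."""
    return ["ollama", "cp", f"{model}:{tag}", f"{model}:latest"]


def promote(model: str, tag: str) -> None:
    """Point model:latest at the given build; use an older tag to roll back."""
    subprocess.run(promote_command(model, tag), check=True)
```

For example, promote("my-model", "20260520") makes that build current, and calling it again with the previous tag undoes the change.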

Data quality gate: Before preprocessing, count samples, check average response length, and flag if variance is too high or too low. A batch of 50-character responses training against a model that normally produces 500-character responses will ruin your loss curve without any obvious error.
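A simple version of this gate only needs the standard library. The thresholds below (at least 100 samples, mean response length of 200 characters) are placeholders to tune against your own data, and the function names are hypothetical.

```python
import statistics


def response_length_report(responses: list[str]) -> dict:
    """Summarize a batch of responses by count, mean length, and spread (in characters)."""
    lengths = [len(r) for r in responses]
    return {
        "count": len(lengths),
        "mean_chars": statistics.fmean(lengths) if lengths else 0.0,
        "stdev_chars": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
    }


def passes_quality_gate(report: dict, min_count: int = 100, min_mean_chars: float = 200.0) -> bool:
    """Reject batches that are too small or whose responses are suspiciously short."""
    return report["count"] >= min_count and report["mean_chars"] >= min_mean_chars
```

Logging the report for every batch, even the ones that pass, gives you a history to compare against when a later run looks off.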

Realistic Timelines

On an RTX 4090 with a 7B model and a dataset of ~5,000 samples, expect:

Preprocessing: 2-3 minutes
Training (3 epochs): 45-90 minutes
Merge + GGUF conversion: 20-30 minutes
Total wall clock: under 2 hours per cycle

On a 3090 with a 7B model: add ~40% to training time. On a 3080 (10GB), you’re looking at a 3B model or significant context window reduction.

Final Thoughts

This setup has a real superpower: feedback loops. Your internal tool generates text, users flag bad outputs, flagged outputs become training samples, the pipeline retrains overnight, and the model gets better without manual intervention. That’s the actual value here — not one-shot fine-tuning, but a continuously improving specialized model that you control entirely.
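The collection side of that loop can be sketched in a few lines. This is an illustration, not part of the pipeline scripts above: the function name and the per-day file naming are assumptions, while the data/raw directory and the .pipeline_trigger file are the ones the orchestrator already watches.

```python
import json
from datetime import date
from pathlib import Path


def record_feedback(base_dir: Path, prompt: str, corrected_response: str, trigger: bool = False) -> Path:
    """Append one corrected sample to today's feedback file; optionally request a pipeline run."""
    raw_dir = base_dir / "data" / "raw"
    raw_dir.mkdir(parents=True, exist_ok=True)
    out_file = raw_dir / f"feedback-{date.today().isoformat()}.jsonl"
    with open(out_file, "a", encoding="utf-8") as f:
        f.write(json.dumps({"prompt": prompt, "response": corrected_response}) + "\n")
    if trigger:
        (base_dir / ".pipeline_trigger").touch()
    return out_file
```

Leaving trigger off during the day and touching the trigger file from a nightly cron job gives the overnight retraining rhythm described above.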

The weak points to watch are data quality and eval rigor. Anyone can build a pipeline that retrains. The discipline is building one that only deploys improvements. Add your eval gate before this sees production traffic.