Your Home Server, on Autopilot: Building an Autonomous Maintenance Agent with Cron + Claude CLI

Running a home server is mostly fine — until it isn’t. Disk fills up at 3 AM. A container crashes silently. SSL certs expire because you forgot to check. Log rotation skips a service. You spend a Saturday afternoon not fixing the problem, but discovering there was a problem.

Monitoring dashboards help. Alert rules help more. But what you actually want is something that notices the problem and fixes it while you sleep. Not a brittle shell script with 200 if branches, not a full-blown Ansible playbook for every edge case — something that can reason about what it sees and take targeted corrective action.

That’s what this guide builds: a cron-scheduled autonomous agent that inspects your server, reasons over what it finds, and takes safe, auditable corrective actions using Claude CLI. The key constraint that makes this usable in production (or on a box you actually care about) is a write-protection layer that hard-limits what the agent can touch without human approval.


The Architecture in One Paragraph

Cron fires a shell wrapper every N minutes. The wrapper collects system state (disk, services, logs, cert expiry, container health) and pipes it into claude with a tightly scoped system prompt. Claude produces a list of corrective actions in a structured format. A second script parses that output, checks each action against an allowlist, executes what’s permitted, and logs everything — including what it refused. That’s it. No agent framework, no LangChain, no vector database.


Prerequisites

  • A Linux home server (Debian/Ubuntu assumed, adapt freely)
  • claude CLI installed and authenticated — Anthropic’s Claude Code CLI
  • jq, curl, systemctl, docker (or podman) available
  • Basic familiarity with cron and bash

Install the CLI if you haven’t:

npm install -g @anthropic-ai/claude-code
claude --version

Authenticate once interactively:

claude
# Follow the OAuth flow — credentials are cached in ~/.claude/

Step 1: Define What the Agent Is Allowed to Do

This is the single most important step. Before writing any automation, write your allowlist. Every action the agent can take must be explicit. Anything not on the list gets refused and logged — full stop.

Create /opt/homeagent/allowlist.conf:

# Each line: ACTION_TYPE:PATTERN
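# Patterns are extended regular expressions (not globs), matched with bash's [[ =~ ]].
# Anchor them with ^ and $ where you can: an unanchored pattern matches anywhere in the target.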

# Restart a specific set of services only
SYSTEMCTL_RESTART:^(nginx|caddy|docker|postgresql|redis|fail2ban)$

# Log rotation — only in /var/log
LOGROTATE:^/var/log/.*

# Docker container restart by name
DOCKER_RESTART:^[a-zA-Z0-9_-]+$

# Disk cleanup — only /tmp and a defined cache path
RM_SAFE:^(/tmp/|/var/cache/apt/archives/).*

# Certificate renewal via certbot
CERTBOT_RENEW:^[a-zA-Z0-9.-]+$

# Apt package upgrade — security patches only
APT_UPGRADE:^security$

Tight patterns. No wildcards that swallow the system. You’ll expand this over time as you build trust in the agent’s judgment.
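
To see how an entry behaves before you trust it, you can run the same [[ =~ ]] comparison the executor uses. A minimal sketch; the matches helper is hypothetical and not part of the agent scripts:

```shell
#!/usr/bin/env bash
# Hypothetical helper: does TARGET match the regex PATTERN of an allowlist entry?
matches() {
  local pattern="$1" target="$2"
  # The pattern is left unquoted on purpose so bash treats it as an extended regex
  if [[ "$target" =~ $pattern ]]; then echo "match"; else echo "no match"; fi
}

matches '^(nginx|caddy)$' 'nginx'            # match
matches '^(nginx|caddy)$' 'nginx-evil'       # no match: anchored at both ends
matches '/var/log/.*' '/etc/x/var/log/y'     # match: unanchored, so it matches mid-string
matches '^/var/log/.*' '/etc/x/var/log/y'    # no match once the pattern starts with ^
```

The third case is the one to watch for: a path pattern without a leading ^ accepts any target that merely contains the path.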


Step 2: Write the State Collector

The agent needs context to reason over. This script gathers it and outputs clean, terse text — not megabytes of log dumps.

/opt/homeagent/collect-state.sh:

#!/usr/bin/env bash
# Collects a lightweight server health snapshot for the maintenance agent.
set -euo pipefail

echo "=== TIMESTAMP ==="
date -u +"%Y-%m-%dT%H:%M:%SZ"

echo ""
echo "=== DISK USAGE (>70%) ==="
df -h --output=source,pcent,target | awk 'NR==1 || $2+0 > 70'

echo ""
echo "=== MEMORY ==="
free -h | grep -E "^(Mem|Swap)"

echo ""
echo "=== FAILED SYSTEMD UNITS ==="
systemctl list-units --state=failed --no-legend --no-pager 2>/dev/null || echo "none"

echo ""
echo "=== DOCKER UNHEALTHY / EXITED CONTAINERS ==="
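# Filters on different keys are ANDed, so query each condition separately
{ docker ps -a --filter "status=exited" --format "{{.Names}}\t{{.Status}}\t{{.Image}}"
  docker ps --filter "health=unhealthy" --format "{{.Names}}\t{{.Status}}\t{{.Image}}"
} 2>/dev/null || echo "none"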

echo ""
echo "=== SSL CERT EXPIRY (next 14 days) ==="
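# Glob only real cert directories; the live/ folder also contains a README,
# and a failed openssl call would abort the script under set -e + pipefail.
for cert in /etc/letsencrypt/live/*/fullchain.pem; do
  [[ -f "$cert" ]] || continue   # glob matched nothing
  domain=$(basename "$(dirname "$cert")")
  expiry=$(openssl x509 -enddate -noout -in "$cert" 2>/dev/null | cut -d= -f2) || continue
  exp_epoch=$(date -d "$expiry" +%s 2>/dev/null) || continue
  now_epoch=$(date +%s)
  days_left=$(( (exp_epoch - now_epoch) / 86400 ))
  if [[ $days_left -lt 14 ]]; then
    echo "$domain expires in ${days_left} days ($expiry)"
  fi
done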

echo ""
echo "=== RECENT HIGH-SEVERITY LOG LINES (last 15 min) ==="
# Limit to 40 lines max to avoid flooding the prompt
journalctl -p err --since "15 minutes ago" --no-pager -q 2>/dev/null \
  | tail -40 || echo "none"

echo ""
echo "=== LOAD AVERAGE ==="
uptime

Make it executable:

chmod +x /opt/homeagent/collect-state.sh

Run it manually once to verify the output looks sane. If a section returns garbage, fix it before hooking it to the agent.
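
If you want that manual check to be repeatable, one option is a small filter that confirms the expected section headers made it into the snapshot. This is only a sketch: missing_sections is a hypothetical helper, and the list of headers it checks is a sample, not the full set.

```shell
#!/usr/bin/env bash
# Hypothetical sanity check: reads a snapshot on stdin and reports any
# section header that collect-state.sh should have printed but did not.
missing_sections() {
  local snapshot section
  snapshot="$(cat)"
  for section in "TIMESTAMP" "MEMORY" "FAILED SYSTEMD UNITS" "LOAD AVERAGE"; do
    # grep -F matches the header literally, -q suppresses output
    grep -Fq "=== ${section} ===" <<<"$snapshot" || echo "missing: ${section}"
  done
}

# Usage: /opt/homeagent/collect-state.sh | missing_sections
```

No output means every listed header was found. A missing header usually means the script exited early at the section before it.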


Step 3: Write the System Prompt

This is your contract with the model. Be explicit, be restrictive, and leave no ambiguity about what format you expect back.

/opt/homeagent/system-prompt.txt:

You are a home server maintenance agent. You receive a health snapshot and produce corrective actions.

Rules:
- Output ONLY a JSON array of action objects. No prose, no explanation, no markdown fences.
- If nothing needs fixing, output an empty array: []
- Never invent actions not listed in the allowed types below.
- Prefer the least invasive action. Do not restart a service if a log rotation is sufficient.
- Never chain more than 3 actions in a single response.

Allowed action types and their required fields:
  { "type": "SYSTEMCTL_RESTART", "target": "<service-name>", "reason": "<one line>" }
  { "type": "DOCKER_RESTART",    "target": "<container-name>", "reason": "<one line>" }
  { "type": "LOGROTATE",         "target": "<log-path>", "reason": "<one line>" }
  { "type": "RM_SAFE",           "target": "<path>", "reason": "<one line>" }
  { "type": "CERTBOT_RENEW",     "target": "<domain>", "reason": "<one line>" }
  { "type": "APT_UPGRADE",       "target": "security", "reason": "<one line>" }
  { "type": "ALERT_ONLY",        "target": "human", "reason": "<describe issue that needs human attention>" }

Use ALERT_ONLY for anything outside your allowed scope: filesystem corruption, hardware issues,
unknown processes, unexplained high load, anything that feels wrong but you cannot safely fix.

The ALERT_ONLY type is important. It gives the model a safe exit valve when it recognizes something is wrong but shouldn’t touch it. You’ll hook this to email or a notification in a later step.
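
The prompt describes a contract, so it helps to be able to check a response against it. The sketch below uses jq to verify the shape only: an array of at most 3 objects with string type, target, and reason fields. valid_actions is a hypothetical helper and "jellyfin" is a made-up container name.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeeds only if stdin is a JSON array of at most 3
# objects, each with string "type", "target" and "reason" fields.
valid_actions() {
  jq -e '
    type == "array" and length <= 3 and
    all(.[]; type == "object"
             and (.type   | type == "string")
             and (.target | type == "string")
             and (.reason | type == "string"))
  ' >/dev/null 2>&1
}

sample='[{"type":"DOCKER_RESTART","target":"jellyfin","reason":"container exited"}]'
if valid_actions <<<"$sample"; then echo "ok"; else echo "rejected"; fi
```

This does not replace the allowlist. It only catches malformed output early, before the executor starts indexing into it.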


Step 4: The Action Executor with Allowlist Enforcement

This is where the write-protection lives. The executor parses the JSON, validates each action against the allowlist, and only executes what matches.

/opt/homeagent/execute-actions.sh:

#!/usr/bin/env bash
# Parses agent JSON output and executes only allowlisted actions.
set -euo pipefail

ALLOWLIST="/opt/homeagent/allowlist.conf"
AUDIT_LOG="/var/log/homeagent/audit.log"
NOTIFY_EMAIL="${ALERT_EMAIL:-}"  # set via environment or leave blank

mkdir -p "$(dirname "$AUDIT_LOG")"

log() {
  echo "[$(date -u +"%Y-%m-%dT%H:%M:%SZ")] $*" | tee -a "$AUDIT_LOG"
}

is_allowed() {
  local action_type="$1"
  local target="$2"

  while IFS=: read -r allowed_type allowed_pattern; do
    # Skip comments and blank lines
    [[ "$allowed_type" =~ ^# ]] && continue
    [[ -z "$allowed_type" ]] && continue

    if [[ "$action_type" == "$allowed_type" ]] && [[ "$target" =~ $allowed_pattern ]]; then
      return 0
    fi
  done < "$ALLOWLIST"

  return 1
}

execute_action() {
  local type="$1"
  local target="$2"
  local reason="$3"

  log "EXECUTING: type=$type target=$target reason=$reason"

  case "$type" in
    SYSTEMCTL_RESTART)
      systemctl restart "$target"
      ;;
    DOCKER_RESTART)
      docker restart "$target"
      ;;
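    LOGROTATE)
      # logrotate expects a config file, not a log path, so wrap the target in a
      # throwaway config (the rotation settings here are placeholders; adjust to taste)
      local tmp_conf; tmp_conf=$(mktemp)
      printf '"%s" {\n  rotate 4\n  compress\n  missingok\n}\n' "$target" > "$tmp_conf"
      logrotate -f "$tmp_conf"; rm -f "$tmp_conf"
      ;;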
    RM_SAFE)
      # Extra safety: refuse ".." components and anything outside /tmp or /var/cache
      if [[ "$target" != *..* && "$target" =~ ^(/tmp/|/var/cache/) ]]; then
        rm -rf -- "$target"
      else
        log "REFUSED RM_SAFE: path $target failed hardcoded safety check"
        return 1
      fi
      ;;
    CERTBOT_RENEW)
      certbot renew --cert-name "$target" --quiet --non-interactive
      ;;
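    APT_UPGRADE)
      apt-get update -qq
      # Collect packages whose pending upgrade comes from a security source.
      # An empty list must not fall through to a full "apt-get upgrade".
      local -a pkgs
      mapfile -t pkgs < <(apt-get upgrade --dry-run 2>/dev/null \
        | awk '/^Inst/ && tolower($0) ~ /security/ {print $2}')
      if (( ${#pkgs[@]} > 0 )); then
        apt-get install -y --only-upgrade "${pkgs[@]}"
      fi
      ;;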
    ALERT_ONLY)
      log "ALERT: $reason"
      if [[ -n "$NOTIFY_EMAIL" ]]; then
        echo "$reason" | mail -s "[homeagent] Manual attention needed" "$NOTIFY_EMAIL"
      fi
      ;;
    *)
      log "UNKNOWN action type: $type — skipped"
      ;;
  esac

  log "DONE: type=$type target=$target"
}

# Reads JSON array from stdin
ACTIONS_JSON="$(cat)"

action_count=$(echo "$ACTIONS_JSON" | jq 'length')
log "Agent returned $action_count action(s)"

for i in $(seq 0 $((action_count - 1))); do
  type=$(echo "$ACTIONS_JSON" | jq -r ".[$i].type")
  target=$(echo "$ACTIONS_JSON" | jq -r ".[$i].target")
  reason=$(echo "$ACTIONS_JSON" | jq -r ".[$i].reason")

  if is_allowed "$type" "$target"; then
    execute_action "$type" "$target" "$reason"
  else
    log "REFUSED: type=$type target=$target — not in allowlist"
  fi
done

Make it executable:

chmod +x /opt/homeagent/execute-actions.sh

Step 5: The Main Agent Loop

/opt/homeagent/run-agent.sh:

#!/usr/bin/env bash
# Main entry point: collect state → call Claude → execute actions.
set -euo pipefail

AGENT_DIR="/opt/homeagent"
SYSTEM_PROMPT_FILE="$AGENT_DIR/system-prompt.txt"
LOG_DIR="/var/log/homeagent"
RUN_LOG="$LOG_DIR/runs.log"

mkdir -p "$LOG_DIR"

log() {
  echo "[$(date -u +"%Y-%m-%dT%H:%M:%SZ")] $*" | tee -a "$RUN_LOG"
}

log "=== Agent run starting ==="

# Collect system state
STATE=$("$AGENT_DIR/collect-state.sh" 2>&1)
if [[ -z "$STATE" ]]; then
  log "State collection returned empty — aborting"
  exit 1
fi

log "State collected ($(echo "$STATE" | wc -l) lines)"

# Call Claude — pipe state as the user message, use -p for non-interactive mode
RESPONSE=$(echo "$STATE" | claude \
  --system-prompt-file "$SYSTEM_PROMPT_FILE" \
  --output-format text \
  -p "Analyze this server health snapshot and return your corrective actions JSON." \
  2>"$LOG_DIR/claude-stderr.log")

log "Claude responded ($(echo "$RESPONSE" | wc -c) bytes)"

# Validate that the response is a JSON array (the executor indexes into it)
if ! echo "$RESPONSE" | jq -e 'type == "array"' >/dev/null 2>&1; then
  log "ERROR: Claude did not return a JSON array. Check $LOG_DIR/claude-stderr.log"
  log "Raw response: $(echo "$RESPONSE" | head -5)"
  exit 1
fi

# Execute actions with allowlist enforcement
echo "$RESPONSE" | "$AGENT_DIR/execute-actions.sh"

log "=== Agent run complete ==="

Make it executable:

chmod +x /opt/homeagent/run-agent.sh
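
Despite the "no markdown fences" rule in the system prompt, models sometimes wrap JSON in a fenced block anyway, and that would fail the jq check in run-agent.sh. One way to tolerate it, as a sketch (strip_fences is a hypothetical helper, not something the CLI provides):

```shell
#!/usr/bin/env bash
# Hypothetical helper: delete lines that consist only of a markdown fence
# (three backticks, optionally followed by a language tag such as "json").
strip_fences() {
  local fence
  fence=$(printf '\140\140\140')   # three backticks, spelled in octal
  sed -E "/^[[:space:]]*${fence}[[:alnum:]]*[[:space:]]*\$/d"
}

# In run-agent.sh this would sit between the claude call and the jq check:
# RESPONSE=$(echo "$RESPONSE" | strip_fences)
```

Lines that are not fences pass through unchanged, so a clean response is unaffected.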

Step 6: Wire It to Cron

You want frequent-enough runs to catch issues fast, but not so frequent that you burn API tokens on a healthy server. Every 15 minutes is a solid default. During quiet hours, you might drop to 30.

crontab -e

Add:

# Home server maintenance agent — runs every 15 minutes
*/15 * * * * /opt/homeagent/run-agent.sh >> /var/log/homeagent/cron.log 2>&1
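
If you want the slower overnight cadence mentioned above, two entries can replace the single one. The quiet-hours window (23:00 to 06:59) is an assumption; pick whatever matches your household.

```
# Every 15 minutes from 07:00 to 22:59
*/15 7-22 * * * /opt/homeagent/run-agent.sh >> /var/log/homeagent/cron.log 2>&1
# Every 30 minutes overnight (23:00 to 06:59)
*/30 23,0-6 * * * /opt/homeagent/run-agent.sh >> /var/log/homeagent/cron.log 2>&1
```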

Or if you want a smarter schedule using systemd timers instead of raw cron (better logging, dependency handling):

# /etc/systemd/system/homeagent.service
[Unit]
Description=Home Server Maintenance Agent
After=network-online.target

[Service]
Type=oneshot
ExecStart=/opt/homeagent/run-agent.sh
User=root
EnvironmentFile=-/etc/homeagent.env
StandardOutput=journal
StandardError=journal

# /etc/systemd/system/homeagent.timer
[Unit]
Description=Run home server maintenance agent every 15 minutes

[Timer]
OnBootSec=5min
OnUnitActiveSec=15min
# Persistent= only affects OnCalendar= timers, so it is omitted here

[Install]
WantedBy=timers.target

Enable the timer:

systemctl daemon-reload
systemctl enable --now homeagent.timer
systemctl list-timers homeagent.timer

Gotchas

The model will occasionally hallucinate action types. That’s fine — your executor refuses anything not in the allowlist. The audit log catches every refusal. Review audit.log weekly for the first month to tune your prompt if needed.

Never run this as root without the allowlist layer. The point of this whole architecture is that claude itself is untrusted. It’s a language model, not a sysadmin. The allowlist is what makes the agent safe — strip it and you have an LLM with root access and no guardrails.

API costs add up on noisy servers. If your server is already struggling (constant OOM kills, bad disk), the state collector will produce large log dumps and you’ll burn tokens fast. Add a maximum line cap to the log collection section and consider a "health gate" — only call Claude if at least one anomaly is detected in the state output.
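
A health gate can be as small as one awk filter over the snapshot. The sketch below treats any content under the normally-empty sections, or any disk line above 70%, as an anomaly. has_anomaly is a hypothetical helper and the section names are the ones printed by collect-state.sh.

```shell
#!/usr/bin/env bash
# Hypothetical health gate: exits 0 if the snapshot on stdin shows an anomaly,
# 1 if everything that should be empty is empty.
has_anomaly() {
  awk '
    /^=== / { section = $0; next }
    # Disk section always has a header row, so look at the percentage column
    section ~ /DISK USAGE/ && $2 + 0 > 70 { found = 1 }
    # These sections are empty (or "none") on a healthy box
    section ~ /FAILED SYSTEMD|DOCKER|SSL CERT|HIGH-SEVERITY/ && NF && $0 != "none" { found = 1 }
    END { exit !found }
  '
}

# In run-agent.sh, right after the snapshot is collected:
# if ! has_anomaly <<<"$STATE"; then log "Healthy, skipping Claude call"; exit 0; fi
```

On a healthy server this means most runs never call the API at all.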

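Token usage on the Claude CLI side. By default, claude -p in non-interactive mode will still read a CLAUDE.md if one is in the working directory, and its contents add to the tokens you pay for on every run. The reliable fix is to run the agent from a dedicated directory without a CLAUDE.md (for example, cd /opt/homeagent at the top of run-agent.sh). If you would rather use a flag, check claude --help first: --no-config is not guaranteed to exist in your version.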

Certificate renewal needs certbot configured correctly first. The agent can trigger certbot renew, but if certbot isn’t set up (webroot, DNS plugin, etc.), the renewal will fail silently. Test renewal manually before delegating it.

Don’t put secrets in a crontab. If ALERT_EMAIL or an API key needs to be in the environment, use EnvironmentFile=/etc/homeagent.env (mode 0600, owned by root) with the systemd service approach. /etc/crontab and the files in /etc/cron.d are typically world-readable, and a variable set in a crontab is passed to every job in that file.


Production-Ready Additions

Idempotency tracking. Before executing an action, check if the same action was executed in the last N minutes. Avoid restarting nginx 4 times in an hour if it keeps crashing — that’s a symptom, not a fix. Add a simple state file: echo "$type:$target:$(date +%s)" >> /var/log/homeagent/recent-actions.
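
A cooldown check over that state file could look like the sketch below. RECENT_ACTIONS, COOLDOWN_SECS, recently_done, and record_action are all hypothetical names, and the check assumes targets do not contain a colon.

```shell
#!/usr/bin/env bash
# Hypothetical cooldown check built on the type:target:epoch state file.
RECENT_ACTIONS="${RECENT_ACTIONS:-/var/log/homeagent/recent-actions}"
COOLDOWN_SECS="${COOLDOWN_SECS:-3600}"

# Succeeds if the same type+target was recorded within the cooldown window.
recently_done() {
  local type="$1" target="$2" now
  now=$(date +%s)
  [[ -f "$RECENT_ACTIONS" ]] || return 1
  awk -F: -v t="$type" -v g="$target" -v now="$now" -v win="$COOLDOWN_SECS" '
    $1 == t && $2 == g && (now - $NF) < win { hit = 1 }
    END { exit !hit }
  ' "$RECENT_ACTIONS"
}

# Appends one line per executed action.
record_action() {
  echo "$1:$2:$(date +%s)" >> "$RECENT_ACTIONS"
}

# In the executor loop, before execute_action:
# if recently_done "$type" "$target"; then log "SKIPPED (cooldown): $type $target"; continue; fi
```

A skipped action is worth logging, because a service that keeps hitting the cooldown is exactly the case that needs a human.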

Dry-run mode. Run the agent with DRY_RUN=1 to log what it would do without executing. Useful for testing new allowlist entries:

# In execute-actions.sh, wrap execute_action calls:
if [[ "${DRY_RUN:-0}" == "1" ]]; then
  log "DRY-RUN: would execute type=$type target=$target"
else
  execute_action "$type" "$target" "$reason"
fi

Separate alert escalation. If the agent produces ALERT_ONLY three runs in a row for the same reason, that’s an escalation — email or push notification. Track alert frequency in a simple counter file.
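
The counter file can be a single line holding the current streak and the reason it belongs to. A rough sketch follows; ALERT_STATE and bump_alert_streak are hypothetical names. It compares reasons as exact strings, which is a simplification, since the model may word the same issue differently from run to run.

```shell
#!/usr/bin/env bash
# Hypothetical escalation counter. The state file holds "<count> <reason>"
# for the most recent ALERT_ONLY reason.
ALERT_STATE="${ALERT_STATE:-/var/log/homeagent/alert-streak}"

# Records one ALERT_ONLY and prints the current streak for that reason.
bump_alert_streak() {
  local reason="$1" count=0 last=""
  if [[ -f "$ALERT_STATE" ]]; then
    read -r count last < "$ALERT_STATE" || true
  fi
  if [[ "$last" == "$reason" ]]; then
    count=$((count + 1))
  else
    count=1   # different reason: start a new streak
  fi
  printf '%s %s\n' "$count" "$reason" > "$ALERT_STATE"
  echo "$count"
}

# In the ALERT_ONLY branch:
# streak=$(bump_alert_streak "$reason"); (( streak >= 3 )) && log "ESCALATE: $reason"
```

For "in a row" to hold, a run that produces no ALERT_ONLY should delete the state file.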

Scope the allowlist per host. If you have multiple servers sharing the same agent code (via a Git repo), use hostname-specific allowlist files: allowlist.$(hostname -s).conf with a fallback to the default. Different machines have different risk tolerances.
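
Picking the file is a few lines of bash. In this sketch, pick_allowlist is a hypothetical helper; the directory and hostname arguments exist mainly so you can test it without touching /opt/homeagent.

```shell
#!/usr/bin/env bash
# Hypothetical selector: prefer allowlist.<short-hostname>.conf and fall back
# to allowlist.conf in the same directory.
pick_allowlist() {
  local dir="${1:-${AGENT_DIR:-/opt/homeagent}}"
  local host="${2:-$(hostname -s)}"
  local candidate="$dir/allowlist.$host.conf"
  if [[ -f "$candidate" ]]; then
    echo "$candidate"
  else
    echo "$dir/allowlist.conf"
  fi
}

# In execute-actions.sh this would replace the fixed path:
# ALLOWLIST="$(pick_allowlist)"
```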


What This Doesn’t Replace

This agent handles operational maintenance — keeping healthy services healthy. It’s not a replacement for a proper backup strategy, real monitoring (Prometheus + Alertmanager is still worth running), or security hardening. It’s also not magic: if your server has a deep problem, the agent will correctly emit ALERT_ONLY and stop, because that’s what you told it to do.

The value is in the middle ground: the hundred boring things that don’t need human judgment but do need to happen. Log rotation, container restarts, cert renewals, security patches. Stuff that eats 20 minutes of your weekend when it stacks up. Let the agent handle that layer, and save your attention for the actual problems.


Reviewing the Audit Trail

# What did the agent do recently? (last 50 matching entries, not a strict 24-hour window)
grep "EXECUTING\|REFUSED\|ALERT" /var/log/homeagent/audit.log | tail -50

# How many times did it refuse vs execute?
grep -c "REFUSED" /var/log/homeagent/audit.log
grep -c "EXECUTING" /var/log/homeagent/audit.log

# Full run log
tail -f /var/log/homeagent/runs.log

The refusal count matters. If the agent is generating a lot of REFUSED entries, your state collector is surfacing issues the allowlist can’t handle — either expand the allowlist carefully, or add more ALERT_ONLY patterns to your system prompt. A high refusal rate is a signal, not a failure.
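
To watch that rate over time, the two grep counts can be folded into one number. A minimal sketch, assuming the log format written by execute-actions.sh; refusal_rate is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Hypothetical summary: share of agent proposals that the allowlist refused,
# computed from the EXECUTING / REFUSED markers in the audit log.
refusal_rate() {
  awk '
    / EXECUTING: / { ran++ }
    / REFUSED: /   { refused++ }
    END {
      total = ran + refused
      if (total == 0) { print "no actions logged"; exit }
      printf "%d refused of %d (%.0f%%)\n", refused, total, 100 * refused / total
    }
  ' "${1:-/var/log/homeagent/audit.log}"
}

# Usage: refusal_rate            (default audit log)
#        refusal_rate /path/to/rotated-audit.log
```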


Final Thoughts

The agent is useful precisely because it’s constrained. An unconstrained LLM with root access is a liability. This architecture inverts that: the model is free to reason over whatever it sees, but its outputs are filtered through an explicit, human-readable allowlist before anything touches the system.

You get the reasoning capability of a language model for pattern recognition and judgment calls, and you keep a deterministic, auditable execution layer that a human wrote and can read in five minutes. That’s the right division of responsibility — not "AI does everything," but "AI decides what to propose, humans decided in advance what proposals are acceptable."

Start with a narrow allowlist. Expand it only after you’ve read a week of audit logs and trust what you see. This thing will run while you’re asleep, on vacation, or just not paying attention — build it like it will.
