
Guardrails for AI coding agents: what should never run

May 19, 2026 · 7 min

An AI coding agent is a process on your machine that runs shell commands. Almost all of them are fine. This piece is about the machinery for the ones that are not: what a real block rule looks like, the order rules must run in, and the design mistakes we made on the way.

A block rule is a regex, a reason, and a way out

Guard's rules live in one versioned file, guard/rules.json. Here is G-007, verbatim:

{
  "id": "G-007",
  "pattern": "git push.*--force([[:space:]]|$)",
  "category": "history-destruction",
  "description": "Force push overwrites remote history",
  "alternative": "git push --force-with-lease (safer, fails if remote changed)"
}

Three design decisions are visible in that rule. The pattern requires --force to be followed by whitespace or the end of the command, so --force-with-lease, the safe variant, does not match its own block rule. The category exists so audit logs can be aggregated. And the alternative field is mandatory culture: a guardrail that only says no trains the agent, and the human, to look for ways around it. Guard's blocks are deny-and-continue: the agent is told why and told what to do instead, and the sprint keeps moving.
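The boundary behavior is easy to check by hand. What follows is a minimal Python sketch, not Guard's code: the rule dict mirrors G-007, but [[:space:]] is swapped for \s because Python's re module has no POSIX bracket classes, and the check helper and its message format are hypothetical.

```python
import re

# G-007 as quoted above, with [[:space:]] translated to \s.
# Assumption: \s is a close enough stand-in for the POSIX class here.
G007 = {
    "id": "G-007",
    "pattern": r"git push.*--force(\s|$)",
    "category": "history-destruction",
    "description": "Force push overwrites remote history",
    "alternative": "git push --force-with-lease (safer, fails if remote changed)",
}


def check(command, rule=G007):
    """Return a denial message if the rule matches, else None (hypothetical helper)."""
    if re.search(rule["pattern"], command):
        # Deny-and-continue: say why, then say what to do instead.
        return (
            f"BLOCKED [{rule['id']}] {rule['description']}\n"
            f"Try instead: {rule['alternative']}"
        )
    return None


print(check("git push origin main --force"))  # prints the denial message
print(check("git push --force-with-lease"))   # prints None: the safe variant passes
```

A command that carries both flags, such as git push --force-with-lease --force, still matches, because the trailing --force is followed by the end of the string.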

There are 36 block rules and 9 warn rules as of this writing, spread across categories you can probably guess: mass deletion (G-001 starts at rm -rf /), database destruction (G-014, DROP TABLE), remote code execution, history destruction, safety bypass (G-027, --no-verify), and secrets access. The counts will drift; the file in the repo is the source of truth, on purpose. Our own docs are not allowed to hardcode the number.

Order is the security property

Guard evaluates in tiers: block rules, then an allowlist of boring commands (git status, ls, jq, about thirty others), then warn rules. The ordering sounds like an implementation detail. It is the whole game, and we learned it the hard way: an earlier version consulted the allowlist first, and the allowlist contained git and cat. A safe binary with a dangerous argument (git push --force, cat credentials.json) sailed through on the binary's reputation. A security review caught it; the fix made global block rules run before any allowlist short-circuit, and we now treat "blocks run first" as a frozen invariant with its own regression tests.
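The difference between the two orderings fits in a few lines. This is a sketch under assumptions, not Guard's evaluator: the rule lists are trimmed to one entry each, the warn rule W-EXAMPLE is invented, allowlist matching by command prefix is a guess at the shape, and the "unmatched" result stands in for a default the article does not specify.

```python
import re

BLOCK_RULES = [("G-007", r"git push.*--force(\s|$)")]
WARN_RULES = [("W-EXAMPLE", r"chmod -R")]   # invented id and pattern
ALLOWLIST = ("git status", "ls", "jq")      # subset named in the article
OLD_ALLOWLIST = ("git", "cat")              # the earlier, too-broad entries


def first_match(rules, command):
    """Return the id of the first rule whose pattern matches, else None."""
    for rule_id, pattern in rules:
        if re.search(pattern, command):
            return rule_id
    return None


def allowlisted(command, allowlist):
    """Assumed semantics: the command equals an entry or starts with it."""
    return any(
        command == entry or command.startswith(entry + " ")
        for entry in allowlist
    )


def evaluate(command, allowlist=ALLOWLIST):
    """Blocks first, then the allowlist, then warns."""
    if first_match(BLOCK_RULES, command):
        return "block"
    if allowlisted(command, allowlist):
        return "allow"
    if first_match(WARN_RULES, command):
        return "warn"
    return "unmatched"  # default behavior is not stated in the article


def evaluate_allowlist_first(command, allowlist):
    """The earlier ordering: the allowlist short-circuits before any block rule."""
    if allowlisted(command, allowlist):
        return "allow"
    return evaluate(command, allowlist)


print(evaluate_allowlist_first("git push --force", OLD_ALLOWLIST))  # allow
print(evaluate("git push --force", OLD_ALLOWLIST))                  # block
```

With the same rules and the same too-broad allowlist, only the order of the checks changes the outcome, which is why a regression test on the ordering itself is worth having.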

The rule nobody thinks they need

The most interesting category is not deletion. It is G-035, secrets access, which blocks read commands (cat, grep, jq, even vim) when the target basename looks like a credential file: service-account*.json, firebase-adminsdk*.json, aws_credentials, client_secret*.json.

Why block a read? Because for an agent, reading is publishing. Whatever the agent reads enters the conversation transcript, and the transcript is the least protected place that secret will ever sit: it gets logged, summarized, sometimes sent to telemetry or pasted into an issue. The bytes did not leave your machine when the agent read them. They left when the transcript did. Example templates (credentials.example.json, .env.example) stay readable, because the goal is stopping leaks, not stopping work.
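A basename check of this kind could look roughly like the sketch below. The glob patterns and the read commands are the ones named above; everything else is an assumption, including the function names, the use of shlex to split the command, and the rule that any basename containing ".example" is treated as a template.

```python
import fnmatch
import os
import shlex

# Patterns and commands as listed in the article (likely not the full set).
SECRET_BASENAMES = [
    "service-account*.json",
    "firebase-adminsdk*.json",
    "aws_credentials",
    "client_secret*.json",
]
READ_COMMANDS = {"cat", "grep", "jq", "vim"}


def looks_like_secret(path):
    """Match on the basename only, so the directory does not matter."""
    name = os.path.basename(path)
    if ".example" in name:  # assumed exemption for example templates
        return False
    return any(fnmatch.fnmatchcase(name, pat) for pat in SECRET_BASENAMES)


def blocks_read(command):
    """True if a read command targets something that looks like a credential file."""
    parts = shlex.split(command)
    if not parts or parts[0] not in READ_COMMANDS:
        return False
    return any(looks_like_secret(arg) for arg in parts[1:])


print(blocks_read("cat keys/service-account-prod.json"))   # True
print(blocks_read("jq . client_secret.example.json"))      # False
```

The file name client_secret.example.json is made up for the demonstration: it matches a credential glob but is let through by the assumed template exemption.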

How the block actually lands

On hosts with hook support, Guard runs as a PreToolUse hook: the host calls check-dangerous.sh with the command before executing it. On a match, the hook denies, echoes the offending command back (so an injected command incriminates itself in the audit log), names the rule, and prints the alternative. File writes go through a second hook, check-write.sh, with its own narrow denylist: protected paths and credential basenames, after resolving symlinks, so a link planted at ./notes.json pointing at ~/.aws/credentials does not walk around the rule.
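The symlink step is the part most easily gotten wrong, so here is a small Python sketch of the idea. It is an illustration only: write_denied and demo are hypothetical names, the protected directory is a throwaway stand-in for ~/.aws, and the real check-write.sh may resolve and compare paths differently. It assumes a POSIX system where creating symlinks is permitted.

```python
import os
import tempfile


def write_denied(path, protected_dirs):
    """True if path, after resolving symlinks, lands inside a protected directory."""
    real = os.path.realpath(path)
    for directory in protected_dirs:
        real_dir = os.path.realpath(directory)
        # commonpath equals the protected directory only when real is inside it.
        if os.path.commonpath([real, real_dir]) == real_dir:
            return True
    return False


def demo():
    """Plant a link named notes.json that points into a protected directory."""
    with tempfile.TemporaryDirectory() as root:
        protected = os.path.join(root, ".aws")
        os.mkdir(protected)
        target = os.path.join(protected, "credentials")
        open(target, "w").close()

        link = os.path.join(root, "notes.json")
        os.symlink(target, link)

        plain = os.path.join(root, "todo.json")  # ordinary path, never created
        return write_denied(link, [protected]), write_denied(plain, [protected])


print(demo())  # (True, False)
```

Checking the literal string notes.json would have passed the write; checking the resolved path is what catches it.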

The honest caveat, stated plainly: the hard block exists where the host supports hooks. Claude Code does. On Cursor, OpenAI Codex, OpenCode, and Gemini CLI, the same rules ship as guided instructions the agent is told to follow, which is materially weaker. Each adapter's file in the repo says which behavior you get. If a tool tells you its guardrails work identically everywhere, ask to see the enforcement.

What this buys, and what it does not

Guard is not a sandbox, and it does not make an agent safe to run unattended against production. What it does is convert the most expensive failure modes, the irreversible ones, from "the model decided not to" into "a regex decided it cannot". Models are probabilistic; under context pressure they renegotiate everything, including your safety instructions. A rules file does not renegotiate. That asymmetry is the entire reason to put controls at the action layer instead of in the prompt.
