Agent Safety Architecture

Overview

The Agent Safety Architecture ensures that the AI Dev Agent operates safely when interacting with the development environment.

Because the agent can:

modify files
execute shell commands
run builds and tests
interact with version control

there is a risk of unintended or destructive actions.

The safety system introduces guardrails that limit potential damage while still allowing the agent to operate autonomously.

This architecture defines mechanisms for:

restricting dangerous operations
protecting repository integrity
preventing runaway loops
ensuring human oversight when necessary

Safety Principles

The safety architecture is based on several core principles.

Least Privilege

The agent should only have the minimum permissions required to complete its task.

Example:

A documentation task should not require shell execution privileges.

Explicit Boundaries

The agent must operate within well-defined boundaries.

Examples include:

workspace directory
allowed tools
permitted commands

Observability

All potentially dangerous operations must be logged and traceable.

Human Override

Humans must always be able to intervene and stop agent execution.

Safety Layers

The safety system consists of multiple layers.

User Configuration
│
▼
Permission System
│
▼
Tool Guardrails
│
▼
Execution Sandbox
│
▼
Runtime Safety Checks

Each layer protects against different types of failures.

Permission System

The Permission System controls what the agent is allowed to do.

Permissions are defined as capability flags.

Example:

type PermissionSet = {
  filesystemRead: boolean
  filesystemWrite: boolean
  shellExecution: boolean
  gitWrite: boolean
  networkAccess: boolean
}

Different execution modes may use different permission sets.

Example modes:

Mode	Capabilities
read_only	read files only
development	read/write repository
ci	run tests and builds
full	unrestricted

This prevents unnecessary capabilities from being granted.

Workspace Isolation

The agent should only modify files within a defined workspace.

Example:

/workspace/project/

Attempts to access paths outside the workspace should be rejected.

Example blocked path:

/etc/passwd

Workspace isolation prevents accidental system modifications.

Filesystem Guardrails

Filesystem operations must be validated.

Checks include:

path normalization
workspace boundary validation
protected file detection

Example protected files:

.git/config
.env
secrets.json

The agent should not modify sensitive files unless explicitly allowed.

Git Safety

Version control operations require special safeguards.

Key protections include:

Protected Branches

The agent should not push directly to protected branches.

Example:

main
production

Instead, the agent should create feature branches.

Example:

agent/cache-implementation

Commit Review

Optionally, commits may require human approval before merging.

Change Limits

Large diffs may require additional validation.

Example:

Maximum files changed: 50

This prevents large accidental modifications.

Shell Command Restrictions

Shell execution is one of the most dangerous capabilities.

The system should enforce restrictions.

Examples:

Allowed commands:

npm test
bun test
cargo build

Blocked commands:

rm -rf /
shutdown
reboot

Command validation may use:

allowlists
blocklists
command pattern matching

Execution Sandboxing

Some operations should run in isolated environments.

Examples include:

running tests
executing build scripts
running user code

Sandboxing methods may include:

containers
restricted shells
temporary directories

This limits the impact of unsafe code execution.

Iteration Limits

Agent loops must have limits to prevent runaway execution.

Example configuration:

maxIterations: 50
maxRuntime: 10 minutes

If limits are exceeded, execution should stop.

The agent may then:

summarize progress
request human assistance

Failure Detection

The system should detect repeated failures.

Example signals:

repeated test failures
repeated build failures
repeated tool errors

Example policy:

If the same failure occurs 3 times,
pause execution and request review.

This prevents infinite error loops.

Destructive Action Detection

Some actions may have irreversible consequences.

Examples include:

deleting large numbers of files
force-pushing Git branches
overwriting configuration files

The system should detect these actions and require confirmation.

Example:

Deleting 200 files detected.
Human approval required.

Rate Limiting

The system may limit the frequency of certain operations.

Examples:

tool execution
repository modifications
external API requests

Rate limiting helps prevent runaway automation.

Logging and Auditing

All actions performed by the agent must be logged.

Important log fields include:

timestamp
tool invoked
input parameters
result
error messages

Logs allow developers to audit agent behavior.

Example log entry:

[2026-03-09 12:14:03]

tool: write_file
path: src/cache/CacheClient.ts
result: success

Human Approval Workflow

Some operations may require explicit approval.

Examples include:

large refactors
dependency upgrades
production configuration changes

Example workflow:

Agent proposes change
   ↓
Human review
   ↓
Approval granted
   ↓
Agent executes change

This maintains human control over high-risk operations.

Emergency Stop

The system must support immediate termination of agent execution.

Example triggers:

user interrupt
safety violation
resource exhaustion

An emergency stop should immediately halt:

tool execution
agent loop iterations
background processes

Observability

The safety system should expose metrics and alerts.

Examples include:

number of blocked actions
permission violations
safety-triggered pauses

These metrics help monitor system health.

Future Improvements

Possible enhancements include:

machine-learned risk detection
dynamic permission adjustment
anomaly detection in agent behavior
collaborative human-agent review systems

These improvements may further reduce operational risk.

Summary

The Agent Safety Architecture protects the development environment from unintended actions by the AI Dev Agent.

Key mechanisms include:

permission control
workspace isolation
filesystem guardrails
Git safety policies
shell command restrictions
sandboxed execution
iteration limits
human approval workflows

By combining these layers, the system allows autonomous development while maintaining safety and control.

Agent Safety Architecture ​

Overview ​

Safety Principles ​

Least Privilege ​

Explicit Boundaries ​

Observability ​

Human Override ​

Safety Layers ​

Permission System ​

Workspace Isolation ​

Filesystem Guardrails ​

Git Safety ​

Protected Branches ​

Commit Review ​

Change Limits ​

Shell Command Restrictions ​

Execution Sandboxing ​

Iteration Limits ​

Failure Detection ​

Destructive Action Detection ​

Rate Limiting ​

Logging and Auditing ​

Human Approval Workflow ​

Emergency Stop ​

Observability ​

Future Improvements ​

Summary ​

Agent Safety Architecture

Overview

Safety Principles

Least Privilege

Explicit Boundaries

Observability

Human Override

Safety Layers

Permission System

Workspace Isolation

Filesystem Guardrails

Git Safety

Protected Branches

Commit Review

Change Limits

Shell Command Restrictions

Execution Sandboxing

Iteration Limits

Failure Detection

Destructive Action Detection

Rate Limiting

Logging and Auditing

Human Approval Workflow

Emergency Stop

Observability

Future Improvements

Summary