Skip to content

Agent Safety Architecture

Overview

The Agent Safety Architecture ensures that the AI Dev Agent operates safely when interacting with the development environment.

Because the agent can:

  • modify files
  • execute shell commands
  • run builds and tests
  • interact with version control

there is a risk of unintended or destructive actions.

The safety system introduces guardrails that limit potential damage while still allowing the agent to operate autonomously.

This architecture defines mechanisms for:

  • restricting dangerous operations
  • protecting repository integrity
  • preventing runaway loops
  • ensuring human oversight when necessary

Safety Principles

The safety architecture is based on several core principles.

Least Privilege

The agent should only have the minimum permissions required to complete its task.

Example:

A documentation task should not require shell execution privileges.

Explicit Boundaries

The agent must operate within well-defined boundaries.

Examples include:

  • workspace directory
  • allowed tools
  • permitted commands

Observability

All potentially dangerous operations must be logged and traceable.

Human Override

Humans must always be able to intervene and stop agent execution.


Safety Layers

The safety system consists of multiple layers.

User Configuration


Permission System


Tool Guardrails


Execution Sandbox


Runtime Safety Checks

Each layer protects against different types of failures.


Permission System

The Permission System controls what the agent is allowed to do.

Permissions are defined as capability flags.

Example:

ts
type PermissionSet = {
  filesystemRead: boolean
  filesystemWrite: boolean
  shellExecution: boolean
  gitWrite: boolean
  networkAccess: boolean
}

Different execution modes may use different permission sets.

Example modes:

ModeCapabilities
read_onlyread files only
developmentread/write repository
cirun tests and builds
fullunrestricted

This prevents unnecessary capabilities from being granted.


Workspace Isolation

The agent should only modify files within a defined workspace.

Example:

/workspace/project/

Attempts to access paths outside the workspace should be rejected.

Example blocked path:

/etc/passwd

Workspace isolation prevents accidental system modifications.


Filesystem Guardrails

Filesystem operations must be validated.

Checks include:

  • path normalization
  • workspace boundary validation
  • protected file detection

Example protected files:

.git/config
.env
secrets.json

The agent should not modify sensitive files unless explicitly allowed.


Git Safety

Version control operations require special safeguards.

Key protections include:

Protected Branches

The agent should not push directly to protected branches.

Example:

main
production

Instead, the agent should create feature branches.

Example:

agent/cache-implementation

Commit Review

Optionally, commits may require human approval before merging.

Change Limits

Large diffs may require additional validation.

Example:

Maximum files changed: 50

This prevents large accidental modifications.


Shell Command Restrictions

Shell execution is one of the most dangerous capabilities.

The system should enforce restrictions.

Examples:

Allowed commands:

npm test
bun test
cargo build

Blocked commands:

rm -rf /
shutdown
reboot

Command validation may use:

  • allowlists
  • blocklists
  • command pattern matching

Execution Sandboxing

Some operations should run in isolated environments.

Examples include:

  • running tests
  • executing build scripts
  • running user code

Sandboxing methods may include:

  • containers
  • restricted shells
  • temporary directories

This limits the impact of unsafe code execution.


Iteration Limits

Agent loops must have limits to prevent runaway execution.

Example configuration:

maxIterations: 50
maxRuntime: 10 minutes

If limits are exceeded, execution should stop.

The agent may then:

  • summarize progress
  • request human assistance

Failure Detection

The system should detect repeated failures.

Example signals:

  • repeated test failures
  • repeated build failures
  • repeated tool errors

Example policy:

If the same failure occurs 3 times,
pause execution and request review.

This prevents infinite error loops.


Destructive Action Detection

Some actions may have irreversible consequences.

Examples include:

  • deleting large numbers of files
  • force-pushing Git branches
  • overwriting configuration files

The system should detect these actions and require confirmation.

Example:

Deleting 200 files detected.
Human approval required.

Rate Limiting

The system may limit the frequency of certain operations.

Examples:

  • tool execution
  • repository modifications
  • external API requests

Rate limiting helps prevent runaway automation.


Logging and Auditing

All actions performed by the agent must be logged.

Important log fields include:

  • timestamp
  • tool invoked
  • input parameters
  • result
  • error messages

Logs allow developers to audit agent behavior.

Example log entry:

[2026-03-09 12:14:03]

tool: write_file
path: src/cache/CacheClient.ts
result: success

Human Approval Workflow

Some operations may require explicit approval.

Examples include:

  • large refactors
  • dependency upgrades
  • production configuration changes

Example workflow:

Agent proposes change

Human review

Approval granted

Agent executes change

This maintains human control over high-risk operations.


Emergency Stop

The system must support immediate termination of agent execution.

Example triggers:

  • user interrupt
  • safety violation
  • resource exhaustion

An emergency stop should immediately halt:

  • tool execution
  • agent loop iterations
  • background processes

Observability

The safety system should expose metrics and alerts.

Examples include:

  • number of blocked actions
  • permission violations
  • safety-triggered pauses

These metrics help monitor system health.


Future Improvements

Possible enhancements include:

  • machine-learned risk detection
  • dynamic permission adjustment
  • anomaly detection in agent behavior
  • collaborative human-agent review systems

These improvements may further reduce operational risk.


Summary

The Agent Safety Architecture protects the development environment from unintended actions by the AI Dev Agent.

Key mechanisms include:

  • permission control
  • workspace isolation
  • filesystem guardrails
  • Git safety policies
  • shell command restrictions
  • sandboxed execution
  • iteration limits
  • human approval workflows

By combining these layers, the system allows autonomous development while maintaining safety and control.

Autonomous Engineer Documentation