
Agent workflow checkpoint and recovery runtime

ZapTrace agent workflows can be long-running and failure-prone. The runtime checkpoint contract records enough state to pause, resume, and recover a workflow and to explain its decisions, without implicitly mutating design files.

Implemented files:

  • zaptrace/agent/workflow.py: checkpoint schema, step records, failure kinds, resume/abort/rollback recovery helpers.
  • examples/agent-runtime/checkpoint.json: example checkpoint captured after a failed validation step.
  • examples/agent-runtime/resumed-workflow-log.json: example recovery decision log.
  • tests/test_agent_workflow_runtime.py: recovery suite covering timeout, failed tool, validation-gate failure, user abort, resume, rollback, and audit recording.
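The checkpoint schema, failure kinds, and recovery helpers can be pictured roughly as follows. This is a minimal sketch, not the code in zaptrace/agent/workflow.py. The class names, field names, and enum values are hypothetical; they simply mirror the failure cases and the resume/abort/rollback actions listed above. The failure-to-action mapping in default_recovery is also an assumption, chosen only to show where such a policy would sit.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Optional


class FailureKind(str, Enum):
    # Hypothetical values mirroring the failure cases the recovery suite covers.
    TIMEOUT = "timeout"
    TOOL_FAILED = "tool_failed"
    VALIDATION_GATE_FAILED = "validation_gate_failed"
    USER_ABORT = "user_abort"


class RecoveryAction(str, Enum):
    RESUME = "resume"
    ABORT = "abort"
    ROLLBACK = "rollback"


@dataclass
class Checkpoint:
    # Hypothetical minimal schema: which steps finished and why the run stopped.
    workflow_id: str
    completed_steps: list[str] = field(default_factory=list)
    failed_step: Optional[str] = None
    failure_kind: Optional[FailureKind] = None

    def to_json(self) -> str:
        data = asdict(self)
        # Store the enum's plain string value so the JSON stays tool-neutral.
        data["failure_kind"] = self.failure_kind.value if self.failure_kind else None
        return json.dumps(data, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> Checkpoint:
        data = json.loads(text)
        kind = data.get("failure_kind")
        data["failure_kind"] = FailureKind(kind) if kind else None
        return cls(**data)


def default_recovery(cp: Checkpoint) -> RecoveryAction:
    """Hypothetical policy mapping a failure kind to a recovery action."""
    if cp.failure_kind is None:
        return RecoveryAction.RESUME
    if cp.failure_kind is FailureKind.USER_ABORT:
        return RecoveryAction.ABORT
    if cp.failure_kind is FailureKind.VALIDATION_GATE_FAILED:
        return RecoveryAction.ROLLBACK
    # Timeouts and failed tool calls: retry from the failed step.
    return RecoveryAction.RESUME
```

The JSON round trip is what would let a paused workflow be picked up by a later process. The choice of action is kept as a separate, pure function so that a recovery decision can be logged and explained without touching design files.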

Every mutating step can carry transaction_id, diff_summary, rollback_id, and rollback_available. Resume does not discard proof evidence from previous steps. Recovery decisions can emit structured audit events through the existing security policy audit model.
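The step-record fields and the two recovery guarantees above can be illustrated with a small sketch. It assumes step records shaped like the hypothetical StepRecord below. Only transaction_id, diff_summary, rollback_id, and rollback_available come from the text; the status values, proof_evidence list, helper names, and event names are illustrative. The audit sink is a plain callable standing in for the security policy audit model, whose real API is not shown here.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical stand-in for the security policy audit model: any callable
# that accepts a structured event dict.
AuditSink = Callable[[dict], None]


@dataclass
class StepRecord:
    step_id: str
    status: str  # hypothetical values: "succeeded", "failed", "pending"
    mutating: bool = False
    transaction_id: Optional[str] = None
    diff_summary: Optional[str] = None
    rollback_id: Optional[str] = None
    rollback_available: bool = False
    proof_evidence: list[str] = field(default_factory=list)


def plan_resume(steps: list[StepRecord], audit: AuditSink) -> tuple[list[StepRecord], list[str]]:
    """Return the steps still to run plus all proof evidence gathered so far.

    Completed steps are neither re-run nor modified, so their evidence survives.
    """
    remaining = [s for s in steps if s.status != "succeeded"]
    evidence = [e for s in steps for e in s.proof_evidence]
    audit({
        "event": "workflow.resume",
        "remaining": [s.step_id for s in remaining],
        "evidence_count": len(evidence),
    })
    return remaining, evidence


def plan_rollback(steps: list[StepRecord], audit: AuditSink) -> list[str]:
    """Return rollback ids for completed mutating steps, newest first."""
    rollback_ids = []
    for s in reversed(steps):
        if not (s.mutating and s.status == "succeeded"):
            continue
        if not s.rollback_available or s.rollback_id is None:
            # Record the refusal before failing, so the decision stays explainable.
            audit({"event": "workflow.rollback_refused", "step": s.step_id})
            raise RuntimeError(f"step {s.step_id} has no rollback available")
        rollback_ids.append(s.rollback_id)
    audit({"event": "workflow.rollback_planned", "rollback_ids": rollback_ids})
    return rollback_ids
```

Rolling back newest-first undoes later mutations before the earlier ones they may depend on. Refusing outright, instead of skipping a step that cannot be undone, avoids leaving a design half-reverted.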

Non-claims:

  • The checkpoint contract does not execute autonomous agents by itself.
  • Resume does not bypass ERC/DRC/DFM or release-export gates.
  • Human approval is still required for approved commit and release export operations.
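One way to make the last two non-claims hold structurally is a guard that takes no resume or override flag at all. A resumed workflow then passes through the same checks as a fresh one. This is a minimal sketch under stated assumptions. The operation names approved_commit and release_export, the gate-to-operation mapping, and authorize/GateBlocked are all hypothetical; only the gate names ERC, DRC, DFM, and release-export come from the text above.

```python
from __future__ import annotations

# Hypothetical mapping: which recorded gates each finalizing operation needs.
REQUIRED_GATES = {
    "approved_commit": ("ERC", "DRC", "DFM"),
    "release_export": ("ERC", "DRC", "DFM", "release-export"),
}

# Hypothetical: operations that also need an explicit human approval.
HUMAN_APPROVAL_REQUIRED = {"approved_commit", "release_export"}


class GateBlocked(Exception):
    """Raised when an operation may not proceed."""


def authorize(operation: str, gate_results: dict[str, bool], human_approvals: set[str]) -> None:
    """Raise GateBlocked unless every required gate passed and, where needed, a human approved.

    There is deliberately no 'resumed' or 'force' parameter, so resuming
    from a checkpoint cannot skip a gate or an approval.
    """
    missing = [g for g in REQUIRED_GATES.get(operation, ()) if not gate_results.get(g, False)]
    if missing:
        raise GateBlocked(f"{operation}: gates not passed: {', '.join(missing)}")
    if operation in HUMAN_APPROVAL_REQUIRED and operation not in human_approvals:
        raise GateBlocked(f"{operation}: human approval required")
```

A gate that was never recorded counts as not passed (gate_results.get(g, False)). As a result, a checkpoint that simply omits a gate result cannot slip past the check.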