AICTX

Repo-local continuity runtime for coding agents

Demo

This demo shows AICTX as a repo-local continuity runtime for coding agents.

It is not a synthetic benchmark whose only goal is to minimize total token count. The demo is designed to make continuity visible, repeatable, and inspectable: Work State, resume capsules, execution contracts, contract compliance, RepoMap hints, Failure Memory, and real usage telemetry.

Current documented target: AICTX 5.3.x.


What this demo is meant to prove

AICTX should help a coding agent start with better operational context instead of repeatedly rediscovering the repository from scratch.

The expected signal is better continuity, not just fewer tokens: the demo should be evaluated as a continuity and reliability demo first, and as a token-efficiency demo second.


What this demo does not prove

This demo does not claim that AICTX always reduces total tokens.

In the measured demo runs, AICTX produced useful continuity signals, but total token usage could increase when the agent spent extra context on runtime orientation, generated artifacts, instrumentation, or benchmark-specific prompt text.

That is an important result, not a failure:

AICTX is not primarily a raw token compressor.
AICTX is a continuity layer that makes state, next actions, failures, and execution evidence reusable across coding-agent sessions.

The right question is not only “did total tokens go down?” but also: did the next session start with more useful state, fewer blind spots, and a clearer operational route?

Demo conclusions from the current analysis

The current demo analysis showed a consistent but nuanced signal.

Positive signal

AICTX improved the first session in practical terms: slightly fewer files explored and edited, less output, and a modestly faster completion, despite higher total context.

AICTX also made the second session more focused: half the files explored, one third the files edited, roughly half the commands, and a start directly at the relevant parser test.

Caveat

Overall token and uncached-token usage may still increase if the agent consumes too much AICTX runtime detail, benchmark schema, or instrumentation text.

This is especially visible when the user prompt passed to the agent contains more than the task goal: reporting requirements, metrics instructions, JSON schema, demo instrumentation, or analysis rules.

The demo therefore separates the task goal from benchmark instrumentation.


Measured demo data

The following tables summarize the measured demo run stored in the external demo repository.

Within that external repo, the measured artifacts live under:

.demo_metrics/with_aictx/
.demo_metrics/without_aictx/

The comparison uses the same model family recorded by Codex for the measured sessions: gpt-5.4.

Important interpretation rule:

The data below is evidence from one controlled demo pair.
It is useful as a directional product signal, not as a universal benchmark claim.

Operational footprint

| Session | Metric | Without AICTX | With AICTX | Delta with AICTX |
| --- | --- | --- | --- | --- |
| Session 1 | Files explored | 9 | 8 | -1 (-11.1%) |
| Session 1 | Files edited | 6 | 5 | -1 (-16.7%) |
| Session 1 | Commands run | 11 | 11 | 0 (0.0%) |
| Session 1 | Test commands | 2 | 2 | 0 (0.0%) |
| Session 1 | Time to complete | 1’43’’ | 1’34’’ | -9s (-8.7%) |
| Session 2 | Files explored | 10 | 5 | -5 (-50.0%) |
| Session 2 | Files edited | 3 | 1 | -2 (-66.7%) |
| Session 2 | Commands run | 15 | 8 | -7 (-46.7%) |
| Session 2 | Test commands | 4 | 1 | -3 (-75.0%) |
| Session 2 | Exploration steps before first edit | 15 | 6 | -9 (-60.0%) |
| Session 2 | Time to complete | 1’59’’ | 1’12’’ | -47s (-39.5%) |
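The Session 2 deltas above can be recomputed directly from the raw counts. The helper below is a generic sketch for checking the table arithmetic, not part of the AICTX tooling; the M’SS’’ time notation is parsed as shown (straight quotes used in code).

```python
def delta(before: float, after: float) -> tuple[float, float]:
    """Absolute and percentage change going from the baseline run to the AICTX run."""
    change = after - before
    return change, 100.0 * change / before

def mmss_to_seconds(text: str) -> int:
    """Parse the table's M'SS'' time notation into seconds."""
    minutes, seconds = text.replace("''", "").split("'")
    return int(minutes) * 60 + int(seconds)

files_explored = delta(10, 5)   # -> (-5, -50.0)
commands_run = delta(15, 8)     # -46.7% when rounded to one decimal
time_saved = delta(mmss_to_seconds("1'59''"), mmss_to_seconds("1'12''"))  # -47 seconds

print(files_explored, commands_run, time_saved)
```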

Session 1 shows a modest improvement, not a dramatic one. That is expected: the first run is where AICTX itself has to establish continuity, and the agent still needs to understand the task.

Session 2 is the stronger signal. The AICTX run explored half as many files, edited one third as many files, ran fewer commands, reached the relevant parser test directly, and finished faster.

First relevant file and first edit

| Session | Run | First relevant file | First edited file | Interpretation |
| --- | --- | --- | --- | --- |
| Session 2 | Without AICTX | README.md | src/taskflow/parser.py | The agent started from broader rediscovery and then moved into implementation. |
| Session 2 | With AICTX | tests/test_parser.py | tests/test_parser.py | The agent started directly at the validation surface for the pending parser edge cases. |

This is the most important qualitative result of the demo. AICTX did not merely reduce a count; it changed where the second session began.

Token usage, session totals

These token numbers come from Codex token_count events captured in the session JSONL and summarized in the external .demo_metrics dataset linked above.

| Session | Run | Input tokens | Cached input | Uncached input | Output tokens | Total tokens |
| --- | --- | --- | --- | --- | --- | --- |
| Session 1 | Without AICTX | 192,685 | 156,672 | 36,013 | 3,958 | 196,643 |
| Session 1 | With AICTX | 243,972 | 207,360 | 36,612 | 3,523 | 247,495 |
| Session 2 | Without AICTX | 290,222 | 257,152 | 33,070 | 5,935 | 296,157 |
| Session 2 | With AICTX | 205,772 | 173,568 | 32,204 | 2,698 | 208,470 |
| Combined | Without AICTX | 482,907 | 413,824 | 69,083 | 9,893 | 492,800 |
| Combined | With AICTX | 449,744 | 380,928 | 68,816 | 6,221 | 455,965 |

Token deltas

| Scope | Metric | Delta with AICTX | Interpretation |
| --- | --- | --- | --- |
| Session 1 | Total tokens | +50,852 (+25.9%) | First run spent more total context with AICTX. |
| Session 1 | Uncached input | +599 (+1.7%) | Uncached input was almost flat despite higher total/cached context. |
| Session 1 | Output tokens | -435 (-11.0%) | The AICTX run produced less output. |
| Session 2 | Total tokens | -87,687 (-29.6%) | The continuity run was materially cheaper in total tokens. |
| Session 2 | Uncached input | -866 (-2.6%) | Uncached input improved slightly. |
| Session 2 | Output tokens | -3,237 (-54.5%) | The second AICTX run required much less generated output. |
| Combined | Total tokens | -36,835 (-7.5%) | Across both sessions, AICTX used fewer total tokens. |
| Combined | Uncached input | -267 (-0.4%) | Uncached input was essentially flat overall. |
| Combined | Output tokens | -3,672 (-37.1%) | The clearest token reduction was in output verbosity. |

The token data should be read carefully. The first AICTX session used more total tokens, mostly because the agent had more runtime and continuity context available. The second AICTX session was substantially more focused. Across the two-session demo, AICTX reduced total tokens by about 7.5%, but the stronger claim is not raw token compression. The stronger claim is better continuity: fewer files explored, fewer files edited, fewer commands, and a sharper second-session starting point.
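The combined deltas above are straightforward to verify from the session totals. This is plain arithmetic on numbers copied from the tables, not AICTX output:

```python
# Combined two-session totals copied from the token usage table.
without = {"total": 492_800, "uncached": 69_083, "output": 9_893}
with_aictx = {"total": 455_965, "uncached": 68_816, "output": 6_221}

for key in without:
    change = with_aictx[key] - without[key]
    pct = 100.0 * change / without[key]
    print(f"{key}: {change:+,} ({pct:+.1f}%)")
# total: -36,835 (-7.5%)
# uncached: -267 (-0.4%)
# output: -3,672 (-37.1%)
```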

Cost reference from recorded reports

The demo metrics also include a standard API-style cost reference calculated from the recorded token usage.

| Session | Run | Credits, standard API | USD reference |
| --- | --- | --- | --- |
| Session 1 | Without AICTX | 14.5063 | $0.5803 |
| Session 1 | With AICTX | 17.8654 | $0.7146 |
| Session 2 | Without AICTX | 21.9717 | $0.8789 |
| Session 2 | With AICTX | 14.9573 | $0.5983 |
| Combined | Without AICTX | 36.4780 | $1.4591 |
| Combined | With AICTX | 32.8227 | $1.3129 |

This cost table is only a reference derived from the script’s mapped rates. It should not be presented as exact ChatGPT/Codex billing behavior.
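For transparency, the credit-to-USD mapping can be reconstructed from the recorded pairs. The rate below is inferred from the table (e.g. 14.5063 credits → $0.5803 implies roughly $0.04 per credit); it is an illustration of the reference calculation, not the script's authoritative pricing.

```python
# Inferred from the recorded credit/USD pairs in the cost table above.
USD_PER_CREDIT = 0.04

def usd_reference(credits: float) -> float:
    """Map recorded credits to the USD reference, rounded as in the table."""
    return round(credits * USD_PER_CREDIT, 4)

print(usd_reference(36.4780))  # combined baseline -> 1.4591
print(usd_reference(32.8227))  # combined AICTX run -> 1.3129
```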

Behavioral summary

| Area | Without AICTX | With AICTX | Demo signal |
| --- | --- | --- | --- |
| Session 1 orientation | Broad repo search and direct inspection | AICTX resume first, then targeted inspection | AICTX gives an explicit continuity entry point but still has first-run setup overhead. |
| Session 2 orientation | Starts from README / broad discovery | Starts from tests/test_parser.py | AICTX preserves the pending work surface. |
| Edit scope | Session 2 edits implementation and tests | Session 2 edits only parser tests | AICTX produced a narrower second-session edit. |
| Command volume | Session 2 runs 15 commands | Session 2 runs 8 commands | AICTX reduced rediscovery and verification churn. |
| Token profile | Higher Session 2 total and output tokens | Lower Session 2 total and output tokens | Continuity helps most after state exists. |

Practical conclusion from the measured demo

The measured demo supports this narrower, defensible claim:

AICTX improves second-session continuity.
In the measured two-session task, the AICTX run resumed from a more relevant file,
explored fewer files, edited fewer files, ran fewer commands, completed faster,
and used fewer total tokens across the full two-session comparison.

It does not support the broader claim that AICTX always reduces token usage on every run.

The better product framing is:

AICTX is not a generic token compressor.
AICTX is a repo-local continuity layer that can reduce rediscovery,
narrow the agent's working set, and make later sessions more focused.

Important demo protocol

To keep the demo fair and avoid contaminating the repository, follow these rules.

Keep demo metrics outside the target repo

Do not run the demo with .demo_metrics present inside the target repository under test.

Prompt files, raw metrics, comparison outputs, and benchmark notes should live outside the target repo.

For the measured example documented here, those metrics were kept in a separate external demo repository.

Recommended pattern:

/tmp/aictx-demo-runs/
  baseline/
  with-aictx/
  metrics/
  prompts/

The repository should contain only the project under test and AICTX runtime artifacts generated by normal AICTX usage.

Compare matched environments only

Only compare runs when the model and execution environment are matched.

For example, do not compare a baseline run from one model against an AICTX run from another model. If the demo is run with gpt-5.4, both baseline and AICTX runs should use gpt-5.4.

Use --task for normal agent startup

Normal supported startup should use --task, and the value should contain only the task goal:

aictx resume --repo . --task "fix parser test" --json | python3 -m json.tool

Do not pass benchmark instructions, metrics schemas, reporting rules, or final-answer requirements as the task goal.

Bad demo input:

aictx resume --repo . --task "Fix parser test. Also collect these metrics, follow this benchmark schema, output this report..." --json

Good demo input:

aictx resume --repo . --task "fix parser test" --json

In v6, --task is the startup input; benchmark and reporting instructions must stay outside the agent-facing task goal.


Global memory vs latest execution

A key architectural conclusion from the demo analysis:

The latest execution must update global repo memory.
It must not replace global repo memory.

AICTX should not become a tool that only remembers the last run.

The latest execution summary is valuable because it captures what just happened, but it should be treated as one event in a larger continuity model.

The durable continuity model should preserve global repo knowledge, observed failures, successful strategies, and structural RepoMap hints alongside the latest execution event.

The right mental model is:

Repo global knowledge
  + latest execution event
  + task-specific Work State
  + observed failures
  + successful strategies
  + structural RepoMap hints
  = useful startup continuity

The latest execution should influence the next session strongly, but it must not erase older decisions, known failures, conventions, or architectural context.
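The "update, don't replace" rule above can be sketched as a simple merge. The field names here (`known_failures`, `conventions`, `latest_execution`, `execution_history`) are illustrative, not the actual AICTX storage schema:

```python
def update_global_memory(global_memory: dict, execution: dict) -> dict:
    """Fold the latest execution into durable repo memory without erasing it."""
    merged = dict(global_memory)
    # The latest execution is recorded as the most recent event...
    merged["latest_execution"] = execution
    # ...and appended to durable history rather than overwriting it.
    merged["execution_history"] = global_memory.get("execution_history", []) + [execution]
    # New failures extend, never replace, the known-failure list.
    merged["known_failures"] = (
        global_memory.get("known_failures", []) + execution.get("new_failures", [])
    )
    return merged

memory = {
    "conventions": ["src layout"],
    "known_failures": ["TS2322 in src/app.ts"],
    "execution_history": [],
}
memory = update_global_memory(memory, {"task": "fix parser test", "new_failures": []})
assert memory["conventions"] == ["src layout"]  # older context preserved
assert memory["latest_execution"]["task"] == "fix parser test"
```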


Install and initialize

From inside the target repository:

pip install aictx
aictx install
aictx init

Optional version check:

aictx --version

Explicit initialization form:

aictx init --repo .

For a demo script that should not register extra local state, you can use:

aictx init --repo . --yes --no-register

After initialization, the normal product experience should remain:

install -> init -> use your coding agent

The user should not need to manually drive many AICTX commands during normal development.


Demo flow

1. Show startup continuity

Run:

aictx resume --repo . --task "fix parser test" --json | python3 -m json.tool

Look for:

startup_banner_text
capsule
execution_contract
contract_checks
recommended_starting_points
previous_contract_result

The important product point is that the agent receives one operational brief instead of having to inspect all AICTX internals manually.
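The fields listed above can be pulled out of the parsed resume JSON in one step. The key names come from the demo output described above, but the flat top-level nesting is an assumption; adjust to the real payload shape:

```python
# Continuity fields named in the demo output above; assumed to live at the
# top level of the parsed `aictx resume ... --json` payload.
RESUME_KEYS = [
    "startup_banner_text",
    "capsule",
    "execution_contract",
    "contract_checks",
    "recommended_starting_points",
    "previous_contract_result",
]

def operational_brief(payload: dict) -> dict:
    """Reduce a resume payload to the operational brief the agent should read."""
    return {key: payload.get(key) for key in RESUME_KEYS}
```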


2. Create visible Work State

Start a task:

aictx task start "Fix token refresh loop" --json

Update it with operational details:

aictx task update --json --json-patch '{"current_hypothesis":"refresh replay happens before persisted token update","active_files":["src/api/client.ts"],"next_action":"inspect interceptor ordering","recommended_commands":["pytest -q tests/test_auth.py"]}'
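Hand-writing the --json-patch value invites shell-quoting mistakes. A small sketch that builds the same argument programmatically with the standard library:

```python
import json
import shlex

# The same Work State patch as the command above, as a Python dict.
patch = {
    "current_hypothesis": "refresh replay happens before persisted token update",
    "active_files": ["src/api/client.ts"],
    "next_action": "inspect interceptor ordering",
    "recommended_commands": ["pytest -q tests/test_auth.py"],
}

# json.dumps guarantees valid JSON; shlex.quote makes it safe for the shell.
cmd = f"aictx task update --json --json-patch {shlex.quote(json.dumps(patch))}"
print(cmd)
```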

Inspect continuity:

aictx next

Then resume:

aictx resume --repo . --task "continue token refresh work" --json | python3 -m json.tool

Expected result:


3. Show execution contract and compliance

Normal supported agent startup should use one continuity command:

aictx resume --repo . --task "fix parser test" --json | python3 -m json.tool

Look for the compact operational route:

execution_contract.first_action
execution_contract.edit_scope
execution_contract.test_command
execution_contract.finalize_command
contract_checks

After an execution is finalized with observable files, commands, and tests, the final summary can include:

Contract: followed.

or a compact partial, violated, or not-evaluated line.

Inspect contract compliance history and aggregates:

cat .aictx/metrics/contract_compliance.jsonl 2>/dev/null || true
aictx report real-usage

Run another resume and look for the compact previous-contract signal:

aictx resume --repo . --task "next parser task" --json | python3 -m json.tool

The next resume may include:

previous_contract_result

and the Markdown capsule may include:

Previous contract: followed.

4. Show RepoMap

Install with RepoMap support:

pip install "aictx[repomap]"
aictx install --with-repomap --yes
aictx init --repo . --yes --no-register

Inspect status:

aictx map status

Query structural hints:

aictx map query "work state"
aictx map query "startup banner"
aictx map query "contract compliance"

The demo point is not that RepoMap replaces normal code search. The point is that it gives the agent a structural starting point before it spends tokens rediscovering the repo.


5. Show failure capture

Simulate a failed execution:

aictx internal run-execution --repo . --task "run typecheck" --agent-id demo --json -- python -c "import sys; print('src/app.ts(4,7): error TS2322: Type mismatch', file=sys.stderr); sys.exit(1)"

Inspect captured failure memory:

cat .aictx/failure_memory/failure_patterns.jsonl
aictx report real-usage

Expected result: the failed command, its exit status, and the extracted error pattern appear in failure memory and surface in the real-usage report.

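The diagnostic emitted by the simulated failure follows the common `file(line,col): error CODE: message` compiler format, which can be parsed into structured fields. This sketch mirrors that format; how Failure Memory actually extracts patterns may differ.

```python
import re

# Matches "file(line,col): error CODE: message", as in the simulated failure.
DIAG = re.compile(
    r"^(?P<file>.+?)\((?P<line>\d+),(?P<col>\d+)\): error (?P<code>\w+): (?P<msg>.+)$"
)

match = DIAG.match("src/app.ts(4,7): error TS2322: Type mismatch")
assert match is not None
print(match.groupdict())
# {'file': 'src/app.ts', 'line': '4', 'col': '7', 'code': 'TS2322', 'msg': 'Type mismatch'}
```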

6. Show real usage report

Run:

aictx report real-usage

Use the report to discuss continuity quality, contract compliance, and whether previously recorded failures were avoided.

Do not overclaim from one metric. Total tokens alone can be misleading when the prompt includes benchmark instrumentation.


A useful demo compares at least two matched runs.

Run A: baseline agent, no AICTX continuity.
Run B: same agent/model/environment, AICTX initialized and used through normal resume/finalize flow.

Recommended metrics:

| Metric | Why it matters |
| --- | --- |
| Time to first useful edit | Measures orientation overhead |
| Number of exploratory commands | Shows repo rediscovery cost |
| Number of files inspected | Shows whether search is focused |
| Number of files edited | Shows edit scope discipline |
| Repeated failed commands | Shows whether Failure Memory helps |
| Tests run | Shows verification behavior |
| Contract result | Shows whether the agent followed the expected route |
| Uncached tokens to useful action | Better than total tokens for startup efficiency |
| Total tokens | Useful, but not sufficient alone |
| Quality of next-session summary | Measures continuity value |

The strongest expected proof is not “AICTX always uses fewer tokens”.

The strongest expected proof is:

The next session starts with more useful state, fewer blind spots, and a clearer operational route.

How to interpret demo results

Strong result

A strong AICTX run looks like this: the session starts from the preserved work surface, explores and edits fewer files, avoids previously recorded failures, follows the execution contract, and finishes faster.

Weak result

A weak or contaminated run looks like this: the agent burns context on runtime explanation or benchmark instrumentation, the task goal is mixed with reporting rules, metrics artifacts sit inside the target repo, and total tokens rise without a continuity gain.

Product interpretation

If the run improves orientation, focus, and continuity but does not reduce total tokens, the correct conclusion is:

AICTX is already useful as an operational continuity layer.
Further work should reduce startup verbosity and separate instrumentation from task context.

Demo checklist

Before running:

- Keep .demo_metrics and all benchmark artifacts outside the target repo.
- Match the model and execution environment between baseline and AICTX runs.
- Pass only the task goal via --task.

During the run:

- Use the normal resume/finalize flow; do not hand-drive extra AICTX commands.
- Capture metrics externally, not inside the repository under test.

After the run:

- Compare matched runs on continuity metrics, not total tokens alone.
- Check contract compliance and failure-memory reuse before drawing conclusions.


Current next improvements suggested by the demo

The demo analysis points to these product improvements:

  1. Prefer --task as the documented normal startup interface.
  2. Use --task for agent-facing task goals and keep benchmark/reporting instructions out of startup input.
  3. Reduce startup verbosity so the agent receives less runtime explanation and more operational context.
  4. Make the distinction clear between latest execution summary and durable global repo memory.
  5. Ensure latest execution updates global memory instead of replacing it.
  6. Keep demo metrics outside the repo to avoid contaminating agent behavior.
  7. Report continuity quality, focus, and repeated-failure avoidance, not only total token count.

Short demo script

Use this for a compact live demo.

# Install and init
pip install aictx
aictx install
aictx init --repo . --yes --no-register

# Version
aictx --version

# Create work state
aictx task start "Fix token refresh loop" --json
aictx task update --json --json-patch '{"current_hypothesis":"refresh replay happens before persisted token update","active_files":["src/api/client.ts"],"next_action":"inspect interceptor ordering","recommended_commands":["pytest -q tests/test_auth.py"]}'

# Resume as an agent would
aictx resume --repo . --task "continue token refresh work" --json | python3 -m json.tool

# Optional RepoMap
pip install "aictx[repomap]"
aictx install --with-repomap --yes
aictx map status
aictx map query "work state"

# Failure memory demo
aictx internal run-execution --repo . --task "run typecheck" --agent-id demo --json -- python -c "import sys; print('src/app.ts(4,7): error TS2322: Type mismatch', file=sys.stderr); sys.exit(1)"
cat .aictx/failure_memory/failure_patterns.jsonl

# Usage and compliance
aictx report real-usage
cat .aictx/metrics/contract_compliance.jsonl 2>/dev/null || true

One-sentence takeaway

AICTX makes coding-agent continuity explicit, repo-local, inspectable, and reusable; the demo shows stronger orientation and session-to-session focus, while also revealing that token savings require careful separation between task context and benchmark instrumentation.