Demo
This demo shows AICTX as a repo-local continuity runtime for coding agents.
It is not a synthetic benchmark whose only goal is to minimize total token count. The demo is designed to make continuity visible, repeatable, and inspectable: Work State, resume capsules, execution contracts, contract compliance, RepoMap hints, Failure Memory, and real usage telemetry.
Current documented target: AICTX 5.3.x.
What this demo is meant to prove
AICTX should help a coding agent start with better operational context instead of repeatedly rediscovering the repository from scratch.
The expected signal is:
- the agent gets a compact startup brief through `aictx resume`;
- the active task state survives across sessions;
- the next action, active files, known risks, and recommended commands are explicit;
- prior failures can be reused as warnings;
- RepoMap can point the agent toward relevant files and symbols;
- execution contracts make the expected route inspectable;
- finalize stores factual continuity for the next run;
- real usage reports provide evidence instead of relying on vibes.
The demo should be evaluated as a continuity and reliability demo first, and as a token-efficiency demo second.
What this demo does not prove
This demo does not claim that AICTX always reduces total tokens.
In the measured demo runs, AICTX produced useful continuity signals, but total token usage could increase when the agent spent extra context on runtime orientation, generated artifacts, instrumentation, or benchmark-specific prompt text.
That is an important result, not a failure:
AICTX is not primarily a raw token compressor.
AICTX is a continuity layer that makes state, next actions, failures, and execution evidence reusable across coding-agent sessions.
The right question is not only “did total tokens go down?” but also:
- did the agent reach the first useful edit faster?
- did it touch fewer irrelevant files?
- did it repeat fewer failed paths?
- did it preserve active task state correctly?
- did the next session know where to continue?
- did it follow the expected execution contract?
- did the final summary become useful input for future work?
Demo conclusions from the current analysis
The current demo analysis showed a consistent but nuanced signal.
Positive signal
AICTX improved the first session in practical terms:
- less time spent orienting;
- fewer exploratory commands;
- roughly flat uncached input despite higher total context in the first measured run;
- clearer initial route through the repository;
- more explicit next actions and test expectations.
AICTX also made the second session more focused:
- fewer files edited;
- better preservation of task intent;
- less risk of drifting into unrelated cleanup;
- clearer reuse of execution evidence from the previous run.
Caveat
Overall token and uncached-token usage may still increase if the agent consumes too much AICTX runtime detail, benchmark schema, or instrumentation text.
This is especially visible when the user prompt passed to the agent contains more than the task goal: reporting requirements, metrics instructions, JSON schema, demo instrumentation, or analysis rules.
The demo therefore separates the task goal from benchmark instrumentation.
Measured demo data
The following tables summarize the measured demo run stored in the external demo repository:
https://github.com/oldskultxo/aictx-demo-taskflow/tree/main/.demo_metrics
Within that external repo, the measured artifacts live under:
.demo_metrics/with_aictx/
.demo_metrics/without_aictx/
The comparison uses the same model family recorded by Codex for the measured sessions: gpt-5.4.
Important interpretation rule:
The data below is evidence from one controlled demo pair.
It is useful as a directional product signal, not as a universal benchmark claim.
Operational footprint
| Session | Metric | Without AICTX | With AICTX | Delta with AICTX |
|---|---|---|---|---|
| Session 1 | Files explored | 9 | 8 | -1 (-11.1%) |
| Session 1 | Files edited | 6 | 5 | -1 (-16.7%) |
| Session 1 | Commands run | 11 | 11 | 0 (0.0%) |
| Session 1 | Test commands | 2 | 2 | 0 (0.0%) |
| Session 1 | Time to complete | 1’43’’ | 1’34’’ | -9s (-8.7%) |
| Session 2 | Files explored | 10 | 5 | -5 (-50.0%) |
| Session 2 | Files edited | 3 | 1 | -2 (-66.7%) |
| Session 2 | Commands run | 15 | 8 | -7 (-46.7%) |
| Session 2 | Test commands | 4 | 1 | -3 (-75.0%) |
| Session 2 | Exploration steps before first edit | 15 | 6 | -9 (-60.0%) |
| Session 2 | Time to complete | 1’59’’ | 1’12’’ | -47s (-39.5%) |
Session 1 shows a modest improvement, not a dramatic one. That is expected: the first run is where AICTX itself has to establish continuity, and the agent still needs to understand the task.
Session 2 is the stronger signal. The AICTX run explored half as many files, edited one third as many files, ran fewer commands, reached the relevant parser test directly, and finished faster.
First relevant file and first edit
| Session | Run | First relevant file | First edited file | Interpretation |
|---|---|---|---|---|
| Session 2 | Without AICTX | README.md | src/taskflow/parser.py | The agent started from broader rediscovery and then moved into implementation. |
| Session 2 | With AICTX | tests/test_parser.py | tests/test_parser.py | The agent started directly at the validation surface for the pending parser edge cases. |
This is the most important qualitative result of the demo. AICTX did not merely reduce a count; it changed where the second session began.
Token usage, session totals
These token numbers come from Codex token_count events captured in the session JSONL and summarized in the external .demo_metrics dataset linked above.
| Session | Run | Input tokens | Cached input | Uncached input | Output tokens | Total tokens |
|---|---|---|---|---|---|---|
| Session 1 | Without AICTX | 192,685 | 156,672 | 36,013 | 3,958 | 196,643 |
| Session 1 | With AICTX | 243,972 | 207,360 | 36,612 | 3,523 | 247,495 |
| Session 2 | Without AICTX | 290,222 | 257,152 | 33,070 | 5,935 | 296,157 |
| Session 2 | With AICTX | 205,772 | 173,568 | 32,204 | 2,698 | 208,470 |
| Combined | Without AICTX | 482,907 | 413,824 | 69,083 | 9,893 | 492,800 |
| Combined | With AICTX | 449,744 | 380,928 | 68,816 | 6,221 | 455,965 |
Token deltas
| Scope | Metric | Delta with AICTX | Interpretation |
|---|---|---|---|
| Session 1 | Total tokens | +50,852 (+25.9%) | First run spent more total context with AICTX. |
| Session 1 | Uncached input | +599 (+1.7%) | Uncached input was almost flat despite higher total/cached context. |
| Session 1 | Output tokens | -435 (-11.0%) | The AICTX run produced less output. |
| Session 2 | Total tokens | -87,687 (-29.6%) | The continuity run was materially cheaper in total tokens. |
| Session 2 | Uncached input | -866 (-2.6%) | Uncached input improved slightly. |
| Session 2 | Output tokens | -3,237 (-54.5%) | The second AICTX run required much less generated output. |
| Combined | Total tokens | -36,835 (-7.5%) | Across both sessions, AICTX used fewer total tokens. |
| Combined | Uncached input | -267 (-0.4%) | Uncached input was essentially flat overall. |
| Combined | Output tokens | -3,672 (-37.1%) | The clearest token reduction was in output verbosity. |
The token data should be read carefully. The first AICTX session used more total tokens, mostly because the agent had more runtime and continuity context available. The second AICTX session was substantially more focused. Across the two-session demo, AICTX reduced total tokens by about 7.5%, but the stronger claim is not raw token compression. The stronger claim is better continuity: fewer files explored, fewer files edited, fewer commands, and a sharper second-session starting point.
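For reproducibility, per-run totals of this kind can be rebuilt from the raw session logs. The sketch below is a minimal, hedged example: it assumes the session JSONL contains `token_count` events and that each event exposes fields named `input_tokens`, `cached_input_tokens`, and `output_tokens`. Those names, and the example file name, are placeholders; adjust them to whatever the recorded session schema actually uses.

```python
import json
from pathlib import Path


def sum_token_counts(session_jsonl: str) -> dict:
    """Sum token_count events from a Codex-style session JSONL file.

    The event type and field names below are assumptions for illustration;
    adjust them to match the schema actually recorded in your session logs.
    """
    totals = {"input": 0, "cached_input": 0, "output": 0}
    for line in Path(session_jsonl).read_text().splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") != "token_count":  # hypothetical event-type field
            continue
        totals["input"] += event.get("input_tokens", 0)                # hypothetical field
        totals["cached_input"] += event.get("cached_input_tokens", 0)  # hypothetical field
        totals["output"] += event.get("output_tokens", 0)              # hypothetical field
    totals["uncached_input"] = totals["input"] - totals["cached_input"]
    totals["total"] = totals["input"] + totals["output"]
    return totals


print(sum_token_counts("session_with_aictx.jsonl"))  # hypothetical file name
```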
Cost reference from recorded reports
The demo metrics also include a standard API-style cost reference calculated from the recorded token usage.
| Session | Run | Credits, standard | API USD reference |
|---|---|---|---|
| Session 1 | Without AICTX | 14.5063 | $0.5803 |
| Session 1 | With AICTX | 17.8654 | $0.7146 |
| Session 2 | Without AICTX | 21.9717 | $0.8789 |
| Session 2 | With AICTX | 14.9573 | $0.5983 |
| Combined | Without AICTX | 36.4780 | $1.4591 |
| Combined | With AICTX | 32.8227 | $1.3129 |
This cost table is only a reference derived from the script’s mapped rates. It should not be presented as exact ChatGPT/Codex billing behavior.
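For transparency, a cost reference of this kind is just a linear mapping from token counts to per-token rates. The sketch below shows the shape of that calculation with placeholder rates; the numbers are illustrative assumptions only, not provider pricing, and they will not reproduce the figures in the table above.

```python
# Placeholder rates in USD per million tokens. These are illustrative
# assumptions, not actual billing rates.
PLACEHOLDER_RATE_UNCACHED_INPUT = 1.25
PLACEHOLDER_RATE_CACHED_INPUT = 0.125
PLACEHOLDER_RATE_OUTPUT = 10.0


def api_usd_reference(uncached_input: int, cached_input: int, output: int) -> float:
    """Map recorded token counts to a rough API-style USD reference figure."""
    return (
        uncached_input * PLACEHOLDER_RATE_UNCACHED_INPUT
        + cached_input * PLACEHOLDER_RATE_CACHED_INPUT
        + output * PLACEHOLDER_RATE_OUTPUT
    ) / 1_000_000
```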
Behavioral summary
| Area | Without AICTX | With AICTX | Demo signal |
|---|---|---|---|
| Session 1 orientation | Broad repo search and direct inspection | AICTX resume first, then targeted inspection | AICTX gives an explicit continuity entry point but still has first-run setup overhead. |
| Session 2 orientation | Starts from README / broad discovery | Starts from tests/test_parser.py | AICTX preserves the pending work surface. |
| Edit scope | Session 2 edits implementation and tests | Session 2 edits only parser tests | AICTX produced a narrower second-session edit. |
| Command volume | Session 2 runs 15 commands | Session 2 runs 8 commands | AICTX reduced rediscovery and verification churn. |
| Token profile | Higher Session 2 total and output tokens | Lower Session 2 total and output tokens | Continuity helps most after state exists. |
Practical conclusion from the measured demo
The measured demo supports this narrower, defensible claim:
AICTX improves second-session continuity.
In the measured two-session task, the AICTX run resumed from a more relevant file,
explored fewer files, edited fewer files, ran fewer commands, completed faster,
and used fewer total tokens across the full two-session comparison.
It does not support the broader claim that AICTX always reduces token usage on every run.
The better product framing is:
AICTX is not a generic token compressor.
AICTX is a repo-local continuity layer that can reduce rediscovery,
narrow the agent's working set, and make later sessions more focused.
Important demo protocol
To keep the demo fair and avoid contaminating the repository, follow these rules.
Keep demo metrics outside the target repo
Do not run the demo with .demo_metrics present inside the target repository under test.
Prompt files, raw metrics, comparison outputs, and benchmark notes should live outside the target repo.
For the measured example documented here, those metrics were kept in a separate external demo repository:
https://github.com/oldskultxo/aictx-demo-taskflow/tree/main/.demo_metrics
Recommended pattern:
/tmp/aictx-demo-runs/
  baseline/
  with-aictx/
  metrics/
  prompts/
The repository should contain only the project under test and AICTX runtime artifacts generated by normal AICTX usage.
Compare matched environments only
Only compare runs when the model and execution environment are matched.
For example, do not compare a baseline run from one model against an AICTX run from another model. If the demo is run with gpt-5.4, both baseline and AICTX runs should use gpt-5.4.
Use --task for normal agent startup
Normal supported startup should use --task, and the value should contain only the task goal:
aictx resume --repo . --task "fix parser test" --json | python3 -m json.tool
Do not pass benchmark instructions, metrics schemas, reporting rules, or final-answer requirements as the task goal.
Bad demo input:
aictx resume --repo . --task "Fix parser test. Also collect these metrics, follow this benchmark schema, output this report..." --json
Good demo input:
aictx resume --repo . --task "fix parser test" --json
In AICTX v6, --task is the agent startup input, so benchmark and reporting instructions should stay outside the agent-facing task goal.
Global memory vs latest execution
A key architectural conclusion from the demo analysis:
The latest execution must update global repo memory.
It must not replace global repo memory.
AICTX should not become a tool that only remembers the last run.
The latest execution summary is valuable because it captures what just happened, but it should be treated as one event in a larger continuity model.
The durable continuity model should preserve:
- architecture memory;
- repo conventions;
- explicit decisions;
- failure memory;
- strategy memory;
- task history;
- user/project preferences;
- relevant files and entry points;
- latest execution summary.
The right mental model is:
Repo global knowledge
+ latest execution event
+ task-specific Work State
+ observed failures
+ successful strategies
+ structural RepoMap hints
= useful startup continuity
The latest execution should influence the next session strongly, but it must not erase older decisions, known failures, conventions, or architectural context.
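A minimal sketch of the "update, don't replace" policy, using hypothetical keys chosen for illustration rather than the actual AICTX storage schema:

```python
import copy


def merge_latest_execution(global_memory: dict, latest_execution: dict) -> dict:
    """Fold one execution event into durable repo memory without erasing it.

    The keys used here are hypothetical; the point is the merge policy,
    not the exact AICTX storage format.
    """
    merged = copy.deepcopy(global_memory)
    # The latest run is appended as one event in task history, not a rewrite.
    merged.setdefault("task_history", []).append(latest_execution)
    # New failures extend failure memory; previously known failures are kept.
    merged.setdefault("failure_memory", []).extend(latest_execution.get("failures", []))
    # Only the "latest execution summary" slot is overwritten, by definition.
    merged["latest_execution_summary"] = latest_execution.get("summary", "")
    # Architecture memory, conventions, and explicit decisions are left untouched.
    return merged
```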
Install and initialize
From inside the target repository:
pip install aictx
aictx install
aictx init
Optional version check:
aictx --version
Explicit initialization form:
aictx init --repo .
For a demo script that should not register extra local state, you can use:
aictx init --repo . --yes --no-register
After initialization, the normal product experience should remain:
install -> init -> use your coding agent
The user should not need to manually drive many AICTX commands during normal development.
Demo flow
1. Show startup continuity
Run:
aictx resume --repo . --task "fix parser test" --json | python3 -m json.tool
Look for:
startup_banner_text
capsule
execution_contract
contract_checks
recommended_starting_points
previous_contract_result
The important product point is that the agent receives one operational brief instead of having to inspect all AICTX internals manually.
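As an illustration of "one operational brief", a wrapper script could pull just the fields listed above out of the resume payload. This is a hedged sketch: it assumes those keys appear at the top level of the JSON output, which may not match the exact payload layout.

```python
import json
import subprocess

# Assumes the keys listed above appear at the top level of the resume payload.
FIELDS = [
    "startup_banner_text",
    "capsule",
    "execution_contract",
    "contract_checks",
    "recommended_starting_points",
    "previous_contract_result",
]

result = subprocess.run(
    ["aictx", "resume", "--repo", ".", "--task", "fix parser test", "--json"],
    capture_output=True,
    text=True,
    check=True,
)
payload = json.loads(result.stdout)
brief = {key: payload.get(key) for key in FIELDS}
print(json.dumps(brief, indent=2))
```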
2. Create visible Work State
Start a task:
aictx task start "Fix token refresh loop" --json
Update it with operational details:
aictx task update --json --json-patch '{"current_hypothesis":"refresh replay happens before persisted token update","active_files":["src/api/client.ts"],"next_action":"inspect interceptor ordering","recommended_commands":["pytest -q tests/test_auth.py"]}'
Inspect continuity:
aictx next
Then resume:
aictx resume --repo . --task "continue token refresh work" --json | python3 -m json.tool
Expected result:
- the active task is visible;
- the current hypothesis is preserved;
- the next action is explicit;
- active files are suggested;
- recommended commands are available to the agent.
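If the patch is produced by a script rather than typed by hand, serializing a dict avoids shell-quoting mistakes. A minimal sketch using the same `--json-patch` flag shown above:

```python
import json
import subprocess

patch = {
    "current_hypothesis": "refresh replay happens before persisted token update",
    "active_files": ["src/api/client.ts"],
    "next_action": "inspect interceptor ordering",
    "recommended_commands": ["pytest -q tests/test_auth.py"],
}

# json.dumps produces the same payload as the hand-written --json-patch above,
# without shell-quoting pitfalls.
subprocess.run(
    ["aictx", "task", "update", "--json", "--json-patch", json.dumps(patch)],
    check=True,
)
```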
3. Show execution contract and compliance
Normal supported agent startup should use one continuity command:
aictx resume --repo . --task "fix parser test" --json | python3 -m json.tool
Look for the compact operational route:
execution_contract.first_action
execution_contract.edit_scope
execution_contract.test_command
execution_contract.finalize_command
contract_checks
After an execution is finalized with observable files, commands, and tests, the final summary can include:
Contract: followed.
or a compact partial, violated, or not-evaluated line.
Inspect contract compliance history and aggregates:
cat .aictx/metrics/contract_compliance.jsonl 2>/dev/null || true
aictx report real-usage
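To aggregate compliance beyond the built-in report, the JSONL history inspected above can be tallied directly. This sketch assumes each line is a JSON object with a result field named `result`; that name is a placeholder, so check an actual record first.

```python
import json
from collections import Counter
from pathlib import Path

path = Path(".aictx/metrics/contract_compliance.jsonl")
counts = Counter()
if path.exists():
    for line in path.read_text().splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # "result" is an assumed field name; inspect one record to confirm it.
        counts[record.get("result", "unknown")] += 1
print(dict(counts))  # e.g. {"followed": 3, "partial": 1}
```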
Run another resume and look for the compact previous-contract signal:
aictx resume --repo . --task "next parser task" --json | python3 -m json.tool
The next resume may include:
previous_contract_result
and the Markdown capsule may include:
Previous contract: followed.
4. Show RepoMap
Install with RepoMap support:
pip install "aictx[repomap]"
aictx install --with-repomap --yes
aictx init --repo . --yes --no-register
Inspect status:
aictx map status
Query structural hints:
aictx map query "work state"
aictx map query "startup banner"
aictx map query "contract compliance"
The demo point is not that RepoMap replaces normal code search. The point is that it gives the agent a structural starting point before it spends tokens rediscovering the repo.
5. Show failure capture
Simulate a failed execution:
aictx internal run-execution --repo . --task "run typecheck" --agent-id demo --json -- python -c "import sys; print('src/app.ts(4,7): error TS2322: Type mismatch', file=sys.stderr); sys.exit(1)"
Inspect captured failure memory:
cat .aictx/failure_memory/failure_patterns.jsonl
aictx report real-usage
Expected result:
- the failure is stored as structured memory;
- future startup context can warn the agent about known failure patterns;
- the failure becomes reusable continuity instead of being lost in chat history.
6. Show real usage report
Run:
aictx report real-usage
Use the report to discuss:
- execution count;
- contract compliance;
- observed failures;
- files touched;
- command/test evidence;
- whether the next session has useful continuity.
Do not overclaim from one metric. Total tokens alone can be misleading when the prompt includes benchmark instrumentation.
Recommended benchmark shape
A useful demo compares at least two matched runs.
Run A: baseline agent, no AICTX continuity.
Run B: same agent/model/environment, AICTX initialized and used through normal resume/finalize flow.
Recommended metrics:
| Metric | Why it matters |
|---|---|
| Time to first useful edit | Measures orientation overhead |
| Number of exploratory commands | Shows repo rediscovery cost |
| Number of files inspected | Shows whether search is focused |
| Number of files edited | Shows edit scope discipline |
| Repeated failed commands | Shows whether Failure Memory helps |
| Tests run | Shows verification behavior |
| Contract result | Shows whether the agent followed the expected route |
| Uncached tokens to useful action | Better than total tokens for startup efficiency |
| Total tokens | Useful, but not sufficient alone |
| Quality of next-session summary | Measures continuity value |
The strongest expected proof is not “AICTX always uses fewer tokens”.
The strongest expected proof is:
The next session starts with more useful state, fewer blind spots, and a clearer operational route.
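When the recommended metrics are captured for both runs, the comparison itself is simple arithmetic. The sketch below assumes each run's counters were saved as a flat JSON file of numbers; the file names and keys are illustrative, not an AICTX format.

```python
import json
from pathlib import Path


def compare_runs(baseline_path: str, aictx_path: str) -> None:
    """Print per-metric deltas between two matched runs.

    Both files are assumed to be flat JSON objects of numeric counters,
    e.g. {"files_explored": 10, "files_edited": 3, "commands_run": 15}.
    """
    baseline = json.loads(Path(baseline_path).read_text())
    with_aictx = json.loads(Path(aictx_path).read_text())
    for metric in sorted(set(baseline) & set(with_aictx)):
        delta = with_aictx[metric] - baseline[metric]
        pct = (delta / baseline[metric] * 100) if baseline[metric] else 0.0
        print(f"{metric}: {baseline[metric]} -> {with_aictx[metric]} ({delta:+} / {pct:+.1f}%)")


compare_runs("metrics/baseline.json", "metrics/with_aictx.json")  # hypothetical paths
```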
How to interpret demo results
Strong result
A strong AICTX run looks like this:
- the agent starts from `aictx resume`;
- it identifies the relevant files sooner;
- it avoids unrelated exploration;
- it follows or partially follows the execution contract;
- it runs the expected test or explains why not;
- finalize records what happened;
- the next resume includes useful prior evidence;
- the next session edits are narrower or more targeted.
Weak result
A weak or contaminated run looks like this:
- the task goal includes benchmark instructions;
- `.demo_metrics` or demo scaffolding is present in the repo;
- the agent edits metrics files instead of project files;
- the agent spends too much time interpreting the demo harness;
- total tokens increase because instrumentation polluted the task;
- the next resume overweights the latest execution and loses broader repo context.
Product interpretation
If the run improves orientation, focus, and continuity but does not reduce total tokens, the correct conclusion is:
AICTX is already useful as an operational continuity layer.
Further work should reduce startup verbosity and separate instrumentation from task context.
Demo checklist
Before running:
- Same model for baseline and AICTX runs.
- Same repo state for baseline and AICTX runs.
- `.demo_metrics` absent from the target repo.
- Prompt/metrics files stored outside the repo.
- `aictx --version` recorded.
- Task goal separated from benchmark/reporting instructions.
- `aictx resume --task "<task goal>"` used for normal startup.
During the run:
- Capture time to first useful edit.
- Capture exploratory commands.
- Capture files inspected and edited.
- Capture tests/commands executed.
- Capture repeated failures.
- Capture contract result.
- Capture final AICTX summary.
After the run:
- Run `aictx report real-usage`.
- Inspect `.aictx/continuity/last_execution_summary.md`.
- Inspect `.aictx/metrics/contract_compliance.jsonl` if present.
- Confirm the next `aictx resume` contains useful continuity.
- Compare matched runs only.
- Interpret total tokens alongside focus, time, command count, and continuity quality.
Current next improvements suggested by the demo
The demo analysis points to these product improvements:
- Prefer `--task` as the documented normal startup interface.
- Use `--task` for agent-facing task goals and keep benchmark/reporting instructions out of startup input.
- Reduce startup verbosity so the agent receives less runtime explanation and more operational context.
- Make the distinction clear between latest execution summary and durable global repo memory.
- Ensure latest execution updates global memory instead of replacing it.
- Keep demo metrics outside the repo to avoid contaminating agent behavior.
- Report continuity quality, focus, and repeated-failure avoidance, not only total token count.
Short demo script
Use this for a compact live demo.
# Install and init
pip install aictx
aictx install
aictx init --repo . --yes --no-register
# Version
aictx --version
# Create work state
aictx task start "Fix token refresh loop" --json
aictx task update --json --json-patch '{"current_hypothesis":"refresh replay happens before persisted token update","active_files":["src/api/client.ts"],"next_action":"inspect interceptor ordering","recommended_commands":["pytest -q tests/test_auth.py"]}'
# Resume as an agent would
aictx resume --repo . --task "continue token refresh work" --json | python3 -m json.tool
# Optional RepoMap
pip install "aictx[repomap]"
aictx install --with-repomap --yes
aictx map status
aictx map query "work state"
# Failure memory demo
aictx internal run-execution --repo . --task "run typecheck" --agent-id demo --json -- python -c "import sys; print('src/app.ts(4,7): error TS2322: Type mismatch', file=sys.stderr); sys.exit(1)"
cat .aictx/failure_memory/failure_patterns.jsonl
# Usage and compliance
aictx report real-usage
cat .aictx/metrics/contract_compliance.jsonl 2>/dev/null || true
One-sentence takeaway
AICTX makes coding-agent continuity explicit, repo-local, inspectable, and reusable; the demo shows stronger orientation and session-to-session focus, while also revealing that token savings require careful separation between task context and benchmark instrumentation.