Advanced · 6 min read

Evidence-Based Development: Proving What AI Actually Built

Track the connection between AI conversations and code changes. Learn how evidence systems provide proof of what was actually implemented.

The Gap Between Discussion and Implementation

You spend an hour discussing authentication with your AI assistant. Great conversation. Solid decisions.

A week later, you're asked: "What did we actually implement from that discussion?"

You look at the code. You look at the chat history. There's no connection between them.

Discussion ≠ Implementation

The AI conversation says you decided on JWT. But did you implement it? Fully? Partially? Correctly?

Why Evidence Matters

Scenario 1: Code review

Reviewer: "Why is this approach used instead of the alternative?"

You: "We discussed it in AI chat... somewhere... let me find it..."

Scenario 2: Bug investigation

Team: "When did this behavior get introduced?"

You: "I was working with AI on it, but I'm not sure which changes came from that session."

Scenario 3: Audit/compliance

Auditor: "Can you prove this code was properly reviewed?"

You: "The AI and I went through it, but there's no record..."

Without evidence, AI-assisted development is a black box.

What Evidence Tracking Provides

An evidence system creates a traceable link between:

AI Conversations → File Changes → Code State

Every discussion can be connected to:

  • What files were modified during/after
  • What the content looked like before/after
  • When changes happened relative to the conversation
  • Confidence level of the connection

The Evidence Pack

Here's what a comprehensive evidence system captures:

File Events (activity.jsonl)

{
  "t": "2026-02-04T10:30:00Z",
  "kind": "save",
  "path": "src/auth/jwt.ts",
  "sha256": "a1b2c3...",
  "lines": 145,
  "status": "verified"
}

Every create, save, delete, rename — timestamped and hashed with SHA-256. File status is tracked (verified, exists_no_hash, binary_skipped, too_large).
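
In rough TypeScript terms, producing such an event is just hashing the saved file and appending a JSON line. This is an illustrative sketch, not RL4's actual code; the field names mirror the example record above.

```typescript
import { createHash } from "node:crypto";
import { appendFileSync, readFileSync } from "node:fs";

// Record a "save" event: hash the file's content with SHA-256 and append
// one JSON line to the activity log.
function recordSaveEvent(path: string, logPath = ".rl4/activity.jsonl"): void {
  const bytes = readFileSync(path);
  const event = {
    t: new Date().toISOString(),
    kind: "save",
    path,
    sha256: createHash("sha256").update(bytes).digest("hex"),
    lines: bytes.toString("utf8").split("\n").length,
    status: "verified",
  };
  appendFileSync(logPath, JSON.stringify(event) + "\n");
}
```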

Work Sessions (sessions.jsonl)

{
  "date": "2026-02-04",
  "start": "09:17",
  "end": "15:36",
  "duration_h": 6.3,
  "events": 338
}

Sessions are detected by gap analysis: a gap of more than 6 hours between events starts a new session — showing when you actually worked, not just when files changed.
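
The gap-based approach is simple enough to sketch in a few lines of TypeScript. This is illustrative only, using the 6-hour threshold mentioned above:

```typescript
// A new session starts when the gap between consecutive events exceeds the
// threshold. duration_h can be derived from start/end when writing sessions.jsonl.
type Session = { start: Date; end: Date; events: number };

function detectSessions(timestamps: Date[], gapHours = 6): Session[] {
  const gapMs = gapHours * 60 * 60 * 1000;
  const sorted = [...timestamps].sort((a, b) => a.getTime() - b.getTime());
  const sessions: Session[] = [];
  for (const t of sorted) {
    const current = sessions[sessions.length - 1];
    if (current && t.getTime() - current.end.getTime() <= gapMs) {
      current.end = t;        // still the same session
      current.events += 1;
    } else {
      sessions.push({ start: t, end: t, events: 1 }); // gap exceeded: new session
    }
  }
  return sessions;
}
```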

Causal Links (causal_links.jsonl)

{
  "chat_ref": "composer_abc123",
  "file": "src/auth/jwt.ts",
  "delay_ms": 45000,
  "confidence": {
    "level": "likely",
    "reason": "File modified 45s after JWT discussion"
  }
}

The connection between conversation and code change.
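
One way to derive a confidence level from that delay looks like this. The thresholds are hypothetical, chosen only to illustrate grading correlation strength by timing; they are not RL4's actual values.

```typescript
type Confidence = "certain" | "likely" | "possible" | "none";

// Shorter delay between discussion and file change means a stronger link.
function gradeCausalLink(delayMs: number): Confidence {
  if (delayMs < 0) return "none";             // file changed before the discussion
  if (delayMs <= 10_000) return "certain";    // within 10 seconds
  if (delayMs <= 2 * 60_000) return "likely"; // within 2 minutes
  if (delayMs <= 30 * 60_000) return "possible";
  return "none";
}

gradeCausalLink(45_000); // 45s after the JWT discussion: "likely", as in the record above
```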

Burst Patterns (burst_stats.jsonl)

{
  "type": "feature",
  "files": ["jwt.ts", "auth.ts", "middleware.ts"],
  "duration_min": 45,
  "intensity": "high"
}

Detected activity patterns (feature work, bug fix, refactor).
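
A burst classifier can be sketched with simple heuristics. The rules below are purely illustrative assumptions (new files suggest feature work, renames suggest a refactor, test-file saves suggest a fix); a real detector would weigh more signals.

```typescript
type BurstType = "feature" | "fix" | "refactor";
interface FileEvent { kind: "create" | "save" | "delete" | "rename"; path: string; }

function classifyBurst(events: FileEvent[]): BurstType {
  const creates = events.filter(e => e.kind === "create").length;
  const renames = events.filter(e => e.kind === "rename").length;
  const testSaves = events.filter(e => e.kind === "save" && /\.test\./.test(e.path)).length;
  if (renames > 0 && renames >= creates) return "refactor"; // files moved around, little new
  if (creates > 0) return "feature";                        // new files appeared
  return testSaves > 0 ? "fix" : "feature";                 // tests touched, no new files
}
```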

Hot Files: The Implementation Spine

One of the most valuable evidence outputs is hot files:

HOT FILES (most modified during capture)
1. src/auth/jwt.ts (23 saves)
2. src/middleware/auth.ts (18 saves)
3. src/types/auth.d.ts (12 saves)
4. src/tests/auth.test.ts (9 saves)

These are the files that actually changed. Compare to topics discussed:

DISCUSSED TOPICS:
✓ JWT implementation [hot: jwt.ts]
✓ Middleware setup [hot: auth.ts]
✓ Type definitions [hot: auth.d.ts]
? Rate limiting [no hot files — not implemented!]

Gap detection: You discussed rate limiting but there's no evidence it was implemented. That's a potential gap in your work.
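
Both the hot-file ranking and the gap check are a straightforward pass over the activity log. Here's a sketch; the topic-to-pattern map is a hypothetical input you define yourself.

```typescript
import { readFileSync } from "node:fs";

// Count saves per path from activity.jsonl, most-saved files first.
function hotFiles(logPath = ".rl4/activity.jsonl"): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of readFileSync(logPath, "utf8").split("\n").filter(Boolean)) {
    const event = JSON.parse(line);
    if (event.kind === "save") counts.set(event.path, (counts.get(event.path) ?? 0) + 1);
  }
  return new Map([...counts].sort((a, b) => b[1] - a[1]));
}

// A discussed topic that matches none of the hot files is a candidate gap.
function findGaps(topics: Record<string, RegExp>, hot: Map<string, number>): string[] {
  const paths = [...hot.keys()];
  return Object.entries(topics)
    .filter(([, pattern]) => !paths.some((p) => pattern.test(p)))
    .map(([topic]) => topic);
}

// Example: "Rate limiting" comes back as a gap if no matching file was ever saved.
const gaps = findGaps({ "Rate limiting": /rate.?limit/i, "JWT implementation": /jwt/i }, hotFiles());
```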

Evidence for Integrity

Beyond tracking, evidence enables verification:

Checksum (SHA-256)

Snapshot Checksum: sha256:abc123def456...

Proves the snapshot wasn't modified after generation. The content is what it was when captured.

What checksums prove:

  • The snapshot is exactly what was generated
  • No accidental modifications occurred
  • Content integrity for audits

What checksums don't prove:

  • That the content is "correct"
  • That decisions were good
  • That implementation matches discussion

Integrity ≠ Truth. Checksums verify the snapshot is unchanged, not that it's right.
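
Verification itself is trivial. A sketch, assuming the snapshot is a local file and the recorded value carries the `sha256:` prefix shown above:

```typescript
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

// Recompute the hash and compare it to the recorded value. This checks
// integrity (the file is unchanged), not correctness of its content.
function verifySnapshot(snapshotPath: string, recorded: string): boolean {
  const actual = createHash("sha256").update(readFileSync(snapshotPath)).digest("hex");
  return actual === recorded.replace(/^sha256:/, "");
}

verifySnapshot("snapshot.md", "sha256:abc123def456..."); // example path and value
```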

Using Evidence in Practice

Code Reviews:

"Looking at the evidence pack, I can see jwt.ts was heavily 
modified during the auth discussion (23 saves, high confidence 
causal link). The changes align with decisions D3 and D4."

Debugging:

"The burst analysis shows a 'fix' pattern on 2026-02-01 touching 
these 3 files. That's when the bug was introduced. Let me check 
what was being discussed..."

Onboarding:

"Here's the evidence from the auth implementation. You can see 
which files were created, in what order, and how they connect 
to the architectural decisions."

Auditing:

"Every code change during AI-assisted development is logged with 
timestamps and hashes. We can reconstruct exactly what happened, 
when, and connect it to the conversation that prompted it."

Evidence Limitations (Be Honest)

Evidence tracking has boundaries:

Budgeted collection:

For performance, evidence has configurable limits:

  • `maxFilesToHash`: 200 files (configurable up to 1000)
  • `maxBytesToHash`: 16MB (configurable up to 64MB)
  • `maxSingleFileBytes`: 2MB per file (configurable up to 8MB)

Large codebases may have partial coverage, but the most active files are always captured.
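
To make the budget concrete, here's a sketch of how such limits could be applied when choosing which files to hash. The field names match the list above; the selection logic itself is an assumption, not RL4's implementation.

```typescript
interface HashBudget {
  maxFilesToHash: number;     // e.g. 200, configurable up to 1000
  maxBytesToHash: number;     // e.g. 16 MB total, configurable up to 64 MB
  maxSingleFileBytes: number; // e.g. 2 MB per file, configurable up to 8 MB
}

function selectFilesToHash(
  files: { path: string; size: number }[],
  budget: HashBudget,
): string[] {
  const picked: string[] = [];
  let totalBytes = 0;
  for (const f of files) {
    if (picked.length >= budget.maxFilesToHash) break;
    if (f.size > budget.maxSingleFileBytes) continue;          // would be logged as too_large
    if (totalBytes + f.size > budget.maxBytesToHash) continue; // over the total byte budget
    picked.push(f.path);
    totalBytes += f.size;
  }
  return picked;
}
```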

Correlation ≠ Causation:

A file changing after a conversation doesn't prove the conversation caused it. Confidence levels (certain/likely/possible/none) reflect this uncertainty.

Now captures more than before:

RL4 v2.0 captures git commits (`commits.jsonl`), terminal commands (`cli_history.jsonl`), and IDE activity beyond just file saves.

Local-first, cloud optional:

Evidence stays on your machine in `.rl4/`. Optional Supabase sync for backup and cross-device access.

Building an Evidence Culture

For teams adopting evidence-based development:

Start with hot files:

The simplest win. "What files actually changed?" answers most questions.

Add causal links gradually:

Connect conversations to changes. Even low-confidence links are useful.

Review evidence in standups:

"Based on evidence, we made progress on auth (high activity) but haven't touched the rate limiter (no activity)."

Include in code reviews:

"PR includes evidence pack. You can see the AI conversation that led to these changes."

The Future of Traceable AI Development

As AI becomes central to development, traceability becomes essential:

  • **Compliance requirements** will demand proof of AI involvement
  • **Team coordination** will need shared evidence
  • **Quality assurance** will verify AI suggestions were implemented correctly

The developers and teams who build evidence habits now will be ready.

Start Collecting Evidence

Your AI-assisted development should be traceable. Not for bureaucracy, but for clarity. This AI audit trail approach builds on [understanding context loss](/cursor/blog/cursor-context-loss-killing-productivity) and [the hidden cost of losing it](/cursor/blog/hidden-cost-context-loss-ai-development).

**[Try RL4 Snapshot](/cursor/form)** — full development evidence tracking included. Know what was discussed, what was implemented, and how they connect with AI code tracking.

Proof of implementation beats memory every time.
