Beam · Learning Hub redesign · wireframes

The agent's monitors are the spine of the hub.

TL;DR · v2 · locks from sparring round 1
Today's Learning Hub is keyed off engineering pillars. The redesign keys it off what the user is trying to achieve. Each axis becomes a monitor the user sets up: it listens to a metric, gathers data, surfaces work, and asks for help when it needs you.

Locked this round: mental model = monitors (was "axes" — sharper metaphor) · A2 configure affordance = centered modal (Variant B picked) · edit pattern = same modal, populated · configuration extended with cost ceiling, signal sources, "I need help" threshold, notifications · Inbox is primary, not alternative — partner to the monitor hub.

Still open: flatter drill-in alternatives · continuous-mode visibility screen · HITL approve flow.

How to read this doc
Eight screens, each drawn as a gray-and-white wireframe. Variants live inside .variation cards — the picked one is oat-tinted and tagged Pick; rejected/carried variants stay visible with their reason. The page chrome (ivory bg, serif headlines, clay accents) is the doc's voice; the wireframes are pure neutral grays — they describe structure, not visual design.

Round 2 · added in this pass
Per round-1 feedback the doc now adds: problem statement (#00b) · sub-nav rationale (#01b) · C2 Finding detail with 3 variants (HITL review screen, click-through from inbox) · D1 Continuous monitor visibility · E1 Datasets index · F1 Audit log. Flow labels reorganised: S1–S7 → A1–F1 (Setup / Operate / Decision / Continuous / Datasets / Audit). Still queued: flatter B2 drill-in variants · A4 modal redraw to match locked A2.

#00 The problem we're solving

The user has an agent in production. They want it to keep getting better against the things they care about — accuracy, cost, latency, robustness, drift, capability.

The mechanism is monitors. Each one:

  • Tracks one metric (accuracy / cost / latency / robustness / …).
  • Knows the user's target (e.g. 95% accuracy by Jul 14).
  • Knows the user's budget (e.g. don't ship anything that pushes cost over $0.10/task).
  • Listens to signals — production tasks, HITL corrections, golden-set replay, synthetic edges.
  • Surfaces work when it has a fix to propose, or escalates when it needs help.
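
As a concrete reading of that list, a minimal sketch of what one monitor's configuration could serialize to. All names (MonitorConfig, helpThreshold, and so on) are hypothetical illustrations, not Beam's actual schema; the values echo the examples above.

```ts
// Hypothetical shape for one monitor's config. Mirrors the list above;
// names are illustrative, not Beam's real schema.
type Metric = "accuracy" | "cost" | "latency" | "robustness";
type Mode = "off" | "watch" | "approve" | "continuous";

interface MonitorConfig {
  metric: Metric;
  target: { value: number; by: string };           // e.g. 95 (%) by "2026-07-14"
  budget?: { maxCostPerTask: number };             // e.g. 0.10 ($/task hard cap)
  signals: ("production" | "hitl" | "golden_replay" | "synthetic")[];
  helpThreshold: {                                 // when the monitor escalates
    confidenceDropPt?: number;                     // e.g. 10 (pt drop on any tool)
    newClusterTrafficPct?: number;                 // e.g. 5 (% of traffic)
  };
  autoRollback: { dropPt: number; windowHours: number }; // revert if -5pt / 24h
  mode: Mode;
}

const accuracy: MonitorConfig = {
  metric: "accuracy",
  target: { value: 95, by: "2026-07-14" },
  budget: { maxCostPerTask: 0.10 },
  signals: ["production", "hitl", "golden_replay"],
  helpThreshold: { confidenceDropPt: 10, newClusterTrafficPct: 5 },
  autoRollback: { dropPt: 5, windowHours: 24 },
  mode: "approve",
};
```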

The interface gives the user the tools to do three jobs:

  1. Train — provide labels, corrections, golden examples. (Datasets tab.)
  2. HITL optimize — review monitor findings, approve / reject / rollback. (Findings tab.)
  3. Audit — read the immutable log of every change, signal, and rollback. (Audit tab.)

Each monitor runs in one of two operational modes:

HITL · Approve
Monitor proposes — user decides.
Every change surfaces as a finding in the inbox. User reviews, ships or rejects. Default mode for new monitors. Auto-rollback armed.
Continuous
Monitor auto-ships — escalates only when needed.
Safe changes ship without approval. Anything that breaches the "I need help" threshold (cluster gap, confidence drop, trade-off) escalates to the inbox. Rollback always available.

Why both modes: the trust ladder. Users start in HITL-Approve, see the monitor working well, and graduate it to Continuous when they trust it. Trust is per-monitor — Accuracy can be Continuous while Cost is still HITL.
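
A sketch of the routing rule the two modes imply, using the same hypothetical names as the config sketch above. Illustrative logic only, not Beam's actual dispatcher.

```ts
// Illustrative: where a monitor's proposed change goes under each mode.
// `Proposal` and `route` are hypothetical names for this sketch.
interface Proposal {
  risk: "low" | "medium" | "high";
  breachesHelpThreshold: boolean; // cluster gap, confidence drop, trade-off
}

function route(mode: "approve" | "continuous", p: Proposal): "inbox" | "auto-ship" {
  if (mode === "approve") return "inbox"; // monitor proposes, user decides
  // Continuous: safe changes ship without approval; threshold breaches escalate.
  return p.breachesHelpThreshold || p.risk === "high" ? "inbox" : "auto-ship";
}
```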

#02 Locked principles

Every variant in this doc is checked against these.

P1
Goals are the spine.
Page keyed off what the user is trying to achieve, not Beam's internal taxonomy.
P2
Scalable to new monitors & surfaces.
New monitor (drift, safety, capability) or new surface (model swap, graph reorg) doesn't require redesign.
P3
Customization first-class.
Each goal has explicit setup — target, mode, verification, rollback, trade-offs.
P4
Per-monitor trust ladder.
Off → Watch → Approve → Continuous set independently per monitor.
P5
Operational tone.
Beam shell at runtime density. No marketing copy.
P6
Every claim has evidence.
Findings show dataset, validation, deltas, risks.
P7
Reversible by default.
Auto-rollback armed; every shipped change has rollback.
P8
Show what Beam is doing.
Pending + in-flight + recent activity all visible.

#03 Flow map

Hot path (thick stroke) = first-time owner setting up the first goal. Clay nodes = open forks. Dashed edges = loops or the alternative spine.

Entry · per-agent › Learning
→ A1 · Empty hub · no goals yet · "set up first goal"
A1 → A2 (click +) · Configure modal · FORK A · 4 variants · target / mode / verify…
A2 → A3 (save) · 1-of-4 · accuracy populated · others "+ Set up"
A3 → B1 (+3 more) · Monitor hub · all 4 monitors live · destination · loop: add another axis
B1 → A4 (click ⋯ on tile) · Edit modal · FORK B · 3 variants
B1 → B2 (Open axis) · Monitor detail · per-monitor page · KPI · findings
ALTERNATIVE SPINE · C1 · Combined Inbox · one unified list of decisions · briefing · findings (monitor × change-type) · activity · watching · synthesis: ship as a sub-nav tab alongside Monitors when 3+ findings pile up

Spine decision
Cockpit vs Inbox is a real spine fork. They answer different first questions ("am I on track?" vs "what needs my call?"). Doc carries both forward — synthesis at #08 ships them as sub-nav tabs.
A
Flow A · Setup
From empty hub to a configured monitor.
A1 empty hub · A2 configure (modal) · A3 1-of-N saved · A4 edit. The first-time owner path. Lives in the Monitors tab.

A1 · Empty hub — locked

First-run state. The hero CTA replaces today's "Tool optimisation" card; the 4 empty axis tiles replace today's "Test & eval / Agent tuning" Coming-soon pillars. Goals not pillars (P1, P2).

Locked
No goals yet · pick the first axis to set up
Locked
Invoice agent › Learning
Search…
⚡ 2.5
Learning hub
Tell Beam what to optimise for. Set up a monitor per metric.
Goals Findings Datasets Tuning Audit
Set what you want this agent to optimise for
A monitor listens to a metric, gathers data, surfaces work, and asks for your help. Pick one to start.
Accuracy
Track how often the agent gets it right.
Most teams start here
Cost
Track $/task & per-tool tokens.
Latency
Track p95/p99; find bottlenecks.
Robustness
Watch silent regressions & clusters.
Answers
"What does Beam want from me to start?"
New components
Hub empty CTA · Monitor tile (empty)
Brief
P1 ✓ P2 ✓

A2 · Configure (setup affordance) — LOCKED on Variant B (modal)

The customization screen that was missing from the dashboard. Same form across every variant — what differs is where the form lives. Locked on Variant B (centered modal) in round 1: "helps focus on configuration." The other variants are kept visible with their reasons so future readers see the alternatives explored.

Variant A
Right-rail drawer (480px) · Beam-native
Reject · was previous lean
Invoice agent › Learning
Search…
⚡ 2.5
Learning hub
Goals Findings Datasets Tuning
Accuracy
Track…
Cost
Track…
Latency
Track…
Robust.
Watch…
Set up Accuracy goal
Tell Beam how accurate this agent should be, how to verify, when it can act on its own.
Target accuracy *
95% by Jul 14
90% 95% 99% Custom…
Mode
Off
Watch only
Human-approve
Beam proposes, you decide.
Continuous
🔒 30d
Verify against
golden_v3 · 412 ex
Auto-rollback
Revert if -5pt on cluster / 24h
Trade-off rules
No cost > $0.05/task
No p95 > +1s
Editable later.
Bet
Context preservation; Beam-native (Tasks/Flow use drawers).
Weakest at
Vertical cramping for the now-meaty 8-section config form — drawer is too narrow.
Brief
P5 ✓ · Rejected in favour of Variant B: user preferred modal for "focus on configuration." Drawer is reused at A4 for quick edits.
Variant C
Dedicated setup page · evidence side panel
Reject · evidence relocated
Invoice agent › Learning › Set up Accuracy
Set up Accuracy goal
Tell Beam how accurate this agent should be, how it should verify, when it can act on its own.
TARGET
95% by Jul 14
90% 95% 99% Custom
MODE
Off
Watch
Approve (rec.)
Continuous
🔒
AUTO-ROLLBACK
Revert if -5pt / 24h
Bet
Live tool scores while configuring — informs target choice (P6).
Weakest at
Click cost; heavy for editing.
Brief
P3 ✓ P6 ✓; P5 mixed (no page-form precedent in Beam)
Variant D
Inline expand — tile replaces itself with form
Reject · ergonomic relocated to A4-B
Invoice agent › Learning
Learning hub
Goals Findings Datasets
Cost
Set up Accuracy
Target
95%
Mode
Off Watch Approve Cont.
Latency
Robust.
Bet
Minimal perceived interruption.
Weakest at
Cramped in 1 tile width; doesn't scale to 4-axis setup back-to-back.
Brief
P3 ✓; P5 mixed (no precedent)

Trade-off table

Variant | Best at | Weakest at | Brief
A · Drawer | Context preservation; Beam-native | Cramped for 8-section config form | P5 ✓
B · Modal | Focus on configuration; room for the meaty form | Heavy for "just toggle mode" edits | P3 ✓ P7 ✓
C · Page | Live evidence side panel | Click cost; heavy for editing | P3 ✓ P6 ✓; P5 mixed
D · Inline | Lightest perceived interruption | Cramped; doesn't scale | P3 ✓; P5 mixed

Pick · Variant B (Modal) · locked round 1
"Helps focus on configuration." The agent's initial lean was Variant A (drawer) for Beam-nativeness, but the user overrode: with the extended config (now 8 sections — target / mode / cost ceiling / verify / signals / help threshold / rollback / notifications) the form is too meaty for a 480px drawer, and a centered modal forces the user to commit to setting it up before returning to the hub.

Reject-or-relocate: C's evidence panel relocated into the drill-in side panel (B2). D's inline ergonomic carried to A4 as a 1-click mode-pill popover for quick edits. Drawer (A) reused at A4 only for partial edits where the modal feels heavy.

A3 · 1-of-4 (between-state) — locked

Between "I clicked save" and "Beam has data" there's a several-minute gap. The Accuracy tile's "Gathering first measurement…" state is the only signal the goal actually landed in a real measurement loop. P8.

Locked
Accuracy populated, others "+ Set up"
Locked
Invoice agent › Learning
Search…
⚡ 2.5
Learning hub
Goals Findings Datasets Tuning Audit
Mission
Reach 95% accuracy by Jul 14, 2026.
Add cost / latency / robustness for more leverage.
Owner
Anna S.
Verify
golden_v3
1 of 4 monitors set up. Add cost or latency so Beam can flag trade-offs. Add another →
Accuracy
Approve
—%
/ 95% goal
Gathering first measurement…
Verify · golden_v3 · 412 ex
In flight · First replay queued · ETA 5m
Cost
Track $/task & tokens.
Latency
Track p95 / p99.
Robustness
Watch silent regress.
Answers
"Did my goal save? What now?"
Key state
"—% · Gathering first measurement… · ETA 5m"
Brief
P8 ✓

A4 · Edit affordance — LOCKED: same modal as setup + inline mode-pill

Edit-existing-configuration uses the same centered modal as A2 setup, populated with current values + an edit-history meta strip. For the most frequent edit (mode change → graduation Approve → Continuous), a 1-click inline mode-pill popover lives on the monitor tile. The wireframe below still shows the right-rail drawer pattern — slated for redraw next round to match the locked A2 modal.

Variant A · needs redraw
Edit modal · same as A2 (currently shown as drawer — redraw pending)
Carry · redraw as modal
Invoice agent › Learning
Search…
Learning hub
Goals Findings Datasets
Acc.
86%
Cost
$0.34
Lat.
4.1s
Robust
2
Edit Accuracy goal
Now 86% / 95%. ▲ +2.4 pt this week.
Last edit: 3d ago · Shipped: 6 · RB: 1
Target
95% by Jul 14
Mode
Off
Watch
Approve (current)
18d stable so far.
Continuous
unlocks in 12d
🔒
Beam: graduate at 30d. You're at 18d.
Verify
golden_v3 · 412 ex
Auto-rollback
Revert -5pt / 24h
Last fired Sat.
Applies to NEW findings.
Bet
Same modal as A2 — populated vs empty. Cognitive model unified.
Weakest at
4-click cost for "toggle mode" — handled by Variant B inline popover below.
Brief
P3 ✓ P7 ✓ · Pending redraw to show modal not drawer.
Variant B · Layer on A
Inline mode-pill popover on the tile
Carry · enhancement
Invoice agent › Learning
Learning hub
Goals Findings
Accuracy
Approve ▾
86%
/ 95%
▲ +2.4 pt
Change mode
Off
Watch
Approve (current)
Continuous
🔒 12d
Cost
Watch
$0.34
Lat.
Watch
4.1s
Robust
Watch
2
Bet
1-click for most common edit (mode graduation Approve → Continuous).
Weakest at
Two patterns to maintain — first-time users hit inconsistency.
Brief
P5 ✓; P3 mixed — only mode is inline, rest goes through drawer
Variant C
Drill-in Settings tab only
Reject
Invoice agent › Learning › Accuracy
← Goals Accuracy
Findings Activity Watching Settings
Goal settings
Target
95%by Jul 14
Mode
Approve
Verify
golden_v3
Rollback
-5pt threshold
Bet
Single source of truth — one place for all axis controls.
Weakest at
3-click depth (hub → drill-in → settings → form) — too slow for mode change.
Brief
P5 ✓; P3 ✗ (not first-class at hub level)
Pick · same modal as A2 + inline mode-pill · locked round 1
Same modal handles both setup and edit — populated with current values, edit-history meta strip at top, mode-graduation hint inline (Variant A). For the most frequent edit — mode change at the trust-ladder moment (Approve → Continuous) — a 1-click inline mode-pill popover lives on the monitor tile (Variant B). Variant C (drill-in Settings tab) rejected: 3-click depth too slow for the most common edit.
B
Flow B · Operate
Live monitors at a glance, drill in when you need to.
B1 monitor hub (4 tiles, mission + summary footer) · B2 monitor detail (per-monitor KPI, scoped findings, side-rail tools). Lives in the Monitors tab.

B1 · Monitor hub (multi-monitor dashboard) — locked · in Figma

Two weeks in. Full mission. All 4 monitors populated. Per-tile opens the edit modal (same as A2).

Locked · destination
All 4 monitors live · summary footer · already at 1322:106 · B1
Locked
Invoice agent › Learning
Search…
⚡ 2.5
Learning hub
Goals Findings Datasets Tuning Audit
Mission
Reach 95% acc before VW go-live, cost <$0.10, p95 <2s.
Owner
Anna S.
Deadline
Jul 14
Budget
$847 / $1.2k
Accuracy
Approve
86%
/ 95% goal
▲ +2.4 pt
Focus · Blocking: extract, match
In flight · 3 findings waiting
Cost
Watch
$0.34
/ <$0.10
▲ +3% w/w
Budget · $847 / $1.2k
In flight · 1 trade-off pending
Latency
Watch
4.1s
/ <2s p95
▼ −0.6s 7d
B'neck · classify, extract
In flight · 1 finding ready
Robust.
Watch
2
silent (7d)
▲ new cluster
Golden · 412 ex · 94% pass
In flight · 1 cluster decide
Across all goals · 7d: 3 waiting · 6 shipped · 1 rolled back · 4 watching · Activity →
Answers
"Am I getting where I'm going on the things I care about?"
Anatomy
4 tiles · identical anatomy · summary footer (not a 5th tile)
Brief
P1 P2 P4 P5 P7 P8 ✓

B2 · Monitor detail (drill-in · Open Accuracy) — locked

Per-monitor full page. KPI + sparkline, scoped findings, side-rail with monitor settings + Focus tools (pulled from live Flow page node-quality). Pending: flatter drill-in variants per round-1 feedback — full-page nav reads as too deep. Three flatter bets to draw next: inline expand-in-place, sticky detail panel under the hub, or "no drill-down" (everything important on the tile).

Locked · full page
/agent/<id>/learning/accuracy
Locked
Invoice agent › Learning › Accuracy
Search…
⚡ 2.5
← Goals Accuracy ● Approve
86%
Goal: 95% by Jul 14
▲ +2.4 pt · +12 pt 30d
95% goal
Findings (3) Activity (4) Watching (1) Settings
Re-tune extract_amount · date cluster
prompt · 47 ex · +0.4 pt
Tune match_vendor · German handwriting
prompt · 21 ex · +0.3 pt
12 HITL tasks · stitch_multipage
feeds memory
Answers
"What's actually happening to accuracy?"
Anatomy
KPI + sparkline · tabs · scoped findings · side-panel of settings & focus tools
Brief
P1 P6 P7 ✓
C
Flow C · HITL Decision
The user's queue — review, ship, reject.
C1 combined inbox (one unified list of decisions) · C2 finding detail (the HITL review screen). Primary surface for HITL-Approve mode. Lives in the Findings tab.

C1 · Combined Inbox — primary · locked round 1

Promoted from alternative spine → primary surface. Confirmed in round 1: "I very much like the inbox. This is also not bad, because this is something we got as feedback in the user interview as well. If there is something, just surface it up." The inbox is where the monitors push things that need the user — it's the partner surface to the monitor hub, not a fallback.

Locked · sub-nav tab
Briefing headline · unified decisions list · recent activity
Locked
Invoice agent › Learning
Search…
⚡ 2.5
Learning hub
Mode: ● Continuous · you approve high-risk only
Inbox Goals Datasets Tuning Audit
This week · May 1–8
Your agent got +2.4 pts more accurate,
no change in cost, 0.6s faster on p95.
4 auto-shipped · 3 queued for you.
Accuracy
86%
▲ +2.4 pt
Cost
$0.34
— no change
p95
4.1s
▼ −0.6s
Decisions waiting · 3 · oldest 3h
Parallelize classify_doc + extract_amount · latency · graph reorg · #142 · 3h
No data dependency. End node accepts partial failures.
p95 4.1→2.6s · acc ─ · cost ─
Re-tune extract_amount · date cluster · accuracy · prompt · #141 · 18h
47 failures on non-ISO dates. New prompt accepts ISO/DD.MM.YY/MM/DD.
cluster 78→94% · overall 86→86.4%
Swap classify_doc · GPT-4o → Haiku · cost · model swap · ⚠ trade · #140 · 1d
cost −12%, acc −1.1pt. Beam: don't ship.
cost −12% · acc −1.1pt
Recent activity · 4 shipped · 1 rolled back
Mon · auto · accuracy · parse_date · prompt · 78→94% on cluster
Sat · ↺ · cost · classify_doc · swap rolled back · −11% acc on multi-currency
Answers
"What needs my call right now?"
When it wins
Multi-axis trade-offs (one card carries both deltas); autopilot day
Brief
P6 P7 P8 ✓

Cockpit vs Inbox — when each wins

Use case | Cockpit (B1) | Inbox (C1)
"Am I on track?" | Wins | Briefing helps, scroll needed
"What needs my call?" | Pending buried in tile | Wins — list IS the page
Multi-axis trade-offs | Span 2 tiles awkwardly | Wins — one card, both deltas
First-time owner | Wins — empty axes guide setup | Empty inbox is a dead end
Autopilot day | Tiles feel static | Wins — Activity is the story

Pick · both ship as primary · locked round 1
Monitor hub and Inbox are partner surfaces, not "main + alternative." Hub is "what am I tracking and how is each monitor doing." Inbox is "what needs my call right now." Both reachable as sub-nav tabs from the per-agent Learning route. Hub's summary footer deep-links to Inbox; Inbox stat tiles link back to Hub per monitor.

C2 · Finding detail — 3 variants · awaiting pick

What happens when the user clicks a finding from the C1 inbox to review it. This is the HITL decision screen — where the user actually says ship / reject. The form is the same in every variant (diff + validation + decision bar); what differs is where the screen lives and how it interrupts the inbox.
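
Sketched as a data shape, the payload every variant renders is identical. Field names here are hypothetical, not Beam's schema; the values echo finding #142 in the variants below.

```ts
// Hypothetical shape of one reviewable finding (same payload in all 3 variants;
// only the surface it renders in differs). Values echo #142 below.
interface FindingDetail {
  id: number;                                   // e.g. 142
  monitor: "accuracy" | "cost" | "latency" | "robustness";
  changeType: "prompt" | "model swap" | "graph reorg";
  diff: { before: string; after: string };      // proposed change, rendered as DIFF
  impact: { metric: string; from: string; to: string }[]; // predicted deltas
  validation: string[];                         // e.g. "golden_v3 · 412 examples · 0 regressions"
  risk: "low" | "medium" | "high";              // medium and up shows the risk strip
  autoRollbackArmed: boolean;                   // e.g. armed at −5pt
  decision?: "ship" | "reject";                 // output of the decision bar
}
```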

Variant A
Full-screen review page · dedicated URL
In play
Invoice agent › Learning › Findings › #142
Search…
← Findings · Parallelize classify_doc + extract_amount · latency · graph reorg · #142 · 3h ago
Proposed change
before
fetch_pdf → classify_doc → extract_amount → end
after
fetch_pdf → ┬ classify_doc ┐ → end
           └ extract_amount ┘
Predicted impact
p95 latency
4.1s → 2.6s
▼ −1.5s
accuracy
86% → 86%
no change
cost/task
$0.34 → $0.34
no change
Validation
golden_v3 · 412 examples · 0 regressions
7-day replay · 1,247 production runs · 0 partial-failure issues
Risk · medium · graph topology change · auto-rollback armed at −5pt
3 more findings waiting · this one is highest impact
Bet
Maximum room for evidence — diff + impact + validation all visible at once.
Weakest at
Click cost. Back-and-forth between inbox & detail = lots of nav.
Brief
P5 ✗ · too deep per your "I don't like clicking too deep" feedback.
Variant B
Slide-over drawer (50% width) · inbox stays visible
Carry · likely pick
Invoice agent › Findings
Search…
Inbox · 3 pending
Parallelize classify_doc… · #142
latency · graph reorg · 3h
Re-tune extract_amount… · #141
accuracy · prompt · 18h
Swap classify_doc… · #140 · ⚠
cost · model swap · 1d
Parallelize classify_doc + extract_amount #142 · ✕
latency graph reorg
DIFF
fetch → classify → extract → end
fetch → ┬ classify ┐ → end
       └ extract ┘
IMPACT
p95 4.1s → 2.6s ▼ · acc — · cost —
VALIDATION
✓ golden_v3 · 0 regressions
✓ 7d replay · 1,247 runs
⚠ Risk: medium · auto-rollback armed
Bet
Flat. Inbox stays visible left, decision happens on the right. Closing the drawer drops you back at the inbox with no nav.
Weakest at
Less room for evidence than a full page; reviewer needs to scroll inside the drawer for long validation lists.
Brief
P5 ✓ · Matches your "don't make me click too deep" feedback.
Variant C
Expand-in-place · other inbox rows collapse
In play
Invoice agent › Findings
Search…
Findings · 3 pending
#141 Re-tune extract_amount · accuracy · prompt · 18h · click to review
Parallelize classify_doc + extract_amount · latency · graph reorg · #142 · 3h
before / after
fetch → classify → extract → end
fetch → ┬ classify ┐ → end
impact
p95 4.1s → 2.6s
acc · cost unchanged
✓ golden_v3 · 0 regress
#140 Swap classify_doc · cost · model swap · ⚠ trade-off · 1d · click to review
Bet
No page nav at all. Pure progressive disclosure inside the inbox.
Weakest at
Cramped — the diff + impact need to fit in 1 row's width. Long validation lists scroll the expanded row.
Brief
P5 ✓ · Flattest of the three. P6 mixed — less room for evidence.
Lean · Variant B (drawer)
Inbox stays visible on the left so you don't lose your place in the queue; the right pane has enough room for the diff + impact + validation without scroll. A and C are the polar ends — A maximises evidence room (but is deep), C maximises flatness (but cramps the evidence). B is the calibrated middle. Awaiting your pick.
D
Flow D · Continuous
Monitor auto-ships, escalates only when needed.
D1 autonomous visibility — what the user sees when no human decision is required. Safe changes ship without approval; anything past the "I need help" threshold escalates to the Findings inbox.

D1 · Continuous mode — autonomous visibility

When a monitor is in Continuous mode, the user isn't blocked — the monitor ships safe changes itself. But the user still wants visibility into what's happening. This is the monitor's surface when no human decision is required.

Single wireframe · concept
Continuous monitor · live activity stream + escalation strip
In play
Invoice agent › Learning › Accuracy
Search…
← Monitors Accuracy ● CONTINUOUS · autonomous
⚠ Needs your call · 1 trade-off finding escalated · classify_doc model swap (cost ↓12% · acc ↓1.1pt)
86%
Goal: 95% · Continuous since Apr 19 (20d)
▲ +2.4 pt 7d · +12 pt 30d
Auto-shipped last 24h
4 changes · 0 rollbacks · View 30d
04:12 · auto · prompt · parse_date_field · re-tuned on date cluster · 78 → 94% on cluster · 0 reg
02:47 · auto · prompt · match_vendor · German handwriting prompt · +0.3 pt overall · 21 ex
Yesterday · memory · absorbed 12 HITL corrections → stitch_multipage
Yesterday · prompt · fetch_pdf · retry policy update · −3% transient errors
Signal inflow · 1,247 production runs · 47 HITL corrections · 412 golden replays · last 24h · Tune signal sources →
Anatomy
Escalation strip (top) · KPI + sparkline · auto-shipped activity stream · signal inflow footer.
Bet
Even in continuous mode the user wants to glance at what happened. Activity stream is the answer. Escalation strip is the only time the monitor demands attention.
Brief
P7 ✓ P8 ✓ · Every shipped change has rollback inline.
E
Flow E · Datasets
Verification & training data the monitors use.
E1 index — three sources: production (promoted runs), upload (JSONL / CSV), synthetic (rule-generated edge cases). Each monitor picks a dataset to verify against in its config.

E1 · Datasets — verification & training data

The substrate the monitors verify against. Three sources: production (promoted runs), upload (JSONL / CSV), synthetic (rule-generated edge cases). Each monitor picks a dataset to verify against in its config (see A2 modal · Verify field).
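
One row of the index, sketched as a record; names are illustrative, not a real schema.

```ts
// Hypothetical record behind one row of the E1 datasets table.
interface Dataset {
  name: string;                                  // e.g. "golden_v3"
  source: "production" | "upload" | "synthetic"; // the three sources above
  examples: number;                              // e.g. 412
  passRate: number;                              // e.g. 0.94
  usedBy: string[];                              // monitors that verify against it
  autoRefresh?: { window: string };              // e.g. production_7d rolling window
}

const golden_v3: Dataset = {
  name: "golden_v3",
  source: "upload",
  examples: 412,
  passRate: 0.94,
  usedBy: ["accuracy", "robustness"],
};
```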

Single wireframe · concept
Datasets index · table with sources + usage
In play
Invoice agent › Learning › Datasets
Search datasets…
Datasets
Beam runs replay validation against these before shipping changes. Build new ones from production runs, upload, or synthetic edges.
Monitors Findings Datasets Audit
All sources Production Upload Synthetic
Name | Source | Examples | Pass rate | Used by
golden_v3 · curated invoices | upload | 412 | 94% | Accuracy · Robust.
production_7d · rolling window | production | 1,247 | 86% | Accuracy · Cost · Lat.
date_format_cluster · failures | production | 47 | 78% → 94% | Accuracy
multi_currency_edges · synthetic | synthetic | 200 | 62% ⚠ | Robustness
vw_handwriting · field captures | upload | 89 | 72% | Accuracy
5 datasets · 1,995 total examples · last updated 4m ago (production_7d auto-refreshes)
Anatomy
Sub-nav · source filter chips · "+ Build dataset" CTA · table with name / source / examples / pass rate / used-by / open
Bet
Show usage. Each dataset says which monitors reference it — that's how the user knows which to curate vs leave alone.
Brief
P6 ✓ · Datasets ARE the evidence substrate.
F
Flow F · Audit
Immutable history — read-only, exportable.
F1 log — every change, every signal, every rollback. The page someone reaches when they need to answer "what happened, who did it, when, and can we revert it?"

F1 · Audit log — immutable history

Every change, every signal, every rollback. Read-only. Exportable. Filterable. This is the page someone reaches when they need to answer "what happened, who did it, when, and can we revert it?"
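
A sketch of the event record one audit row could deserialize to; the append-only semantics live in the comments, and the names are hypothetical throughout.

```ts
// Hypothetical immutable audit event: one row of the F1 table.
// Append-only: events are never edited in place; a rollback is a new event.
interface AuditEvent {
  at: string;                              // timestamp, e.g. "Sat 14:33"
  monitor: "accuracy" | "cost" | "latency" | "robustness";
  actor: { kind: "beam-auto" | "human"; name?: string }; // e.g. Anna S., Marc K.
  action: "ship" | "rollback" | "approve" | "config-edit" | "signal";
  detail: string;                          // e.g. "classify_doc model swap reverted"
  effect?: string;                         // e.g. "−11% acc on multi-currency"
  diffRef?: string;                        // pointer for expand-for-diff (▾)
}
```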

Single wireframe · concept
Audit log · filterable event feed with diff preview
In play
Invoice agent › Learning › Audit
Search by tool / ID…
Audit log
Every change is recorded here · read-only · exportable to CSV / JSONL.
Monitors Findings Datasets Audit
Last 30d By monitor ▾ By type ▾ By actor ▾
When | Monitor | Actor | Action | Effect
04:12 today | Accuracy | Beam · auto | Ship · prompt rewrite on parse_date_field | 78 → 94% cluster
02:47 today | Accuracy | Beam · auto | Ship · prompt rewrite on match_vendor | +0.3 pt overall
Yesterday 18:22 | Accuracy | Anna S. | Config edit · mode Approve → Continuous | graduation @ 30d stable
Sat 14:33 | Cost | Beam · auto | Rollback · classify_doc model swap reverted | −11% acc on multi-currency
Sat 14:21 | Cost | Marc K. | Approve · classify_doc · GPT-4o → Haiku | cost −12%
Fri 11:08 | Robust. | Anna S. | Signal · new cluster detected: multi-currency | 8% of traffic
Showing 6 of 247 events · scroll for more · expand any row (▾) for full diff and signal trail
Anatomy
Sub-nav · filter chips (time / monitor / type / actor) · Export · table with when / monitor / actor / action / effect / expand-for-diff
Bet
Same row shape as the Findings activity table — visual continuity, just more columns and filterable.
Brief
P7 ✓ · Reversibility = visible rollback history; P8 ✓ · everything Beam did, visible. Updated: actor column now shows avatar + name (Anna S., Marc K.) — Beam · auto rows use a filled dark avatar so it's distinguishable at a glance.

#15 Metrics philosophy — accuracy isn't enough

From round-2 sparring: "Accuracy is a good metric, but initially when the agent is trained on a small use case, its accuracy is gonna be high. Then as we increase the scope of the agent… its accuracy will go down, even though the agent became better."

Right. Bare accuracy rewards staying narrow. An agent at 95% on 5 task types looks "better" than one at 80% on 50 — but the second is doing 8× the work. Three ways the UI can honor this:

Option | What it does | Cost | Lean
A · Pair accuracy with coverage inline | Every accuracy number shows what it's a percentage of: 86% · 412 ex · 47 distinct task types · 8% un-categorised. Denominator always visible. | Free — just copy changes. | Yes — prevents the trap immediately
B · Coverage as a first-class monitor | Add a 5th monitor type. User sets a goal: "handle 80% of production task types · grow by +5 task types/month." Scope expansion becomes a deliberate act, surfaced as findings. | Medium — adds one tile + one config option + new finding type. | Yes — closes the gap
C · Replace accuracy KPI with a 2-axis chart | On the Accuracy drill-in: scatter of accuracy-per-task-type × task-type-frequency. User sees "95% on common, 60% on long tail." | Expensive — needs real-data visualization. | Carry · for B2 drill-in next round

Proposed direction
Adopt A + B together. A is free and fixes the framing immediately. B promotes Coverage to a first-class monitor — connecting directly to Datasets (via E4, "use this dataset to teach the agent a new task type"). Coverage findings: "new task type detected — multi-currency invoices (8% of traffic, 0 examples in any dataset). Add to scope?" — different glyph from accuracy / cost findings.
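
To make the trap and the Option A fix concrete, a small sketch with the numbers from the quote and tile copy above; the helper name is hypothetical.

```ts
// Why bare accuracy misleads: compare correct answers delivered, not just the rate.
// Assumes equal traffic per task type, purely for illustration.
const narrow = { accuracy: 0.95, taskTypes: 5 };  // looks "better" on the KPI
const broad  = { accuracy: 0.80, taskTypes: 50 }; // actually does ~8.4x the correct work

const correctWorkRatio =
  (broad.accuracy * broad.taskTypes) / (narrow.accuracy * narrow.taskTypes); // ≈ 8.4

// Option A: always render the denominator next to the rate.
function accuracyFooter(exCount: number, taskTypes: number, uncategorisedPct: number): string {
  return `measured on ${exCount} ex · ${taskTypes} task types · ${uncategorisedPct}% un-categorised`;
}
// accuracyFooter(412, 47, 8) → "measured on 412 ex · 47 task types · 8% un-categorised"
```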

How a Coverage monitor reads on the hub

Sketch · proposed addition to A1 / B1
Coverage tile (proposed 5th monitor)
In play
Accuracy APPROVE
86%
/ 95% goal
▲ +2.4 pt 7d
measured on 412 ex · 47 task types · 8% un-categorised
Coverage WATCH
47
task types · 92% of traffic
▲ +5 types 30d
scope gap · 8% of traffic in 3 unknown task types · Triage →
Anatomy
Accuracy tile (left) now carries a "measured on" footer with denominator context. Coverage tile (right) is the proposed new monitor — a separate measure of "how much the agent can do."
Bet
Accuracy + Coverage answer two different questions — "how often right?" and "how broadly?" — together they capture the real direction of travel.
Brief
Closes the philosophy gap. Adds 1 monitor type, 1 finding type ("new task type detected"), 1 connection (Coverage → Datasets via E4).

#16 A4 · Edit existing monitor — modal redraw (replaces S5 drawer)

Same centered modal as A2 (locked round 1) — populated with current values, edit-history strip at the top, mode-graduation hint inline, plus a small Delete monitor action in the footer that A2 setup doesn't need.

Locked · modal pattern matches A2
Edit Accuracy monitor · populated
Locked
Invoice agent › Learning
Search…
Edit Accuracy monitor
Currently 86% · goal 95% · ▲ +2.4 pt this week.
Last edited 3d ago · Anna S. · Shipped since: 6 · Auto-rollbacks: 1
TARGET ACCURACY
95% by Jul 14, 2026
MODE
Off Watch only Human-approve (current) Continuous 🔒 12d
Beam: graduate to Continuous at 30d stable. You're at 18d.
COST CEILING
$0.10/task · hard cap
VERIFY AGAINST
golden_v3 · 412 examples · 94% pass
SIGNAL SOURCES
Production tasks
HITL corrections
Replay vs golden set
Synthetic edge cases
"I NEED HELP" THRESHOLD
Confidence drop > 10pt on any tool
New failure cluster > 5% of traffic
AUTO-ROLLBACK
Revert if -5pt on cluster within 24h
Last fired Sat (classify_doc model swap, multi-currency cluster).
What's new vs A2
Edit-history meta strip (last edited + actor avatar + shipped since + rollbacks) · "Delete monitor" action in footer (destructive, left side) · mode-graduation hint inline · "Save changes" instead of "Save & start listening."
Bet
Same modal pattern as setup — populated vs empty. One mental model, two states.
Brief
P3 ✓ P5 ✓ P7 ✓ · Replaces the stale drawer redraw from S5.

#17 E2 · Dataset detail — examples + usage + pass rate

Click into a dataset from E1. Three sub-tabs: Examples (the rows), Used by (which monitors verify against it), Pass rate (per-monitor accuracy on this dataset over time).

Single wireframe · concept
Dataset detail · default to Examples tab
In play
Invoice agent › Learning › Datasets › golden_v3
Search examples…
← Datasets · golden_v3 · upload · 412 examples
Examples (412) Used by (3 monitors) Pass rate
All 412 · Must-pass 47 · Edge cases 89 · Currently failing 18
ID | Input snippet | Expected | Status
#1247 | Invoice 4827 · DE · DD.MM.YY date format · €647.50 | amount: 647.50 | ✓ pass
#1248 | VW PO #44831 · German handwriting · multi-line | vendor: VW | ✗ fail
#1249 | Multi-currency · USD $200 + EUR €180 in same line | flag for review | ✗ fail
#1250 | Standard ISO date · $1,243.00 · single line | amount: 1243.00 | ✓ pass
Showing 4 of 412 · scroll for more · 394 pass · 18 fail · 94% rate
Anatomy
3 sub-tabs · filter chips (must-pass / edge / failing) · table of examples with pass/fail status · side-rail with Used by + Provenance
Connects to
E3 · "+ Add examples" CTA → curate flow. E4 · "Use this dataset →" CTA → validate / train / measure flow.
Brief
P6 ✓ · Evidence-substrate is visible.

#18 E3 · Curate — build / add examples to a dataset

Add examples to a dataset from three sources: promote from production runs (most common — user sees a failure in the inbox, decides "we should test for this"), upload JSONL/CSV, or generate synthetic edges. Per example: label, must-pass flag, edge-case flag.
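
The per-example annotation, sketched as a record; field names are hypothetical.

```ts
// Hypothetical annotation attached to each example when curating (E3).
interface CuratedExample {
  fromRunId?: string;      // set when promoted from a production run, e.g. a run ID
  label: string;           // expected output, e.g. "flag for review"
  mustPass: boolean;       // a regression here should block shipping a change
  edgeCase: boolean;       // counts toward robustness / cluster coverage
}
```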

Single wireframe · concept
Curate · promote-from-production flow (most common)
In play
Invoice agent › Learning › Datasets › golden_v3 › Add
Add examples to golden_v3
Curate the dataset that drives Accuracy + Robustness verification.
From production Upload JSONL/CSV Generate synthetic
Filter production runs: Failed (47) · Low confidence · HITL corrected · Multi-currency
ID | Production run | Outcome | Mark as…
#48,221 | Multi-currency · USD $200 + EUR €180 | failed · low conf | must-pass · edge
#48,234 | VW PO · German handwriting · MM/DD date | failed · HITL fix | must-pass · edge
#48,290 | Mixed currency · GBP £450 standard | failed · timeout | must-pass · edge
2 selected · adding to golden_v3
Bet
The fastest curation path is "I just saw this fail in production · add it to the dataset so it never fails silently again." That's row 1.
Connects to
From C2 (finding detail) · "Add example to dataset" action lands here pre-populated. From inbox / failed tasks · same flow.
Brief
P6 ✓ · evidence-substrate maintained by the user.

#19 E4 · Use this dataset — 3 actions

From E2's "Use this dataset →" CTA. A dataset can be used three different ways:

Action | What it does | When
Validate | Run replay now — measure pass rate without changing anything | Sanity-check after a config change. Or "is this dataset still relevant?"
Train | Use as a signal source for a monitor — Beam learns from these examples going forward | You've added new must-pass examples and want the agent to internalize them.
Measure | Calculate per-dataset metrics on the relevant monitor (shows up as a sub-section in B2 drill-in) | "How is my agent doing on multi-currency specifically?"
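
The three jobs sketched as a discriminated union, to show that each action carries different parameters; names are hypothetical.

```ts
// Hypothetical: the three ways a dataset can drive the agent (E4 picker).
type DatasetUse =
  | { action: "validate" }                              // run replay now, no agent changes
  | { action: "train"; targetMonitor: string }          // add as a monitor's signal source
  | { action: "measure"; refresh: "daily" | "weekly" }; // track as a sub-metric in B2 drill-in

// Matches the "Already in use" footer below.
const inUse: DatasetUse[] = [
  { action: "validate" },
  { action: "train", targetMonitor: "accuracy" },
  { action: "measure", refresh: "daily" },
];
```
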
Single wireframe · concept
"Use this dataset" action picker
In play
Invoice agent › Datasets › golden_v3 › Use
Use golden_v3 · 412 examples · 47 task types
Pick how this dataset should drive the agent.
VALIDATE
Run replay now
Measure pass rate today against the current agent. No agent changes.
ETA · 2 min
TRAIN
Use as a signal source
Add this dataset to a monitor's signal sources — the agent learns from these going forward.
Target monitor · Accuracy ▾
MEASURE
Track as a sub-metric
Surface accuracy on this specific dataset inside the monitor's drill-in (e.g. "accuracy on multi-currency"). Continuous tracking.
Refresh · daily
Already in use: Validate (3 monitors) · Train (Accuracy) · Measure (Accuracy > multi-currency)
Bet
Three cards = three distinct jobs. User picks based on intent, not action menu.
Closes the loop
This is the answer to "what do I do with these datasets?" — datasets are inert until you Validate, Train, or Measure with them.
Brief
P3 ✓ P6 ✓ · Datasets become first-class inputs to the monitor.