TL;DR · v2 · locks from sparring round 1
Today's Learning Hub is keyed off engineering pillars. The redesign keys it off what the user is trying to achieve. Each axis becomes a monitor the user sets up: it listens to a metric, gathers data, surfaces work, and asks for help when it needs you.
Locked this round: mental model = monitors (was "axes" — sharper metaphor) · A2 configure affordance = centered modal (Variant B picked) · edit pattern = same modal, populated · configuration extended with cost ceiling, signal sources, "I need help" threshold, notifications · Inbox is primary, not alternative — partner to the monitor hub.
How to read this doc. Eight screens, each drawn as a gray-and-white wireframe. Variants live inside .variation cards — the picked one is oat-tinted and tagged Pick; rejected/carried variants stay visible with their reason. The page chrome (ivory bg, serif headlines, clay accents) is the doc's voice; the wireframes are pure neutral grays — they describe structure, not visual design.
Round 2 · added in this pass
Per round-1 feedback the doc now adds: problem statement (#00b) · sub-nav rationale (#01b) · C2 Finding detail with 3 variants (HITL review screen, click-through from inbox) · D1 Continuous monitor visibility · E1 Datasets index · F1 Audit log. Flow labels reorganised: S1–S7 → A1–F1 (Setup / Operate / Decision / Continuous / Datasets / Audit). Still queued: flatter B2 drill-in variants · A4 modal redraw to match locked A2.
The user has an agent in production. They want it to keep getting better against the things they care about — accuracy, cost, latency, robustness, drift, capability.
Audit — read the immutable log of every change, signal, and rollback. (Audit tab.)
Each monitor runs in one of two operational modes:
HITL · Approve
Monitor proposes — user decides.
Every change surfaces as a finding in the inbox. User reviews, ships or rejects. Default mode for new monitors. Auto-rollback armed.
Continuous
Monitor auto-ships — escalates only when needed.
Safe changes ship without approval. Anything that breaches the "I need help" threshold (cluster gap, confidence drop, trade-off) escalates to the inbox. Rollback always available.
Why both modes: the trust ladder. Users start in HITL-Approve, see the monitor working well, and graduate it to Continuous when they trust it. Trust is per-monitor — Accuracy can be Continuous while Cost is still HITL.
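A minimal sketch of how the trust ladder could be modeled, in TypeScript — all names are hypothetical; only the four modes and the 30-day stability window come from this doc:

```ts
// Hypothetical model of the per-monitor trust ladder. Not a real Beam API.
type MonitorMode = "off" | "watch" | "approve" | "continuous";

interface TrustState {
  mode: MonitorMode;
  stableApprovalDays: number; // consecutive days of approvals with no rollback
}

// "Continuous unlocks after 30d of stable approvals" (A2 modal copy).
const GRADUATION_WINDOW_DAYS = 30;

// Trust is per-monitor: Accuracy can graduate while Cost stays in Approve.
function canGraduate(state: TrustState): boolean {
  return state.mode === "approve" && state.stableApprovalDays >= GRADUATION_WINDOW_DAYS;
}
```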
#01 Sub-navigation — which tabs make sense
The per-agent Learning route had Findings · Tuning · Audit · Datasets as historical labels. After round-1 sparring, here's what each tab is and which to keep:
Tab · Role · Decision · Drawn in
Monitors · Home — all monitors at a glance, set up new ones, drill in · Keep. Renamed from "Goals" to match the mental model. · A1, A3, B1, B2
Findings · Inbox of decisions: pending + recently shipped + watching · Keep. Primary surface for HITL-Approve mode. · C1, C2
Datasets · Verification + training data the monitors use · Keep. CRUD on golden sets, edge cases, synthetic. · E1
Audit · Immutable log — every change, signal, rollback · Keep. Compliance-grade, read-only, exportable. · F1
Tuning · (historical label) · Drop. Split: active tuning state → Findings, history → Audit. · —
#02 Principles
Every variant in this doc is checked against these.
P1 · Goals are the spine. Page keyed off what the user is trying to achieve, not Beam's internal taxonomy.
P2 · Scalable to new monitors & surfaces. A new monitor (drift, safety, capability) or a new surface (model swap, graph reorg) doesn't require a redesign.
P3 · Customization first-class. Each goal has explicit setup — target, mode, verification, rollback, trade-offs.
P4 · Per-monitor trust ladder. Off → Watch → Approve → Continuous, set independently per monitor.
P5 · Operational tone. Beam shell at runtime density. No marketing copy.
P6 · Every claim has evidence. Findings show dataset, validation, deltas, risks.
P7 · Reversible by default. Auto-rollback armed; every shipped change has rollback.
P8 · Show what Beam is doing. Pending + in-flight + recent activity all visible.
#03 Flow map
Hot path (thick stroke) = first-time owner setting up the first goal. Clay nodes = open forks. Dashed edges = loops or the alternative spine.
Spine decision
Cockpit vs Inbox is a real spine fork. They answer different first questions ("am I on track?" vs "what needs my call?"). Doc carries both forward — synthesis at #08 ships them as sub-nav tabs.
A
Flow A · Setup
From empty hub to a configured monitor.
A1 empty hub · A2 configure (modal) · A3 1-of-N saved · A4 edit. The first-time owner path. Lives in the Monitors tab.
A1 · Empty hub — locked
First-run state. The hero CTA replaces today's "Tool optimisation" card; the 4 empty axis tiles replace today's "Test & eval / Agent tuning" coming-soon pillars. Goals, not pillars (P1, P2).
Locked
No goals yet · pick the first axis to set up
Locked
B
Invoice agent›Learning
Search…
⚡ 2.5
Learning hub
Tell Beam what to optimise for. Set up a monitor per metric.
Goals · Findings · Datasets · Tuning · Audit
◎
Set what you want this agent to optimise for
A monitor listens to a metric, gathers data, surfaces work, and asks for your help. Pick one to start.
Accuracy
Track how often the agent gets it right.
Most teams start here
Cost
Track $/task & per-tool tokens.
Latency
Track p95/p99; find bottlenecks.
Robustness
Watch silent regressions & clusters.
Answers
"What does Beam want from me to start?"
New components
Hub empty CTA · Monitor tile (empty)
Brief
P1 ✓ P2 ✓
A2 · Configure (setup affordance) — LOCKED on Variant B (modal)
The customization screen that was missing from the dashboard. Same form across every variant — what differs is where the form lives. Locked on Variant B (centered modal) in round 1: "helps focus on configuration." The other variants are kept visible with their reasons so future readers see the alternatives explored.
Variant A
Right-rail drawer (480px) · Beam-native
Reject · was previous lean
B
Invoice agent›Learning
Search…
⚡ 2.5
Learning hub
Goals · Findings · Datasets · Tuning
Accuracy
Track…
Cost
Track…
Latency
Track…
Robust.
Watch…
Set up Accuracy goal ✕
Tell Beam how accurate this agent should be, how to verify, when it can act on its own.
Target accuracy *
95% by Jul 14
90% · 95% · 99% · Custom…
Mode
Off
Watch only
Human-approve
Beam proposes, you decide.
Continuous
🔒 30d
Verify against
golden_v3 · 412 ex ▾
Auto-rollback
Revert if -5pt on cluster / 24h
Trade-off rules
No cost > $0.05/task
No p95 > +1s
Editable later.
Bet
Context preservation; Beam-native (Tasks/Flow use drawers).
Weakest at
Vertical cramping for the now-meaty 8-section config form — drawer is too narrow.
Brief
P5 ✓ · Rejected in favour of Variant B: user preferred modal for "focus on configuration." Drawer is reused at A4 for quick edits.
Variant B · Pick
Centered modal · extended configuration (cost ceiling, signal sources, help threshold, notifications)
Pick
B
Invoice agent›Learning
Search…
⚡ 2.5
Set up Accuracy monitor ✕
A monitor listens to a metric, gathers data, surfaces work, and asks for your help when it needs you.
TARGET ACCURACY *
95% by Jul 14, 2026
90% · 95% · 99% · Custom
MODE
Off · Watch only · Human-approve · Continuous 🔒 30d
Beam proposes, you decide. Continuous unlocks after 30d of stable approvals.
COST CEILING — hard cap per task
$0.10/task · don't ship anything that breaches this
VERIFY AGAINST
golden_v3 · 412 examples · 94% pass ▾
SIGNAL SOURCES — what this monitor learns from
Production tasks · score every run (live)
HITL corrections · learn from human fixes
Replay vs golden set · scheduled validation
Synthetic edge cases · stress-test against generated examples
"I NEED HELP" THRESHOLD — when does the monitor escalate
Confidence drop > 10pt on any tool
New failure cluster > 5% of traffic
Stuck for > 48h without progress
AUTO-ROLLBACK
Revert if -5pt on cluster within 24h
Strong default for accuracy. Most rollbacks fire within minutes.
NOTIFICATIONS — how the monitor surfaces work
Inbox only · + toast · + banner · + email
All settings can be edited later · open this modal from the monitor's ⋯ menu
Bet
Focus on configuration — no peripheral distraction. Picked over the drawer because the form is now meaty (8 sections); a modal commits the user to making the decision before returning to the hub.
Weakest at
Heavier for "just toggle mode" edits — but the inline mode-pill popover (see A4 Variant B) handles that case.
Brief
P3 ✓ P7 ✓ · Same modal handles setup AND edit.
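For concreteness, a hypothetical shape for what the locked A2 modal collects — field names are illustrative assumptions; the eight sections and the example values come from the wireframe above:

```ts
// Sketch of the extended monitor configuration. Names are assumptions.
interface MonitorConfig {
  target: { value: number; deadline: string };                     // "95% by Jul 14, 2026"
  mode: "off" | "watch" | "approve" | "continuous";
  costCeilingUsdPerTask: number;                                   // hard cap, e.g. 0.10
  verifyAgainst: string;                                           // dataset id, e.g. "golden_v3"
  signalSources: Array<"production" | "hitl_corrections" | "replay" | "synthetic">;
  helpThresholds: {                                                // when the monitor escalates
    confidenceDropPt: number;                                      // e.g. 10
    newClusterTrafficPct: number;                                  // e.g. 5
    stuckHours: number;                                            // e.g. 48
  };
  autoRollback: { dropPtOnCluster: number; withinHours: number };  // revert if -5pt / 24h
  notifications: Array<"inbox" | "toast" | "banner" | "email">;    // inbox is always on
}
```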
Variant C
Dedicated setup page · evidence side panel
Reject · evidence relocated
B
Invoice agent›Learning›Set up Accuracy
Set up Accuracy goal
Tell Beam how accurate this agent should be, how it should verify, when it can act on its own.
TARGET
95% by Jul 14
90% · 95% · 99% · Custom
MODE
Off
Watch
Approve (rec.)
Continuous
🔒
AUTO-ROLLBACK
Revert if -5pt / 24h
Bet
Live tool scores while configuring — informs target choice (P6).
Weakest at
Click cost; heavy for editing.
Brief
P3 ✓ P6 ✓; P5 mixed (no page-form precedent in Beam)
Variant D
Inline expand — tile replaces itself with form
Reject · ergonomic relocated to A4-B
B
Invoice agent›Learning
Learning hub
Goals · Findings · Datasets
Cost
Set up Accuracy
✕
Target
95%
Mode
Off · Watch · Approve · Cont.
Latency
Robust.
Bet
Minimal perceived interruption.
Weakest at
Cramped in 1 tile width; doesn't scale to 4-axis setup back-to-back.
Brief
P3 ✓; P5 mixed (no precedent)
Trade-off table
Variant · Best at · Weakest at · Brief
A · Drawer · Context preservation; Beam-native · Cramped for 8-section config form · P5 ✓
B · Modal · Focus on configuration; room for the meaty form · Heavy for "just toggle mode" edits · P3 ✓ P7 ✓
C · Page · Live evidence side panel · Click cost; heavy for editing · P3 ✓ P6 ✓; P5 mixed
D · Inline · Lightest perceived interruption · Cramped; doesn't scale · P3 ✓; P5 mixed
Pick · Variant B (Modal) · locked round 1 · "Helps focus on configuration." The agent's initial lean was Variant A (drawer) for Beam-nativeness, but the user overrode: with the extended config (now 8 sections — target / mode / cost ceiling / verify / signals / help threshold / rollback / notifications) the form is too meaty for a 480px drawer, and a centered modal forces the user to commit to setting it up before returning to the hub.
Reject-or-relocate: C's evidence panel relocated into the drill-in side panel (B2). D's inline ergonomic carried to A4 as a 1-click mode-pill popover for quick edits. Drawer (A) reused at A4 only for partial edits where the modal feels heavy.
A3 · 1-of-4 (between-state) — locked
Between "I clicked save" and "Beam has data" there's a several-minute gap. The Accuracy tile's "Gathering first measurement…" state is the only signal the goal actually landed in a real measurement loop. P8.
Locked
Accuracy populated, others "+ Set up"
Locked
B
Invoice agent›Learning
Search…
⚡ 2.5
Learning hub
Goals · Findings · Datasets · Tuning · Audit
Mission
Reach 95% accuracy by Jul 14, 2026.
Add cost / latency / robustness for more leverage.
Owner
Anna S.
Verify
golden_v3
1 of 4 monitors set up. Add cost or latency so Beam can flag trade-offs. Add another →
Accuracy
Approve
—%
/ 95% goal
Gathering first measurement…
Verify · golden_v3 · 412 ex
In flight · First replay queued · ETA 5m
Cost
Track $/task & tokens.
Latency
Track p95 / p99.
Robustness
Watch silent regress.
Answers
"Did my goal save? What now?"
Key state
"—% · Gathering first measurement… · ETA 5m"
Brief
P8 ✓
A4 · Edit affordance — LOCKED: same modal as setup + inline mode-pill
Edit-existing-configuration uses the same centered modal as A2 setup, populated with current values + an edit-history meta strip. For the most frequent edit (mode change → graduation Approve → Continuous), a 1-click inline mode-pill popover lives on the monitor tile. The wireframe below still shows the right-rail drawer pattern — slated for redraw next round to match the locked A2 modal.
Variant A · needs redraw
Edit modal · same as A2 (currently shown as drawer — redraw pending)
Carry · redraw as modal
B
Invoice agent›Learning
Search…
Learning hub
Goals · Findings · Datasets
Acc.
86%
Cost
$0.34
Lat.
4.1s
Robust
2
Edit Accuracy goal ✕
Now 86% / 95%. ▲ +2.4 pt this week.
Last edit: 3d ago · Shipped: 6 · RB: 1
Target
95% by Jul 14
Mode
Off
Watch
Approve (current)
18d stable so far.
Continuous
unlocks in 12d
🔒
Beam: graduate at 30d. You're at 18d.
Verify
golden_v3 · 412 ex ▾
Auto-rollback
Revert -5pt / 24h
Last fired Sat.
Applies to NEW findings.
Bet
Same modal as A2 — populated vs empty. Cognitive model unified.
Weakest at
4-click cost for "toggle mode" — handled by Variant B inline popover below.
Brief
P3 ✓ P7 ✓ · Pending redraw to show modal not drawer.
Variant B · Layer on A
Inline mode-pill popover on the tile
Carry · enhancement
B
Invoice agent›Learning
Learning hub
Goals · Findings
Accuracy
Approve ▾
86%
/ 95%
▲ +2.4 pt
Change mode
Off
Watch
Approve (current)
Continuous
🔒 12d
Cost
Watch
$0.34
Lat.
Watch
4.1s
Robust
Watch
2
Bet
1-click for most common edit (mode graduation Approve → Continuous).
Weakest at
Two patterns to maintain — first-time users hit inconsistency.
Brief
P5 ✓; P3 mixed — only mode is inline, rest goes through drawer
Variant C
Drill-in Settings tab only
Reject
B
Invoice agent›Learning›Accuracy
← GoalsAccuracy
Findings · Activity · Watching · Settings
Goal settings
Target
95% by Jul 14
Mode
Approve ▾
Verify
golden_v3 ▾
Rollback
-5pt threshold
Bet
Single source of truth — one place for all axis controls.
Weakest at
3-click depth (hub → drill-in → settings → form) — too slow for mode change.
Brief
P5 ✓; P3 ✗ (not first-class at hub level)
Pick · same modal as A2 + inline mode-pill · locked round 1 · Same modal handles both setup and edit — populated with current values, edit-history meta strip at top, mode-graduation hint inline (Variant A). For the most frequent edit — mode change at the trust-ladder moment (Approve → Continuous) — a 1-click inline mode-pill popover lives on the monitor tile (Variant B). Variant C (drill-in Settings tab) rejected: 3-click depth too slow for the most common edit.
B
Flow B · Operate
Live monitors at a glance, drill in when you need to.
B2 · Monitor detail (drill-in · Open Accuracy) — locked
Per-monitor full page. KPI + sparkline, scoped findings, side-rail with monitor settings + Focus tools (pulled from live Flow page node-quality). Pending: flatter drill-in variants per round-1 feedback — full-page nav reads as too deep. Three flatter bets to draw next: inline expand-in-place, sticky detail panel under the hub, or "no drill-down" (everything important on the tile).
C
Flow C · Decision
Review what the monitors surface and make the call.
C1 combined inbox (one unified list of decisions) · C2 finding detail (the HITL review screen). Primary surface for HITL-Approve mode. Lives in the Findings tab.
C1 · Combined Inbox — primary · locked round 1
Promoted from alternative spine → primary surface. Confirmed in round 1: "I very much like the inbox. This is also not bad, because this is something we got as feedback in the user interview as well. If there is something, just surface it up." The inbox is where the monitors push things that need the user — it's the partner surface to the monitor hub, not a fallback.
Locked · sub-nav tab
Briefing headline · unified decisions list · recent activity
Locked
B
Invoice agent›Learning
Search…
⚡ 2.5
Learning hub
Mode: ● Continuous · you approve high-risk only
Inbox · Goals · Datasets · Tuning · Audit
This week · May 1–8
Your agent got +2.4 pts more accurate, no change in cost, 0.6s faster on p95.
Mon · auto · accuracy · parse_date · prompt · 78→94% on cluster
Sat · ↺ · cost · classify_doc · swap rolled back · −11% acc on multi-currency
Answers
"What needs my call right now?"
When it wins
Multi-axis trade-offs (one card carries both deltas); autopilot day
Brief
P6 P7 P8 ✓
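For concreteness, a hypothetical record behind one inbox card — illustrative only; it shows why a multi-axis trade-off is naturally one card (one finding carrying several deltas):

```ts
// Sketch of a finding as the inbox might hold it. Names are assumptions.
interface Finding {
  id: string;                                        // e.g. "#140"
  monitor: "accuracy" | "cost" | "latency" | "robustness";
  change: { kind: "prompt" | "model_swap" | "tool_config"; summary: string };
  deltas: Partial<Record<"accuracy" | "cost" | "latency", number>>; // signed, one per affected metric
  tradeOff: boolean;                                 // true if a delta moves against another monitor's goal
  status: "pending" | "shipped" | "rejected" | "rolled_back" | "watching";
}
```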
Cockpit vs Inbox — when each wins
Use case · Cockpit (B1) · Inbox (C1)
"Am I on track?" · Wins · Briefing helps, scroll needed
"What needs my call?" · Pending buried in tile · Wins — list IS the page
Multi-axis trade-offs · Span 2 tiles awkwardly · Wins — one card, both deltas
First-time owner · Wins — empty axes guide setup · Empty inbox is a dead end
Autopilot day · Tiles feel static · Wins — Activity is the story
Pick · both ship as primary · locked round 1 · Monitor hub and Inbox are partner surfaces, not "main + alternative." Hub is "what am I tracking and how is each monitor doing." Inbox is "what needs my call right now." Both reachable as sub-nav tabs from the per-agent Learning route. Hub's summary footer deep-links to Inbox; Inbox stat tiles link back to Hub per monitor.
C2 · Finding detail — 3 variants · awaiting pick
What happens when the user clicks a finding from the C1 inbox to review it. This is the HITL decision screen — where the user actually says ship / reject. The form is the same in every variant (diff + validation + decision bar); what differs is where the screen lives and how it interrupts the inbox.
#140 · Swap classify_doc · cost · model swap · ⚠ trade-off · 1d · click to review
Variant C · Inline expand — the finding row expands in place inside the inbox
Bet
No page nav at all. Pure progressive disclosure inside the inbox.
Weakest at
Cramped — the diff + impact need to fit in 1 row's width. Long validation lists scroll the expanded row.
Brief
P5 ✓ · Flattest of the three. P6 mixed — less room for evidence.
Lean · Variant B (drawer). Inbox stays visible on the left so you don't lose your place in the queue; the right pane has enough room for the diff + impact + validation without scroll. A and C are the polar ends — A maximises evidence room (but is deep), C maximises flatness (but cramps the evidence). B is the calibrated middle. Awaiting your pick.
D
Flow D · Continuous
Monitor auto-ships, escalates only when needed.
D1 autonomous visibility — what the user sees when no human decision is required. Safe changes ship without approval; anything past the "I need help" threshold escalates to the Findings inbox.
D1 · Continuous mode — autonomous visibility
When a monitor is in Continuous mode, the user isn't blocked — the monitor ships safe changes itself. But the user still wants visibility into what's happening. This is the monitor's surface when no human decision is required.
Single wireframe · concept
Continuous monitor · live activity stream + escalation strip
In play
B
Invoice agent›Learning›Accuracy
Search…
← Monitors · Accuracy · ● CONTINUOUS · autonomous
⚠ Needs your call · 1 trade-off finding escalated · classify_doc model swap (cost ↓12% · acc ↓1.1pt)
86%
Goal: 95% · Continuous since Apr 19 (20d)
▲ +2.4 pt 7d · +12 pt 30d
Auto-shipped last 24h
4 changes · 0 rollbacks · View 30d
04:12 · auto · prompt · parse_date_field · re-tuned on date cluster · 78 → 94% on cluster · 0 reg
02:47 · auto · prompt · match_vendor · German handwriting prompt · +0.3 pt overall · 21 ex
Even in continuous mode the user wants to glance at what happened. Activity stream is the answer. Escalation strip is the only time the monitor demands attention.
Brief
P7 ✓ P8 ✓ · Every shipped change has rollback inline.
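A sketch of the dispatch this screen implies — safe changes auto-ship, anything past an "I need help" threshold escalates. Function and field names are hypothetical; the threshold values are the A2 defaults:

```ts
// Continuous-mode decision: ship or escalate. Illustrative only.
interface ChangeSignals {
  confidenceDropPt: number;      // worst per-tool confidence drop this change causes
  newClusterTrafficPct: number;  // size of any new failure cluster it surfaces
  breachesTradeOffRule: boolean; // e.g. cost ceiling or latency rule violated
}

type Disposition = { action: "auto_ship" } | { action: "escalate"; reason: string };

function decide(s: ChangeSignals): Disposition {
  if (s.confidenceDropPt > 10) return { action: "escalate", reason: "confidence drop > 10pt" };
  if (s.newClusterTrafficPct > 5) return { action: "escalate", reason: "new cluster > 5% of traffic" };
  if (s.breachesTradeOffRule) return { action: "escalate", reason: "trade-off rule breached" };
  return { action: "auto_ship" }; // auto-rollback stays armed either way (P7)
}
```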
E
Flow E · Datasets
Verification & training data the monitors use.
E1 index — three sources: production (promoted runs), upload (JSONL / CSV), synthetic (rule-generated edge cases). Each monitor picks a dataset to verify against in its config.
E1 · Datasets — verification & training data
The substrate the monitors verify against. Three sources: production (promoted runs), upload (JSONL / CSV), synthetic (rule-generated edge cases). Each monitor picks a dataset to verify against in its config (see A2 modal · Verify field).
Single wireframe · concept
Datasets index · table with sources + usage
In play
B
Invoice agent›Learning›Datasets
Search datasets…
Datasets
Beam runs replay validation against these before shipping changes. Build new ones from production runs, upload, or synthetic edges.
vw_handwriting · field captures · upload · 89 ex · 72% · Accuracy
5 datasets · 2,995 total examples · last updated 4m ago (production_7d auto-refreshes)
Anatomy
Sub-nav · source filter chips · "+ Build dataset" CTA · table with name / source / examples / pass rate / used-by / open
Bet
Show usage. Each dataset says which monitors reference it — that's how the user knows which to curate vs leave alone.
Brief
P6 ✓ · Datasets ARE the evidence substrate.
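A hypothetical record for one index row, mirroring the Anatomy columns above — names are illustrative:

```ts
// Sketch of a Datasets-index row. Not a real Beam schema.
interface DatasetRow {
  name: string;                                  // e.g. "golden_v3"
  source: "production" | "upload" | "synthetic";
  exampleCount: number;                          // e.g. 412
  passRatePct: number;                           // latest replay pass rate, e.g. 94
  usedBy: string[];                              // monitors that verify against it
  autoRefreshes: boolean;                        // e.g. production_7d refreshes itself
}
```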
F
Flow F · Audit
Immutable history — read-only, exportable.
F1 log — every change, every signal, every rollback. The page someone reaches when they need to answer "what happened, who did it, when, and can we revert it?"
F1 · Audit log — immutable history
Every change, every signal, every rollback. Read-only. Exportable. Filterable. This is the page someone reaches when they need to answer "what happened, who did it, when, and can we revert it?"
Single wireframe · concept
Audit log · filterable event feed with diff preview
In play
B
Invoice agent›Learning›Audit
Search by tool / ID…
Audit log
Every change is recorded here · read-only · exportable to CSV / JSONL.
Sat 14:33 · Cost · Beam · auto · Rollback · classify_doc model swap reverted · −11% acc on multi-currency
Sat 14:21 · Cost · Marc K. · Approve · classify_doc · GPT-4o → Haiku · cost −12%
Fri 11:08 · Robust. · Anna S. · Signal · new cluster detected: multi-currency · 8% of traffic
Showing 6 of 247 events · scroll for more · expand any row (▾) for full diff and signal trail
Anatomy
Sub-nav · filter chips (time / monitor / type / actor) · Export · table with when / monitor / actor / action / effect / expand-for-diff
Bet
Same row shape as the Findings activity table — visual continuity, just more columns and filterable.
Brief
P7 ✓ · Reversibility = visible rollback history; P8 ✓ · everything Beam did, visible. Updated: actor column now shows avatar + name (Anna S., Marc K.) — Beam · auto rows use a filled dark avatar so it's distinguishable at a glance.
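A sketch of what one append-only event could look like, mirroring the columns above — illustrative, not a real Beam schema:

```ts
// Audit events are only ever appended and exported (CSV / JSONL), never edited.
interface AuditEvent {
  readonly at: string;                                          // ISO timestamp
  readonly monitor: "accuracy" | "cost" | "latency" | "robustness";
  readonly actor: { kind: "human" | "beam_auto"; name: string };
  readonly action: "signal" | "approve" | "ship" | "rollback";
  readonly effect: string;                                      // e.g. "cost −12%"
  readonly diff?: string;                                       // expand-for-diff payload
}
```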
#14 Sub-nav variants — 3 options · awaiting pick
The sub-nav itself has a fork. The current pick in #01 was object-based — but two alternatives are worth surfacing so you can react.
Variant B · Verb-based tabs
Each tab is a verb — the action the user is here to do. Object names are secondary labels.
Bet
Verb-discoverability. First-time users grasp "Decide" faster than "Findings."
Weakest at
Doesn't match Beam's existing surfaces. Two-word tabs are wider — TOC strip eats horizontal space.
Brief
P3 ✓ Customization-first (Setup is named); P5 mixed not Beam-native.
Variant C
Mode-based · HITL · Continuous · Reference
In play
B
Invoice agent›Learning
Learning hub
Two modes, two surfaces. Plus reference.
HITL Monitors (3) · Continuous Monitors · Datasets · Audit
Mode is the spine. Reference tabs (Datasets, Audit) are demoted right.
Bet
Surfaces the trust ladder at the nav. User sees their HITL vs Continuous monitors at a glance.
Weakest at
Mode-split splits the monitors physically — Accuracy in HITL, Cost in Continuous, no single place to see "the agent." Also doesn't scale if monitors graduate frequently (re-routing on mode change).
Brief
P1 ✗ Goals-as-spine broken if monitors are split by mode.
Lean · Variant A (object-based). Matches Beam's existing surfaces. Variant B's verb-discoverability is a real win for new users, but inconsistency with the rest of the platform costs more than it gains. Variant C breaks P1 (goals-as-spine). Awaiting your pick.
#15 Metrics philosophy — accuracy isn't enough
From round-2 sparring: "Accuracy is a good metric, but initially when the agent is trained on a small use case, its accuracy is gonna be high. Then as we increase the scope of the agent… its accuracy will go down, even though the agent became better."
Right. Bare accuracy rewards staying narrow. An agent at 95% on 5 task types looks "better" than one at 80% on 50 — but the second is doing roughly 8× the work (0.80 × 50 = 40 correctly handled task types vs 0.95 × 5 ≈ 4.8). Three ways the UI can honor this:
Option A · Pair accuracy with coverage inline. Every accuracy number shows what it's a percentage of: 86% · 412 ex · 47 distinct task types · 8% un-categorised. Denominator always visible. Cost: free — just copy changes. Lean: yes — prevents the trap immediately.
Option B · Coverage as a first-class monitor. Add a 5th monitor type. User sets a goal: "handle 80% of production task types · grow by +5 task types/month." Scope expansion becomes a deliberate act, surfaced as findings. Cost: medium — adds one tile + one config option + a new finding type. Lean: yes — closes the gap.
Option C · Replace the accuracy KPI with a 2-axis chart. On the Accuracy drill-in: scatter of accuracy-per-task-type × task-type-frequency. User sees "95% on common, 60% on long tail." Cost: expensive — needs real-data visualization. Lean: carry · for B2 drill-in next round.
Proposed direction · Adopt A + B together. A is free and fixes the framing immediately. B promotes Coverage to a first-class monitor — connecting directly to Datasets (via E4, "use this dataset to teach the agent a new task type"). Coverage findings: "new task type detected — multi-currency invoices (8% of traffic, 0 examples in any dataset). Add to scope?" — different glyph from accuracy / cost findings.
How a Coverage monitor reads on the hub
Sketch · proposed addition to A1 / B1
Coverage tile (proposed 5th monitor)
In play
Accuracy · APPROVE
86%
/ 95% goal
▲ +2.4 pt 7d
measured on · 412 ex · 47 task types · 8% un-categorised
Coverage · WATCH
47
task types · 92% of traffic
▲ +5 types 30d
scope gap · 8% of traffic in 3 unknown task types · Triage →
Anatomy
Accuracy tile (left) now carries a "measured on" footer with denominator context. Coverage tile (right) is the proposed new monitor — a separate measure of "how much the agent can do."
Bet
Accuracy + Coverage answer two different questions — "how often right?" and "how broadly?" — together they capture the real direction of travel.
Brief
Closes the philosophy gap. Adds 1 monitor type, 1 finding type ("new task type detected"), 1 connection (Coverage → Datasets via E4).
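The arithmetic behind the #15 framing, as a worked sketch (illustrative only):

```ts
// "Correctly handled scope" = accuracy × task types covered.
function correctlyHandledScope(accuracy: number, taskTypesCovered: number): number {
  return accuracy * taskTypesCovered;
}

const narrow = correctlyHandledScope(0.95, 5);  // 4.75 — high accuracy, tiny scope
const broad = correctlyHandledScope(0.8, 50);   // 40   — lower accuracy, 10× scope
console.log(broad / narrow);                    // ≈ 8.4 — the "8× the work" from #15
```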
#16 A4 · Edit modal — redrawn to match locked A2
Same centered modal as A2 (locked round 1) — populated with current values, edit-history strip at the top, mode-graduation hint inline, plus a small Delete monitor action in the footer that A2 setup doesn't need.
Locked · modal pattern matches A2
Edit Accuracy monitor · populated
Locked
B
Invoice agent›Learning
Search…
Edit Accuracy monitor ✕
Currently 86% · goal 95% · ▲ +2.4 pt this week · Last edited 3d ago · Anna S. · Shipped since: 6 · Auto-rollbacks: 1
→ Beam: graduate to Continuous at 30d stable. You're at 18d.
COST CEILING
$0.10/task · hard cap
VERIFY AGAINST
golden_v3 · 412 examples · 94% pass ▾
SIGNAL SOURCES
Production tasks
HITL corrections
Replay vs golden set
Synthetic edge cases
"I NEED HELP" THRESHOLD
Confidence drop > 10pt on any tool
New failure cluster > 5% of traffic
AUTO-ROLLBACK
Revert if -5pt on cluster within 24h
Last fired Sat (classify_doc model swap, multi-currency cluster).
What's new vs A2
Edit-history meta strip (last edited + actor avatar + shipped since + rollbacks) · "Delete monitor" action in footer (destructive, left side) · mode-graduation hint inline · "Save changes" instead of "Save & start listening."
Bet
Same modal pattern as setup — populated vs empty. One mental model, two states.
Brief
P3 ✓ P5 ✓ P7 ✓ · Replaces the stale drawer redraw from S5.
#17 E2 · Dataset detail — drill-in from the index
Click into a dataset from E1. Three sub-tabs: Examples (the rows), Used by (which monitors verify against it), Pass rate (per-monitor accuracy on this dataset over time).
Single wireframe · concept
Dataset detail · default to Examples tab
In play
B
Invoice agent›Learning›Datasets›golden_v3
Search examples…
← Datasets · golden_v3 · upload · 412 examples
Examples (412) · Used by (3 monitors) · Pass rate
All 412 · Must-pass 47 · Edge cases 89 · Currently failing 18
ID · Input snippet · Expected · Status
#1247 · Invoice 4827 · DE · DD.MM.YY date format · €647.50 · amount: 647.50 · ✓ pass
#1248 · VW PO #44831 · German handwriting · multi-line · vendor: VW · ✗ fail
#1249 · Multi-currency · USD $200 + EUR €180 in same line · flag for review · ✗ fail
#1250 · Standard ISO date · $1,243.00 · single line · amount: 1243.00 · ✓ pass
Showing 4 of 412 · scroll for more · 394 pass · 18 fail · 94% rate
Anatomy
3 sub-tabs · filter chips (must-pass / edge / failing) · table of examples with pass/fail status · side-rail with Used by + Provenance
#18 E3 · Curate — build / add examples to a dataset
Add examples to a dataset from three sources: promote from production runs (most common — user sees a failure in the inbox, decides "we should test for this"), upload JSONL/CSV, or generate synthetic edges. Per example: label, must-pass flag, edge-case flag.
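A hypothetical per-example record for curation, combining the fields above with the E2 columns — illustrative only; an upload would carry one such object per JSONL line:

```ts
// Sketch of a curated dataset example. Names are assumptions.
interface DatasetExample {
  id: string;          // e.g. "#1249"
  input: string;       // e.g. "Multi-currency · USD $200 + EUR €180 in same line"
  expected: string;    // e.g. "flag for review"
  label?: string;
  mustPass: boolean;   // must-pass flag — a failure here blocks shipping
  edgeCase: boolean;   // edge-case flag
  source: "production" | "upload" | "synthetic";
}
```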