Beam · Learning Hub redesign · wireframes

The agent's monitors are the spine of the hub.

TL;DR · v2 · locks from sparring round 1
Today's Learning Hub is keyed off engineering pillars. The redesign keys it off what the user is trying to achieve. Each axis becomes a monitor the user sets up: it listens to a metric, gathers data, surfaces work, and asks for help when it needs you.

Locked this round: mental model = monitors (was "axes" — sharper metaphor) · A2 configure affordance = centered modal (Variant B picked) · edit pattern = same modal, populated · configuration extended with cost ceiling, signal sources, "I need help" threshold, notifications · Inbox is primary, not alternative — partner to the monitor hub.

Still open: flatter drill-in alternatives · continuous-mode visibility screen · HITL approve flow.

How to read this doc
Eight screens, each drawn as a gray-and-white wireframe. Variants live inside .variation cards — the picked one is oat-tinted and tagged Pick; rejected/carried variants stay visible with their reason. The page chrome (ivory bg, serif headlines, clay accents) is the doc's voice; the wireframes are pure neutral grays — they describe structure, not visual design.

Round 2 · added in this pass
Per round-1 feedback the doc now adds: problem statement (#00b) · sub-nav rationale (#01b) · C2 Finding detail with 3 variants (HITL review screen, click-through from inbox) · D1 Continuous monitor visibility · E1 Datasets index · F1 Audit log. Flow labels reorganised: S1–S7 → A1–F1 (Setup / Operate / Decision / Continuous / Datasets / Audit). Still queued: flatter B2 drill-in variants · A4 modal redraw to match locked A2.

#00 The problem we're solving

The user has an agent in production. They want it to keep getting better against the things they care about — accuracy, cost, latency, robustness, drift, capability.

The mechanism is monitors. Each one:

  • Tracks one metric (accuracy / cost / latency / robustness / …).
  • Knows the user's target (e.g. 95% accuracy by Jul 14).
  • Knows the user's budget (e.g. don't ship anything that pushes cost over $0.10/task).
  • Listens to signals — production tasks, HITL corrections, golden-set replay, synthetic edges.
  • Surfaces work when it has a fix to propose, or escalates when it needs help.
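
As a concrete reading of that list, a minimal sketch of what one monitor's configuration could serialize to. All names (MonitorConfig, helpThreshold, and so on) are hypothetical illustrations, not Beam's actual schema; the values echo the examples above.

```ts
// Hypothetical shape for one monitor's config. Mirrors the list above;
// names are illustrative, not Beam's real schema.
type Metric = "accuracy" | "cost" | "latency" | "robustness";
type Mode = "off" | "watch" | "approve" | "continuous";

interface MonitorConfig {
  metric: Metric;
  target: { value: number; by: string };           // e.g. 95 (%) by "2026-07-14"
  budget?: { maxCostPerTask: number };             // e.g. 0.10 ($/task hard cap)
  signals: ("production" | "hitl" | "golden_replay" | "synthetic")[];
  helpThreshold: {                                 // when the monitor escalates
    confidenceDropPt?: number;                     // e.g. 10 (pt drop on any tool)
    newClusterTrafficPct?: number;                 // e.g. 5 (% of traffic)
  };
  autoRollback: { dropPt: number; windowHours: number }; // revert if -5pt / 24h
  mode: Mode;
}

const accuracy: MonitorConfig = {
  metric: "accuracy",
  target: { value: 95, by: "2026-07-14" },
  budget: { maxCostPerTask: 0.10 },
  signals: ["production", "hitl", "golden_replay"],
  helpThreshold: { confidenceDropPt: 10, newClusterTrafficPct: 5 },
  autoRollback: { dropPt: 5, windowHours: 24 },
  mode: "approve",
};
```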

The interface gives the user the tools to do three jobs:

  1. Train — provide labels, corrections, golden examples. (Datasets tab.)
  2. HITL optimize — review monitor findings, approve / reject / rollback. (Findings tab.)
  3. Audit — read the immutable log of every change, signal, and rollback. (Audit tab.)

Each monitor runs in one of two operational modes:

HITL · Approve
Monitor proposes — user decides.
Every change surfaces as a finding in the inbox. User reviews, ships or rejects. Default mode for new monitors. Auto-rollback armed.
Continuous
Monitor auto-ships — escalates only when needed.
Safe changes ship without approval. Anything that breaches the "I need help" threshold (cluster gap, confidence drop, trade-off) escalates to the inbox. Rollback always available.

Why both modes: the trust ladder. Users start in HITL-Approve, see the monitor working well, and graduate it to Continuous when they trust it. Trust is per-monitor — Accuracy can be Continuous while Cost is still HITL.
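
A sketch of the routing rule the two modes imply, using the same hypothetical names as the config sketch above. Illustrative logic only, not Beam's actual dispatcher.

```ts
// Illustrative: where a monitor's proposed change goes under each mode.
// `Proposal` and `route` are hypothetical names for this sketch.
interface Proposal {
  risk: "low" | "medium" | "high";
  breachesHelpThreshold: boolean; // cluster gap, confidence drop, trade-off
}

function route(mode: "approve" | "continuous", p: Proposal): "inbox" | "auto-ship" {
  if (mode === "approve") return "inbox"; // monitor proposes, user decides
  // Continuous: safe changes ship without approval; threshold breaches escalate.
  return p.breachesHelpThreshold || p.risk === "high" ? "inbox" : "auto-ship";
}
```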

#02 Locked principles

Every variant in this doc is checked against these.

P1
Goals are the spine.
Page keyed off what the user is trying to achieve, not Beam's internal taxonomy.
P2
Scalable to new monitors & surfaces.
New monitor (drift, safety, capability) or new surface (model swap, graph reorg) doesn't require redesign.
P3
Customization first-class.
Each goal has explicit setup — target, mode, verification, rollback, trade-offs.
P4
Per-monitor trust ladder.
Off → Watch → Approve → Continuous set independently per monitor.
P5
Operational tone.
Beam shell at runtime density. No marketing copy.
P6
Every claim has evidence.
Findings show dataset, validation, deltas, risks.
P7
Reversible by default.
Auto-rollback armed; every shipped change has rollback.
P8
Show what Beam is doing.
Pending + in-flight + recent activity all visible.

#03 Flow map

Hot path (thick stroke) = first-time owner setting up the first goal. Clay nodes = open forks. Dashed edges = loops or the alternative spine.

Entry · per-agent › Learning
→ A1 · Empty hub · no goals yet · "set up first goal"
A1 → A2 (click +) · Configure modal · FORK A · 4 variants · target / mode / verify…
A2 → A3 (save) · 1-of-4 · accuracy populated · others "+ Set up"
A3 → B1 (+3 more) · Monitor hub · all 4 monitors live · destination · loop: add another axis
B1 → A4 (click ⋯ on tile) · Edit modal · FORK B · 3 variants
B1 → B2 (Open axis) · Monitor detail · per-monitor page · KPI · findings
ALTERNATIVE SPINE · C1 · Combined Inbox · one unified list of decisions · briefing · findings (monitor × change-type) · activity · watching · synthesis: ship as a sub-nav tab alongside Monitors when 3+ findings pile up

Spine decision
Cockpit vs Inbox is a real spine fork. They answer different first questions ("am I on track?" vs "what needs my call?"). Doc carries both forward — synthesis at #08 ships them as sub-nav tabs.
A
Flow A · Setup
From empty hub to a configured monitor.
A1 empty hub · A2 configure (modal) · A3 1-of-N saved · A4 edit. The first-time owner path. Lives in the Monitors tab.

A1 · Empty hub — locked

First-run state. The hero CTA replaces today's "Tool optimisation" card; the 4 empty axis tiles replace today's "Test & eval / Agent tuning" Coming-soon pillars. Goals not pillars (P1, P2).

Locked
No goals yet · pick the first axis to set up
Locked
Invoice agent › Learning
Search…
⚡ 2.5
Learning hub
Tell Beam what to optimise for. Set up a monitor per metric.
Goals Findings Datasets Tuning Audit
Set what you want this agent to optimise for
A monitor listens to a metric, gathers data, surfaces work, and asks for your help. Pick one to start.
Accuracy
Track how often the agent gets it right.
Most teams start here
Cost
Track $/task & per-tool tokens.
Latency
Track p95/p99; find bottlenecks.
Robustness
Watch silent regressions & clusters.
Answers
"What does Beam want from me to start?"
New components
Hub empty CTA · Monitor tile (empty)
Brief
P1 ✓ P2 ✓

A2 · Configure (setup affordance) — LOCKED on Variant B (modal)

The customization screen that was missing from the dashboard. Same form across every variant — what differs is where the form lives. Locked on Variant B (centered modal) in round 1: "helps focus on configuration." The other variants are kept visible with their reasons so future readers see the alternatives explored.

Variant A
Right-rail drawer (480px) · Beam-native
Reject · was previous lean
Invoice agent › Learning
Search…
⚡ 2.5
Learning hub
Goals Findings Datasets Tuning
Accuracy
Track…
Cost
Track…
Latency
Track…
Robust.
Watch…
Set up Accuracy goal
Tell Beam how accurate this agent should be, how to verify, when it can act on its own.
Target accuracy *
95% by Jul 14
90% 95% 99% Custom…
Mode
Off
Watch only
Human-approve
Beam proposes, you decide.
Continuous
🔒 30d
Verify against
golden_v3 · 412 ex
Auto-rollback
Revert if -5pt on cluster / 24h
Trade-off rules
No cost > $0.05/task
No p95 > +1s
Editable later.
Bet
Context preservation; Beam-native (Tasks/Flow use drawers).
Weakest at
Vertical cramping for the now-meaty 8-section config form — drawer is too narrow.
Brief
P5 ✓ · Rejected in favour of Variant B: user preferred modal for "focus on configuration." Drawer is reused at A4 for quick edits.
Variant C
Dedicated setup page · evidence side panel
Reject · evidence relocated
Invoice agent › Learning › Set up Accuracy
Set up Accuracy goal
Tell Beam how accurate this agent should be, how it should verify, when it can act on its own.
TARGET
95% by Jul 14
90% 95% 99% Custom
MODE
Off
Watch
Approve (rec.)
Continuous
🔒
AUTO-ROLLBACK
Revert if -5pt / 24h
Bet
Live tool scores while configuring — informs target choice (P6).
Weakest at
Click cost; heavy for editing.
Brief
P3 ✓ P6 ✓; P5 mixed (no page-form precedent in Beam)
Variant D
Inline expand — tile replaces itself with form
Reject · ergonomic relocated to A4-B
Invoice agent › Learning
Learning hub
Goals Findings Datasets
Cost
Set up Accuracy
Target
95%
Mode
Off Watch Approve Cont.
Latency
Robust.
Bet
Minimal perceived interruption.
Weakest at
Cramped in 1 tile width; doesn't scale to 4-axis setup back-to-back.
Brief
P3 ✓; P5 mixed (no precedent)

Trade-off table

Variant | Best at | Weakest at | Brief
A · Drawer | Context preservation; Beam-native | Cramped for 8-section config form | P5 ✓
B · Modal | Focus on configuration; room for the meaty form | Heavy for "just toggle mode" edits | P3 ✓ P7 ✓
C · Page | Live evidence side panel | Click cost; heavy for editing | P3 ✓ P6 ✓; P5 mixed
D · Inline | Lightest perceived interruption | Cramped; doesn't scale | P3 ✓; P5 mixed

Pick · Variant B (Modal) · locked round 1
"Helps focus on configuration." The agent's initial lean was Variant A (drawer) for Beam-nativeness, but the user overrode: with the extended config (now 8 sections — target / mode / cost ceiling / verify / signals / help threshold / rollback / notifications) the form is too meaty for a 480px drawer, and a centered modal forces the user to commit to setting it up before returning to the hub.

Reject-or-relocate: C's evidence panel relocated into the drill-in side panel (B2). D's inline ergonomic carried to A4 as a 1-click mode-pill popover for quick edits. Drawer (A) reused at A4 only for partial edits where the modal feels heavy.

A3 · 1-of-4 (between-state) — locked

Between "I clicked save" and "Beam has data" there's a several-minute gap. The Accuracy tile's "Gathering first measurement…" state is the only signal the goal actually landed in a real measurement loop. P8.

Locked
Accuracy populated, others "+ Set up"
Locked
Invoice agent › Learning
Search…
⚡ 2.5
Learning hub
Goals Findings Datasets Tuning Audit
Mission
Reach 95% accuracy by Jul 14, 2026.
Add cost / latency / robustness for more leverage.
Owner
Anna S.
Verify
golden_v3
1 of 4 monitors set up. Add cost or latency so Beam can flag trade-offs. Add another →
Accuracy
Approve
—%
/ 95% goal
Gathering first measurement…
Verify · golden_v3 · 412 ex
In flight · First replay queued · ETA 5m
Cost
Track $/task & tokens.
Latency
Track p95 / p99.
Robustness
Watch silent regress.
Answers
"Did my goal save? What now?"
Key state
"—% · Gathering first measurement… · ETA 5m"
Brief
P8 ✓

A4 · Edit affordance — LOCKED: same modal as setup + inline mode-pill

Edit-existing-configuration uses the same centered modal as A2 setup, populated with current values + an edit-history meta strip. For the most frequent edit (mode change → graduation Approve → Continuous), a 1-click inline mode-pill popover lives on the monitor tile. The wireframe below still shows the right-rail drawer pattern — slated for redraw next round to match the locked A2 modal.

Variant A · needs redraw
Edit modal · same as A2 (currently shown as drawer — redraw pending)
Carry · redraw as modal
Invoice agent › Learning
Search…
Learning hub
Goals Findings Datasets
Acc.
86%
Cost
$0.34
Lat.
4.1s
Robust
2
Edit Accuracy goal
Now 86% / 95%. ▲ +2.4 pt this week.
Last edit: 3d ago · Shipped: 6 · RB: 1
Target
95% by Jul 14
Mode
Off
Watch
Approve (current)
18d stable so far.
Continuous
unlocks in 12d
🔒
Beam: graduate at 30d. You're at 18d.
Verify
golden_v3 · 412 ex
Auto-rollback
Revert -5pt / 24h
Last fired Sat.
Applies to NEW findings.
Bet
Same modal as A2 — populated vs empty. Cognitive model unified.
Weakest at
4-click cost for "toggle mode" — handled by Variant B inline popover below.
Brief
P3 ✓ P7 ✓ · Pending redraw to show modal not drawer.
Variant B · Layer on A
Inline mode-pill popover on the tile
Carry · enhancement
Invoice agent › Learning
Learning hub
Goals Findings
Accuracy
Approve ▾
86%
/ 95%
▲ +2.4 pt
Change mode
Off
Watch
Approve (current)
Continuous
🔒 12d
Cost
Watch
$0.34
Lat.
Watch
4.1s
Robust
Watch
2
Bet
1-click for most common edit (mode graduation Approve → Continuous).
Weakest at
Two patterns to maintain — first-time users hit inconsistency.
Brief
P5 ✓; P3 mixed — only mode is inline, rest goes through drawer
Variant C
Drill-in Settings tab only
Reject
Invoice agent › Learning › Accuracy
← Goals Accuracy
Findings Activity Watching Settings
Goal settings
Target
95%by Jul 14
Mode
Approve
Verify
golden_v3
Rollback
-5pt threshold
Bet
Single source of truth — one place for all axis controls.
Weakest at
3-click depth (hub → drill-in → settings → form) — too slow for mode change.
Brief
P5 ✓; P3 ✗ (not first-class at hub level)
Pick · same modal as A2 + inline mode-pill · locked round 1
Same modal handles both setup and edit — populated with current values, edit-history meta strip at top, mode-graduation hint inline (Variant A). For the most frequent edit — mode change at the trust-ladder moment (Approve → Continuous) — a 1-click inline mode-pill popover lives on the monitor tile (Variant B). Variant C (drill-in Settings tab) rejected: 3-click depth too slow for the most common edit.
B
Flow B · Operate
Live monitors at a glance, drill in when you need to.
B1 monitor hub (4 tiles, mission + summary footer) · B2 monitor detail (per-monitor KPI, scoped findings, side-rail tools). Lives in the Monitors tab.

B1 · Monitor hub (multi-monitor dashboard) — locked · in Figma

Two weeks in. Full mission. All 4 monitors populated. Per-tile opens the edit modal (same as A2).

Locked · destination
All 4 monitors live · summary footer · already at 1322:106 · B1
Locked
Invoice agent › Learning
Search…
⚡ 2.5
Learning hub
Goals Findings Datasets Tuning Audit
Mission
Reach 95% acc before VW go-live, cost <$0.10, p95 <2s.
Owner
Anna S.
Deadline
Jul 14
Budget
$847 / $1.2k
Accuracy
Approve
86%
/ 95% goal
▲ +2.4 pt
Focus · Blocking: extract, match
In flight · 3 findings waiting
Cost
Watch
$0.34
/ <$0.10
▲ +3% w/w
Budget · $847 / $1.2k
In flight · 1 trade-off pending
Latency
Watch
4.1s
/ <2s p95
▼ −0.6s 7d
B'neck · classify, extract
In flight · 1 finding ready
Robust.
Watch
2
silent (7d)
▲ new cluster
Golden · 412 ex · 94% pass
In flight · 1 cluster decide
Across all goals · 7d: 3 waiting · 6 shipped · 1 rolled back · 4 watching · Activity →
Answers
"Am I getting where I'm going on the things I care about?"
Anatomy
4 tiles · identical anatomy · summary footer (not a 5th tile)
Brief
P1 P2 P4 P5 P7 P8 ✓

B2 · Monitor detail (drill-in · Open Accuracy) — locked

Per-monitor full page. KPI + sparkline, scoped findings, side-rail with monitor settings + Focus tools (pulled from live Flow page node-quality). Pending: flatter drill-in variants per round-1 feedback — full-page nav reads as too deep. Three flatter bets to draw next: inline expand-in-place, sticky detail panel under the hub, or "no drill-down" (everything important on the tile).

Locked · full page
/agent/<id>/learning/accuracy
Locked
Invoice agent › Learning › Accuracy
Search…
⚡ 2.5
← Goals Accuracy ● Approve
86%
Goal: 95% by Jul 14
▲ +2.4 pt · +12 pt 30d
95% goal
Findings (3) Activity (4) Watching (1) Settings
Re-tune extract_amount · date cluster
prompt · 47 ex · +0.4 pt
Tune match_vendor · German handwriting
prompt · 21 ex · +0.3 pt
12 HITL tasks · stitch_multipage
feeds memory
Answers
"What's actually happening to accuracy?"
Anatomy
KPI + sparkline · tabs · scoped findings · side-panel of settings & focus tools
Brief
P1 P6 P7 ✓
C
Flow C · HITL Decision
The user's queue — review, ship, reject.
C1 combined inbox (one unified list of decisions) · C2 finding detail (the HITL review screen). Primary surface for HITL-Approve mode. Lives in the Findings tab.

C1 · Combined Inbox — primary · locked round 1

Promoted from alternative spine → primary surface. Confirmed in round 1: "I very much like the inbox. This is also not bad, because this is something we got as feedback in the user interview as well. If there is something, just surface it up." The inbox is where the monitors push things that need the user — it's the partner surface to the monitor hub, not a fallback.

Locked · sub-nav tab
Briefing headline · unified decisions list · recent activity
Locked
Invoice agent › Learning
Search…
⚡ 2.5
Learning hub
Mode: ● Continuous · you approve high-risk only
Inbox Goals Datasets Tuning Audit
This week · May 1–8
Your agent got +2.4 pts more accurate,
no change in cost, 0.6s faster on p95.
4 auto-shipped · 3 queued for you.
Accuracy
86%
▲ +2.4 pt
Cost
$0.34
— no change
p95
4.1s
▼ −0.6s
Decisions waiting · 3 · oldest 3h
Parallelize classify_doc + extract_amount · latency · graph reorg · #142 · 3h
No data dependency. End node accepts partial failures.
p95 4.1→2.6s · acc ─ · cost ─
Re-tune extract_amount · date cluster · accuracy · prompt · #141 · 18h
47 failures on non-ISO dates. New prompt accepts ISO/DD.MM.YY/MM/DD.
cluster 78→94% · overall 86→86.4%
Swap classify_doc · GPT-4o → Haiku · cost · model swap · ⚠ trade · #140 · 1d
cost −12%, acc −1.1pt. Beam: don't ship.
cost −12% · acc −1.1pt
Recent activity · 4 shipped · 1 rolled back
Mon · auto · accuracy · parse_date · prompt · 78→94% on cluster
Sat · ↺ · cost · classify_doc · swap rolled back · −11% acc on multi-currency
Answers
"What needs my call right now?"
When it wins
Multi-axis trade-offs (one card carries both deltas); autopilot day
Brief
P6 P7 P8 ✓

Cockpit vs Inbox — when each wins

Use case | Cockpit (B1) | Inbox (C1)
"Am I on track?" | Wins | Briefing helps, scroll needed
"What needs my call?" | Pending buried in tile | Wins — list IS the page
Multi-axis trade-offs | Span 2 tiles awkwardly | Wins — one card, both deltas
First-time owner | Wins — empty axes guide setup | Empty inbox is a dead end
Autopilot day | Tiles feel static | Wins — Activity is the story

Pick · both ship as primary · locked round 1
Monitor hub and Inbox are partner surfaces, not "main + alternative." Hub is "what am I tracking and how is each monitor doing." Inbox is "what needs my call right now." Both reachable as sub-nav tabs from the per-agent Learning route. Hub's summary footer deep-links to Inbox; Inbox stat tiles link back to Hub per monitor.

C2 · Finding detail — 3 variants · awaiting pick

What happens when the user clicks a finding from the C1 inbox to review it. This is the HITL decision screen — where the user actually says ship / reject. The form is the same in every variant (diff + validation + decision bar); what differs is where the screen lives and how it interrupts the inbox.
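
Sketched as a data shape, the payload every variant renders is identical. Field names here are hypothetical, not Beam's schema; the values echo finding #142 in the variants below.

```ts
// Hypothetical shape of one reviewable finding (same payload in all 3 variants;
// only the surface it renders in differs). Values echo #142 below.
interface FindingDetail {
  id: number;                                   // e.g. 142
  monitor: "accuracy" | "cost" | "latency" | "robustness";
  changeType: "prompt" | "model swap" | "graph reorg";
  diff: { before: string; after: string };      // proposed change, rendered as DIFF
  impact: { metric: string; from: string; to: string }[]; // predicted deltas
  validation: string[];                         // e.g. "golden_v3 · 412 examples · 0 regressions"
  risk: "low" | "medium" | "high";              // medium and up shows the risk strip
  autoRollbackArmed: boolean;                   // e.g. armed at −5pt
  decision?: "ship" | "reject";                 // output of the decision bar
}
```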

Variant A
Full-screen review page · dedicated URL
In play
Invoice agent › Learning › Findings › #142
Search…
← Findings · Parallelize classify_doc + extract_amount · latency · graph reorg · #142 · 3h ago
Proposed change
before
fetch_pdf → classify_doc → extract_amount → end
after
fetch_pdf → ┬ classify_doc ┐ → end
           └ extract_amount ┘
Predicted impact
p95 latency
4.1s → 2.6s
▼ −1.5s
accuracy
86% → 86%
no change
cost/task
$0.34 → $0.34
no change
Validation
golden_v3 · 412 examples · 0 regressions
7-day replay · 1,247 production runs · 0 partial-failure issues
Risk · medium · graph topology change · auto-rollback armed at −5pt
3 more findings waiting · this one is highest impact
Bet
Maximum room for evidence — diff + impact + validation all visible at once.
Weakest at
Click cost. Back-and-forth between inbox & detail = lots of nav.
Brief
P5 ✗ · too deep per your "I don't like clicking too deep" feedback.
Variant B
Slide-over drawer (50% width) · inbox stays visible
Carry · likely pick
Invoice agent › Findings
Search…
Inbox · 3 pending
Parallelize classify_doc… · #142
latency · graph reorg · 3h
Re-tune extract_amount… · #141
accuracy · prompt · 18h
Swap classify_doc… · #140 · ⚠
cost · model swap · 1d
Parallelize classify_doc + extract_amount #142 · ✕
latency graph reorg
DIFF
fetch → classify → extract → end
fetch → ┬ classify ┐ → end
       └ extract ┘
IMPACT
p95 4.1s → 2.6s ▼ · acc — · cost —
VALIDATION
✓ golden_v3 · 0 regressions
✓ 7d replay · 1,247 runs
⚠ Risk: medium · auto-rollback armed
Bet
Flat. Inbox stays visible left, decision happens on the right. Closing the drawer drops you back at the inbox with no nav.
Weakest at
Less room for evidence than a full page; reviewer needs to scroll inside the drawer for long validation lists.
Brief
P5 ✓ · Matches your "don't make me click too deep" feedback.
Variant C
Expand-in-place · other inbox rows collapse
In play
Invoice agent › Findings
Search…
Findings · 3 pending
#141 Re-tune extract_amount · accuracy · prompt · 18h · click to review
Parallelize classify_doc + extract_amount · latency · graph reorg · #142 · 3h
before / after
fetch → classify → extract → end
fetch → ┬ classify ┐ → end
impact
p95 4.1s → 2.6s
acc · cost unchanged
✓ golden_v3 · 0 regress
#140 Swap classify_doc · cost · model swap · ⚠ trade-off · 1d · click to review
Bet
No page nav at all. Pure progressive disclosure inside the inbox.
Weakest at
Cramped — the diff + impact need to fit in 1 row's width. Long validation lists scroll the expanded row.
Brief
P5 ✓ · Flattest of the three. P6 mixed — less room for evidence.
Lean · Variant B (drawer)
Inbox stays visible on the left so you don't lose your place in the queue; the right pane has enough room for the diff + impact + validation without scroll. A and C are the polar ends — A maximises evidence room (but is deep), C maximises flatness (but cramps the evidence). B is the calibrated middle. Awaiting your pick.
D
Flow D · Continuous
Monitor auto-ships, escalates only when needed.
D1 autonomous visibility — what the user sees when no human decision is required. Safe changes ship without approval; anything past the "I need help" threshold escalates to the Findings inbox.

D1 · Continuous mode — autonomous visibility

When a monitor is in Continuous mode, the user isn't blocked — the monitor ships safe changes itself. But the user still wants visibility into what's happening. This is the monitor's surface when no human decision is required.

Single wireframe · concept
Continuous monitor · live activity stream + escalation strip
In play
Invoice agent › Learning › Accuracy
Search…
← Monitors Accuracy ● CONTINUOUS · autonomous
⚠ Needs your call · 1 trade-off finding escalated · classify_doc model swap (cost ↓12% · acc ↓1.1pt)
86%
Goal: 95% · Continuous since Apr 19 (20d)
▲ +2.4 pt 7d · +12 pt 30d
Auto-shipped last 24h
4 changes · 0 rollbacks · View 30d
04:12 · auto · prompt · parse_date_field · re-tuned on date cluster · 78 → 94% on cluster · 0 reg
02:47 · auto · prompt · match_vendor · German handwriting prompt · +0.3 pt overall · 21 ex
Yesterday · memory · absorbed 12 HITL corrections → stitch_multipage
Yesterday · prompt · fetch_pdf · retry policy update · −3% transient errors
Signal inflow · 1,247 production runs · 47 HITL corrections · 412 golden replays · last 24h · Tune signal sources →
Anatomy
Escalation strip (top) · KPI + sparkline · auto-shipped activity stream · signal inflow footer.
Bet
Even in continuous mode the user wants to glance at what happened. Activity stream is the answer. Escalation strip is the only time the monitor demands attention.
Brief
P7 ✓ P8 ✓ · Every shipped change has rollback inline.
E
Flow E · Datasets
Verification & training data the monitors use.
E1 index — three sources: production (promoted runs), upload (JSONL / CSV), synthetic (rule-generated edge cases). Each monitor picks a dataset to verify against in its config.

E1 · Datasets — verification & training data

The substrate the monitors verify against. Three sources: production (promoted runs), upload (JSONL / CSV), synthetic (rule-generated edge cases). Each monitor picks a dataset to verify against in its config (see A2 modal · Verify field).
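
One row of the index, sketched as a record; names are illustrative, not a real schema.

```ts
// Hypothetical record behind one row of the E1 datasets table.
interface Dataset {
  name: string;                                  // e.g. "golden_v3"
  source: "production" | "upload" | "synthetic"; // the three sources above
  examples: number;                              // e.g. 412
  passRate: number;                              // e.g. 0.94
  usedBy: string[];                              // monitors that verify against it
  autoRefresh?: { window: string };              // e.g. production_7d rolling window
}

const golden_v3: Dataset = {
  name: "golden_v3",
  source: "upload",
  examples: 412,
  passRate: 0.94,
  usedBy: ["accuracy", "robustness"],
};
```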

Single wireframe · concept
Datasets index · table with sources + usage
In play
Invoice agent › Learning › Datasets
Search datasets…
Datasets
Beam runs replay validation against these before shipping changes. Build new ones from production runs, upload, or synthetic edges.
Monitors Findings Datasets Audit
All sources Production Upload Synthetic
Name | Source | Examples | Pass rate | Used by
golden_v3 · curated invoices | upload | 412 | 94% | Accuracy · Robust.
production_7d · rolling window | production | 1,247 | 86% | Accuracy · Cost · Lat.
date_format_cluster · failures | production | 47 | 78% → 94% | Accuracy
multi_currency_edges · synthetic | synthetic | 200 | 62% ⚠ | Robustness
vw_handwriting · field captures | upload | 89 | 72% | Accuracy
5 datasets · 1,995 total examples · last updated 4m ago (production_7d auto-refreshes)
Anatomy
Sub-nav · source filter chips · "+ Build dataset" CTA · table with name / source / examples / pass rate / used-by / open
Bet
Show usage. Each dataset says which monitors reference it — that's how the user knows which to curate vs leave alone.
Brief
P6 ✓ · Datasets ARE the evidence substrate.
F
Flow F · Audit
Immutable history — read-only, exportable.
F1 log — every change, every signal, every rollback. The page someone reaches when they need to answer "what happened, who did it, when, and can we revert it?"

F1 · Audit log — immutable history

Every change, every signal, every rollback. Read-only. Exportable. Filterable. This is the page someone reaches when they need to answer "what happened, who did it, when, and can we revert it?"
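
A sketch of the event record one audit row could deserialize to; the append-only semantics live in the comments, and the names are hypothetical throughout.

```ts
// Hypothetical immutable audit event: one row of the F1 table.
// Append-only: events are never edited in place; a rollback is a new event.
interface AuditEvent {
  at: string;                              // timestamp, e.g. "Sat 14:33"
  monitor: "accuracy" | "cost" | "latency" | "robustness";
  actor: { kind: "beam-auto" | "human"; name?: string }; // e.g. Anna S., Marc K.
  action: "ship" | "rollback" | "approve" | "config-edit" | "signal";
  detail: string;                          // e.g. "classify_doc model swap reverted"
  effect?: string;                         // e.g. "−11% acc on multi-currency"
  diffRef?: string;                        // pointer for expand-for-diff (▾)
}
```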

Single wireframe · concept
Audit log · filterable event feed with diff preview
In play
Invoice agent › Learning › Audit
Search by tool / ID…
Audit log
Every change is recorded here · read-only · exportable to CSV / JSONL.
Monitors Findings Datasets Audit
Last 30d By monitor ▾ By type ▾ By actor ▾
When | Monitor | Actor | Action | Effect
04:12 today | Accuracy | Beam · auto | Ship · prompt rewrite on parse_date_field | 78 → 94% cluster
02:47 today | Accuracy | Beam · auto | Ship · prompt rewrite on match_vendor | +0.3 pt overall
Yesterday 18:22 | Accuracy | Anna S. | Config edit · mode Approve → Continuous | graduation @ 30d stable
Sat 14:33 | Cost | Beam · auto | Rollback · classify_doc model swap reverted | −11% acc on multi-currency
Sat 14:21 | Cost | Marc K. | Approve · classify_doc · GPT-4o → Haiku | cost −12%
Fri 11:08 | Robust. | Anna S. | Signal · new cluster detected: multi-currency | 8% of traffic
Showing 6 of 247 events · scroll for more · expand any row (▾) for full diff and signal trail
Anatomy
Sub-nav · filter chips (time / monitor / type / actor) · Export · table with when / monitor / actor / action / effect / expand-for-diff
Bet
Same row shape as the Findings activity table — visual continuity, just more columns and filterable.
Brief
P7 ✓ · Reversibility = visible rollback history; P8 ✓ · everything Beam did, visible. Updated: actor column now shows avatar + name (Anna S., Marc K.) — Beam · auto rows use a filled dark avatar so it's distinguishable at a glance.

#15 Metrics philosophy — accuracy isn't enough

From round-2 sparring: "Accuracy is a good metric, but initially when the agent is trained on a small use case, its accuracy is gonna be high. Then as we increase the scope of the agent… its accuracy will go down, even though the agent became better."

Right. Bare accuracy rewards staying narrow. An agent at 95% on 5 task types looks "better" than one at 80% on 50 — but the second is doing 8× the work. Three ways the UI can honor this:

Option | What it does | Cost | Lean
A · Pair accuracy with coverage inline | Every accuracy number shows what it's a percentage of: 86% · 412 ex · 47 distinct task types · 8% un-categorised. Denominator always visible. | Free — just copy changes. | Yes — prevents the trap immediately
B · Coverage as a first-class monitor | Add a 5th monitor type. User sets a goal: "handle 80% of production task types · grow by +5 task types/month." Scope expansion becomes a deliberate act, surfaced as findings. | Medium — adds one tile + one config option + new finding type. | Yes — closes the gap
C · Replace accuracy KPI with a 2-axis chart | On the Accuracy drill-in: scatter of accuracy-per-task-type × task-type-frequency. User sees "95% on common, 60% on long tail." | Expensive — needs real-data visualization. | Carry · for B2 drill-in next round

Proposed direction
Adopt A + B together. A is free and fixes the framing immediately. B promotes Coverage to a first-class monitor — connecting directly to Datasets (via E4, "use this dataset to teach the agent a new task type"). Coverage findings: "new task type detected — multi-currency invoices (8% of traffic, 0 examples in any dataset). Add to scope?" — different glyph from accuracy / cost findings.
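
To make the trap and the Option A fix concrete, a small sketch with the numbers from the quote and tile copy above; the helper name is hypothetical.

```ts
// Why bare accuracy misleads: compare correct answers delivered, not just the rate.
// Assumes equal traffic per task type, purely for illustration.
const narrow = { accuracy: 0.95, taskTypes: 5 };  // looks "better" on the KPI
const broad  = { accuracy: 0.80, taskTypes: 50 }; // actually does ~8.4x the correct work

const correctWorkRatio =
  (broad.accuracy * broad.taskTypes) / (narrow.accuracy * narrow.taskTypes); // ≈ 8.4

// Option A: always render the denominator next to the rate.
function accuracyFooter(exCount: number, taskTypes: number, uncategorisedPct: number): string {
  return `measured on ${exCount} ex · ${taskTypes} task types · ${uncategorisedPct}% un-categorised`;
}
// accuracyFooter(412, 47, 8) → "measured on 412 ex · 47 task types · 8% un-categorised"
```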

How a Coverage monitor reads on the hub

Sketch · proposed addition to A1 / B1
Coverage tile (proposed 5th monitor)
In play
Accuracy APPROVE
86%
/ 95% goal
▲ +2.4 pt 7d
measured on 412 ex · 47 task types · 8% un-categorised
Coverage WATCH
47
task types · 92% of traffic
▲ +5 types 30d
scope gap · 8% of traffic in 3 unknown task types · Triage →
Anatomy
Accuracy tile (left) now carries a "measured on" footer with denominator context. Coverage tile (right) is the proposed new monitor — a separate measure of "how much the agent can do."
Bet
Accuracy + Coverage answer two different questions — "how often right?" and "how broadly?" — together they capture the real direction of travel.
Brief
Closes the philosophy gap. Adds 1 monitor type, 1 finding type ("new task type detected"), 1 connection (Coverage → Datasets via E4).

#16 A4 · Edit existing monitor — modal redraw (replaces S5 drawer)

Same centered modal as A2 (locked round 1) — populated with current values, edit-history strip at the top, mode-graduation hint inline, plus a small Delete monitor action in the footer that A2 setup doesn't need.

Locked · modal pattern matches A2
Edit Accuracy monitor · populated
Locked
Invoice agent › Learning
Search…
Edit Accuracy monitor
Currently 86% · goal 95% · ▲ +2.4 pt this week.
Last edited 3d ago · Anna S. · Shipped since: 6 · Auto-rollbacks: 1
TARGET ACCURACY
95% by Jul 14, 2026
MODE
Off Watch only Human-approve (current) Continuous 🔒 12d
Beam: graduate to Continuous at 30d stable. You're at 18d.
COST CEILING
$0.10/task · hard cap
VERIFY AGAINST
golden_v3 · 412 examples · 94% pass
SIGNAL SOURCES
Production tasks
HITL corrections
Replay vs golden set
Synthetic edge cases
"I NEED HELP" THRESHOLD
Confidence drop > 10pt on any tool
New failure cluster > 5% of traffic
AUTO-ROLLBACK
Revert if -5pt on cluster within 24h
Last fired Sat (classify_doc model swap, multi-currency cluster).
What's new vs A2
Edit-history meta strip (last edited + actor avatar + shipped since + rollbacks) · "Delete monitor" action in footer (destructive, left side) · mode-graduation hint inline · "Save changes" instead of "Save & start listening."
Bet
Same modal pattern as setup — populated vs empty. One mental model, two states.
Brief
P3 ✓ P5 ✓ P7 ✓ · Replaces the stale drawer redraw from S5.

#17 E2 · Dataset detail — examples + usage + pass rate

Click into a dataset from E1. Three sub-tabs: Examples (the rows), Used by (which monitors verify against it), Pass rate (per-monitor accuracy on this dataset over time).

Single wireframe · concept
Dataset detail · default to Examples tab
In play
Invoice agent › Learning › Datasets › golden_v3
Search examples…
← Datasets · golden_v3 · upload · 412 examples
Examples (412) Used by (3 monitors) Pass rate
All 412 · Must-pass 47 · Edge cases 89 · Currently failing 18
ID | Input snippet | Expected | Status
#1247 | Invoice 4827 · DE · DD.MM.YY date format · €647.50 | amount: 647.50 | ✓ pass
#1248 | VW PO #44831 · German handwriting · multi-line | vendor: VW | ✗ fail
#1249 | Multi-currency · USD $200 + EUR €180 in same line | flag for review | ✗ fail
#1250 | Standard ISO date · $1,243.00 · single line | amount: 1243.00 | ✓ pass
Showing 4 of 412 · scroll for more · 394 pass · 18 fail · 94% rate
Anatomy
3 sub-tabs · filter chips (must-pass / edge / failing) · table of examples with pass/fail status · side-rail with Used by + Provenance
Connects to
E3 · "+ Add examples" CTA → curate flow. E4 · "Use this dataset →" CTA → validate / train / measure flow.
Brief
P6 ✓ · Evidence-substrate is visible.

#18 E3 · Curate — build / add examples to a dataset

Add examples to a dataset from three sources: promote from production runs (most common — user sees a failure in the inbox, decides "we should test for this"), upload JSONL/CSV, or generate synthetic edges. Per example: label, must-pass flag, edge-case flag.
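
The per-example annotation, sketched as a record; field names are hypothetical.

```ts
// Hypothetical annotation attached to each example when curating (E3).
interface CuratedExample {
  fromRunId?: string;      // set when promoted from a production run, e.g. a run ID
  label: string;           // expected output, e.g. "flag for review"
  mustPass: boolean;       // a regression here should block shipping a change
  edgeCase: boolean;       // counts toward robustness / cluster coverage
}
```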

Single wireframe · concept
Curate · promote-from-production flow (most common)
In play
Invoice agent › Learning › Datasets › golden_v3 › Add
Add examples to golden_v3
Curate the dataset that drives Accuracy + Robustness verification.
From production Upload JSONL/CSV Generate synthetic
Filter production runs: Failed (47) · Low confidence · HITL corrected · Multi-currency
ID | Production run | Outcome | Mark as…
#48,221 | Multi-currency · USD $200 + EUR €180 | failed · low conf | must-pass · edge
#48,234 | VW PO · German handwriting · MM/DD date | failed · HITL fix | must-pass · edge
#48,290 | Mixed currency · GBP £450 standard | failed · timeout | must-pass · edge
2 selected · adding to golden_v3
Bet
The fastest curation path is "I just saw this fail in production · add it to the dataset so it never fails silently again." That's row 1.
Connects to
From C2 (finding detail) · "Add example to dataset" action lands here pre-populated. From inbox / failed tasks · same flow.
Brief
P6 ✓ · evidence-substrate maintained by the user.

#19 E4 · Use this dataset — 3 actions

From E2's "Use this dataset →" CTA. A dataset can be used three different ways:

Action | What it does | When
Validate | Run replay now — measure pass rate without changing anything | Sanity-check after a config change. Or "is this dataset still relevant?"
Train | Use as a signal source for a monitor — Beam learns from these examples going forward | You've added new must-pass examples and want the agent to internalize them.
Measure | Calculate per-dataset metrics on the relevant monitor (shows up as a sub-section in B2 drill-in) | "How is my agent doing on multi-currency specifically?"
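
The three jobs sketched as a discriminated union, to show that each action carries different parameters; names are hypothetical.

```ts
// Hypothetical: the three ways a dataset can drive the agent (E4 picker).
type DatasetUse =
  | { action: "validate" }                              // run replay now, no agent changes
  | { action: "train"; targetMonitor: string }          // add as a monitor's signal source
  | { action: "measure"; refresh: "daily" | "weekly" }; // track as a sub-metric in B2 drill-in

// Matches the "Already in use" footer below.
const inUse: DatasetUse[] = [
  { action: "validate" },
  { action: "train", targetMonitor: "accuracy" },
  { action: "measure", refresh: "daily" },
];
```
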
Single wireframe · concept
"Use this dataset" action picker
In play
Invoice agent › Datasets › golden_v3 › Use
Use golden_v3 · 412 examples · 47 task types
Pick how this dataset should drive the agent.
VALIDATE
Run replay now
Measure pass rate today against the current agent. No agent changes.
ETA · 2 min
TRAIN
Use as a signal source
Add this dataset to a monitor's signal sources — the agent learns from these going forward.
Target monitor · Accuracy ▾
MEASURE
Track as a sub-metric
Surface accuracy on this specific dataset inside the monitor's drill-in (e.g. "accuracy on multi-currency"). Continuous tracking.
Refresh · daily
Already in use: Validate (3 monitors) · Train (Accuracy) · Measure (Accuracy > multi-currency)
Bet
Three cards = three distinct jobs. User picks based on intent, not action menu.
Closes the loop
This is the answer to "what do I do with these datasets?" — datasets are inert until you Validate, Train, or Measure with them.
Brief
P3 ✓ P6 ✓ · Datasets become first-class inputs to the monitor.