Evaluation Studio live demo

Weighted rubrics, calibrated judges, gates, traces, and consumer inboxes.

A complete read-only evaluation path: rubric editor, confidence bands, concordance, bias and cost gates, deploy check, multi-turn traces, and routed review work.

rubric pass rate

94%

median scorecard outcome

judge confidence

91%

median confidence band

human concordance

89%

reviewer agreement signal

cost per call

$0.0008

current run

Read-only surfaces

Production workflow states, evidence, and owners at a glance.

Trace-led evaluation run

The central object is the run trace: prompt, context, tool call, answer, judge score, and human override.

decision-ready

evaluation run

Support assistant release candidate

reviewer override queued

Prompt

01

Refund exception with hostile user tone

input locked

Context

02

Policy block 7.4 and account tenure

retrieved

Tool call

03

orders.lookup + credit.limit

verified

Answer

04

Offer partial credit with escalation path

scored

Judge score

05

Safety pass, tone warning, cost pass

89/100

Human override

06

Require empathy rewrite before ship

applied
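
As a rough illustration, the six trace steps above could be held as an ordered list of records. This is a minimal sketch, not the product's data model: the names `TraceStep`, `RUN_TRACE`, `is_ordered`, and `has_applied_override` are hypothetical, and only the step labels, summaries, and statuses come from the page.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TraceStep:
    """One step of a run trace. Hypothetical shape, assumed for illustration."""
    index: int     # position shown on the page (01..06)
    kind: str      # step label, e.g. "Prompt" or "Judge score"
    summary: str   # one-line description shown on the step card
    status: str    # badge text shown on the step card


# Values copied from the demo page; the structure itself is an assumption.
RUN_TRACE: List[TraceStep] = [
    TraceStep(1, "Prompt", "Refund exception with hostile user tone", "input locked"),
    TraceStep(2, "Context", "Policy block 7.4 and account tenure", "retrieved"),
    TraceStep(3, "Tool call", "orders.lookup + credit.limit", "verified"),
    TraceStep(4, "Answer", "Offer partial credit with escalation path", "scored"),
    TraceStep(5, "Judge score", "Safety pass, tone warning, cost pass", "89/100"),
    TraceStep(6, "Human override", "Require empathy rewrite before ship", "applied"),
]


def is_ordered(trace: List[TraceStep]) -> bool:
    """True when step indexes run 1, 2, 3, ... with no gaps."""
    return [step.index for step in trace] == list(range(1, len(trace) + 1))


def has_applied_override(trace: List[TraceStep]) -> bool:
    """True when the trace contains a human override marked as applied."""
    return any(
        step.kind == "Human override" and step.status == "applied"
        for step in trace
    )
```

Keeping the steps in a single ordered list means a reviewer can read the judge score and the human override next to the prompt, context, and tool call that produced them.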

criterion         score   gate
Grounded answer   94      pass
Policy safety     88      review
Refusal quality   91      pass
Cost ceiling      97      pass
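
One way to read the table above is as a per-criterion threshold check. The sketch below assumes a single pass threshold of 90, chosen only because it is consistent with the four rows shown; the page does not state the actual thresholds, and `CriterionResult`, `gate_for`, and `gates` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed threshold: not stated on the page. 90 is merely consistent
# with the four rows in the table (88 -> review, 91/94/97 -> pass).
PASS_THRESHOLD = 90


@dataclass(frozen=True)
class CriterionResult:
    criterion: str
    score: int


def gate_for(result: CriterionResult, pass_threshold: int = PASS_THRESHOLD) -> str:
    """Return "pass" at or above the threshold, otherwise "review"."""
    return "pass" if result.score >= pass_threshold else "review"


def gates(results: List[CriterionResult]) -> Dict[str, str]:
    """Map each criterion name to its gate outcome."""
    return {r.criterion: gate_for(r) for r in results}


# Scores copied from the table on the page.
RESULTS = [
    CriterionResult("Grounded answer", 94),
    CriterionResult("Policy safety", 88),
    CriterionResult("Refusal quality", 91),
    CriterionResult("Cost ceiling", 97),
]
```

Under this assumption, only "Policy safety" falls to review, which matches the gate column in the table.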

queue

12

overrides

3

concordance

89%

override strip

Tone warning accepted

The human reviewer keeps the safety pass, rewrites the empathy language, and sends the signed scorecard to release review.

Demo path

Follow the operating loop from first signal to reusable proof.

Inspect the work, the gate, the owner, and the record that remains after every decision.

01

Define

Create the scorecard, weights, judge prompt, and acceptance thresholds.

02

Run

Score model outputs, traces, and multi-turn cases against the rubric.

03

Review

Send uncertain cases to the right inbox with context attached.

04

Gate

Attach the result to release review and deploy checks.
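
The four steps above can be sketched in a few functions. This is a hedged illustration rather than the product's implementation: the weighted-mean scoring, the confidence-based routing rule, and every name (`Criterion`, `Rubric`, `weighted_score`, `evaluate_case`, `release_gate`) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float


@dataclass(frozen=True)
class Rubric:
    """Define: criteria with weights plus acceptance thresholds (assumed shape)."""
    criteria: Tuple[Criterion, ...]
    acceptance_threshold: float  # minimum weighted score for a pass
    min_confidence: float        # below this, the case goes to human review


def weighted_score(rubric: Rubric, scores: Dict[str, float]) -> float:
    """Run: weighted mean of per-criterion scores."""
    total_weight = sum(c.weight for c in rubric.criteria)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    weighted_sum = sum(c.weight * scores[c.name] for c in rubric.criteria)
    return weighted_sum / total_weight


def evaluate_case(rubric: Rubric, scores: Dict[str, float], judge_confidence: float) -> str:
    """Review: route uncertain cases to a reviewer, otherwise pass or fail."""
    if judge_confidence < rubric.min_confidence:
        return "review"
    score = weighted_score(rubric, scores)
    return "pass" if score >= rubric.acceptance_threshold else "fail"


def release_gate(outcomes: Iterable[str]) -> bool:
    """Gate: allow release only when every case outcome is a pass."""
    return all(outcome == "pass" for outcome in outcomes)
```

For example, with weights 2 and 1 and scores 90 and 60, `weighted_score` returns 80.0; a case with low judge confidence is routed to review regardless of its score, and a single non-passing case keeps the release gate closed.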

Rubric Studio walkthrough

From rubric draft to model scorecard contribution.

This block mirrors the shipped PR #1 path: author a rubric, create an AI draft, get expert approval, send work to grading, and write the contribution used by scorecards.

Coming with QA Review / Adjudication

Coming with Scorecards

Coming with Exports

Read-only PR #1 path

Model output safety rubric

seeded path

Author rubric

Name the task type, domain, risk level, and first criteria.

PR #1 live

AI draft

Generate a draft with warnings and review mode attached.

PR #1 live

Expert approval

AI-drafted rubrics stay blocked until an expert approves them.

PR #1 live

Worker grading

Grade model output criterion by criterion with evidence gates.

PR #1 live

Scorecard contribution

A submitted grade writes the contribution used by scorecards.

PR #1 live
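
The five walkthrough stages above read like a linear workflow in which an AI-drafted rubric cannot reach grading without expert approval. A minimal sketch of that rule as a state machine follows; the stage names mirror the page, while `Stage`, `ALLOWED`, `BlockedTransition`, and `advance` are hypothetical and not taken from the shipped code.

```python
from enum import Enum


class Stage(Enum):
    AUTHORED = "Author rubric"
    AI_DRAFT = "AI draft"
    APPROVED = "Expert approval"
    GRADING = "Worker grading"
    CONTRIBUTED = "Scorecard contribution"


# Assumed linear ordering: each stage may only move to the next one.
ALLOWED = {
    Stage.AUTHORED: {Stage.AI_DRAFT},
    Stage.AI_DRAFT: {Stage.APPROVED},
    Stage.APPROVED: {Stage.GRADING},
    Stage.GRADING: {Stage.CONTRIBUTED},
    Stage.CONTRIBUTED: set(),
}


class BlockedTransition(Exception):
    """Raised when a rubric tries to skip a required stage."""


def advance(current: Stage, target: Stage) -> Stage:
    """Move to the target stage, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise BlockedTransition(
            f"cannot move from {current.value!r} to {target.value!r}"
        )
    return target
```

In this sketch, `advance(Stage.AI_DRAFT, Stage.GRADING)` raises `BlockedTransition`, which is one way to express that AI-drafted rubrics stay blocked until an expert approves them.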