cachet — confidence-scored merge gating for agent-written code

// the model

Three nouns. No "pull requests."

A change is a slice. A slice is reviewed in segments. The gate returns one verdict. That is the whole vocabulary.

slice

A unit of change you open (commit, push). cachet's "PR." Carries a score and a list of what it's waiting on.

segment

A reviewable sub-chunk of a slice. Segments are reviewed in parallel; each finding names one segment, so you fix and re-vet only that one.

gate

The single decision: merge · revise · blocked · review · waiting. Aggregates confidence, CI and every reviewer into one verdict.
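
A minimal sketch of how the slice/segment split could support re-vetting only what changed. Everything here is an assumption for illustration: the class and field names, the `seg_b`-style ids, and the use of a content hash to spot a changed segment are not cachet's actual data model.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Segment:
    segment_id: str                      # e.g. "seg_b"; id format is assumed
    content: str                         # the diff text this segment covers
    findings: list[str] = field(default_factory=list)

    def fingerprint(self) -> str:
        # Hypothetical change detector: hash of the segment's content.
        return hashlib.sha256(self.content.encode("utf-8")).hexdigest()


@dataclass
class Slice:
    slice_id: str
    confidence: str                      # a band such as "mid" or "high"
    waiting_on: list[str]                # e.g. ["ci", "review:coderabbit"]
    segments: dict[str, Segment] = field(default_factory=dict)


def segments_to_revet(before: Slice, after: Slice) -> list[str]:
    """Return the ids of segments that are new or whose content changed."""
    changed = []
    for seg_id, seg in after.segments.items():
        old = before.segments.get(seg_id)
        if old is None or old.fingerprint() != seg.fingerprint():
            changed.append(seg_id)
    return sorted(changed)
```

With two segments and a fix pushed to one of them, only that segment's id comes back; the untouched segment keeps its earlier review.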

// the cli

The agent's whole life runs through one binary.

One command opens the slice, follows it, and lands it. The exit code is the verdict, so an agent branches on it directly, with no jq-and-while pipeline to babysit.

# open from the pushed ref, stream progress, land on a clean verdict
$ git push && cachet slice open --follow
  slice_8f3 · confidence mid · waiting: ci, review:coderabbit · bar: high
  ✓ ci.passed                47s
  ! review:coderabbit        1 finding on seg_b · confidence mid→low
  → fix seg_b, git push      re-vets only the changed segment
  ✓ resolved · confidence high · clears bar · floors pass
  ✓ merge → queue → landed on main    0 humans · signed + logged

# exit code = verdict. want per-event hooks? `cachet slice watch --json` streams NDJSON.

exit	verdict	the agent does
0	merge / landed	done: admitted to the queue, or landed on main
10	revise	fix actionable findings (or recover from a queue eviction), push, re-vet
11	blocked	hard floor failed (secret, security); must change
12	review	human required; escalate, stop
13	waiting	CI or reviews still outstanding; keep watching

fix is edit + git push. no patch command. pushing to the slice re-vets only the segments that changed.
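
A sketch of an agent branching on the exit code, assuming the agent is driven from Python. The exit codes and the `cachet slice open --follow` invocation are copied from the text above; the action names and the handling of unknown codes are hypothetical.

```python
import subprocess

# Exit codes as listed in the table; the action labels are illustrative only.
ACTIONS = {
    0: "done",            # admitted to the queue, or landed on main
    10: "revise",         # fix findings, push, re-vet
    11: "blocked",        # hard floor failed; the change itself must change
    12: "escalate",       # human required; stop
    13: "keep-watching",  # CI or reviews still outstanding
}


def next_action(exit_code: int) -> str:
    """Map a cachet exit code to what the agent does next."""
    # Assumption: any code outside the table is treated as an error.
    return ACTIONS.get(exit_code, "unknown")


def open_and_follow() -> str:
    """Run the command from the transcript and return the next action."""
    result = subprocess.run(["cachet", "slice", "open", "--follow"], check=False)
    return next_action(result.returncode)
```

`check=False` matters here: a non-zero exit is a verdict, not a failure, so it should not raise.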

// scoring

Honest about what a model can and can't judge.

cachet reports confidence in plain bands (None, Low, Mid, High, Excellent), calibrated against real outcomes: "High" means the change lands clean about 90% of the time, measured by reverts and incidents. Most of the risk signal is computed; the model judges the soft part; the hard cases escalate. (A continuous value stays under the hood for ordering and config.)
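
One way to picture band calibration, as a sketch rather than cachet's method: bucket past raw scores, measure how often each bucket landed clean, and name the band from that measured rate. Only the 0.90 cutoff for High comes from the text; the other cutoffs, the five equal-width buckets, and the function names are assumptions.

```python
from bisect import bisect_right

BANDS = ["None", "Low", "Mid", "High", "Excellent"]
# Minimum measured clean-landing rate for Low, Mid, High, Excellent.
# Only 0.90 (High) is from the text; the rest are placeholder values.
CUTOFFS = [0.50, 0.75, 0.90, 0.98]


def band_for_rate(clean_rate: float) -> str:
    """Name the band whose cutoff the measured rate reaches."""
    return BANDS[bisect_right(CUTOFFS, clean_rate)]


def bucket(raw_score: float, n_bins: int) -> int:
    """Index of the equal-width bucket that a raw score in [0, 1] falls in."""
    return min(int(raw_score * n_bins), n_bins - 1)


def calibrate(history, n_bins: int = 5) -> list[str]:
    """history: (raw_score, landed_clean) pairs. Returns one band per bucket."""
    clean = [0] * n_bins
    total = [0] * n_bins
    for raw_score, landed_clean in history:
        b = bucket(raw_score, n_bins)
        total[b] += 1
        clean[b] += int(landed_clean)
    # A bucket with no history gets "None": no evidence, no confidence.
    return [band_for_rate(c / t) if t else "None" for c, t in zip(clean, total)]


def band_for_score(raw_score: float, table: list[str]) -> str:
    """Look up the calibrated band for a new raw score."""
    return table[bucket(raw_score, len(table))]
```

If slices scored 0.95 by the model landed clean only 9 times in 10, a new 0.95 reads as High, not Excellent: the band follows the measured rate, not the raw score.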

det · Deterministic signals, no LLM. Blast radius via static call-graph, test-delta on changed lines, churn, dependency and secret scan, path criticality. Reliable and cheap.
loc · A local model triages the easy ~80%. Description plus soft correctness judgment, on your own hardware, at zero cost. Auto-merges low-risk, well-tested, low-blast slices.
esc · Hard cases escalate to a frontier model. High blast radius, low confidence, or core logic goes to a stronger model, a plugin, or a human. Nothing self-certifies a high-stakes merge.
cal · Calibration is load-bearing. Overconfident raw scores are re-mapped through measured revert and incident rates, so the threshold maps to real risk rather than a model's self-report.
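
The local-versus-escalate split can be sketched as a routing function over the three triggers named above. The field names, the numeric blast-radius threshold, and the choice of which bands count as "low confidence" are all assumptions for illustration.

```python
from dataclasses import dataclass

# Hypothetical threshold: callers reachable from the change in the call graph.
HIGH_BLAST_RADIUS = 50
# Assumption: these bands count as "low confidence" for routing purposes.
LOW_CONFIDENCE_BANDS = {"None", "Low"}


@dataclass
class Signals:
    blast_radius: int          # from the deterministic call-graph pass
    touches_core_logic: bool   # from path criticality
    local_confidence: str      # band reported by the local model


def route(signals: Signals) -> str:
    """Return "escalate" if any trigger from the text fires, else "local"."""
    if signals.blast_radius >= HIGH_BLAST_RADIUS:
        return "escalate"
    if signals.touches_core_logic:
        return "escalate"
    if signals.local_confidence in LOW_CONFIDENCE_BANDS:
        return "escalate"
    return "local"
```

The triggers are OR-ed on purpose: a confident local model does not get to wave through a change to core logic.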

reviews are optional. built-in or plugins (CodeRabbit, Greptile, Snyk): zero, one, or many. they feed the confidence; they don't gate it unless you ask.

// config

One file. Confidence-only by default; require a reviewer when you want a floor.

# cachet.yml
auto_merge:
  require: high              # auto-merge at this confidence or better
  require_pass: [security, secrets]   # hard floors, independent of confidence
review:
  plugins: []                # [] is confidence-only. add [coderabbit, greptile] to enrich
  require_review: []         # [coderabbit] makes a passing review a floor
paths:
  "infra/**": { require: excellent }   # only the highest confidence lands unattended
  "docs/**":  { require: mid }         # cheap to land

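A sketch of how the config above might resolve into a yes/no for unattended merge. The keys and values mirror the `cachet.yml` example; the rule that the strictest matching bar wins across all changed files, the glob matching via `fnmatchcase`, and the function names are assumptions.

```python
from fnmatch import fnmatchcase

BANDS = ["none", "low", "mid", "high", "excellent"]

# The example config from above, as a plain dict.
CONFIG = {
    "auto_merge": {"require": "high", "require_pass": ["security", "secrets"]},
    "paths": {
        "infra/**": {"require": "excellent"},
        "docs/**": {"require": "mid"},
    },
}


def required_bar(changed_paths, config=CONFIG) -> str:
    """Strictest bar over all changed files (assumed resolution rule)."""
    default = config["auto_merge"]["require"]
    bars = []
    for path in changed_paths:
        matched = [
            rule["require"]
            for pattern, rule in config["paths"].items()
            if fnmatchcase(path, pattern)
        ]
        # A file that matches no path rule falls back to the default bar.
        bars.append(max(matched, key=BANDS.index) if matched else default)
    return max(bars, key=BANDS.index) if bars else default


def may_auto_merge(confidence, changed_paths, passed_floors, config=CONFIG) -> bool:
    """Floors must all pass, independent of confidence; then compare bands."""
    floors_ok = all(f in passed_floors for f in config["auto_merge"]["require_pass"])
    bar = required_bar(changed_paths, config)
    return floors_ok and BANDS.index(confidence) >= BANDS.index(bar)
```

Under these assumptions a docs-only slice lands at mid, but the same slice plus one file under `infra/` needs excellent, and no confidence level rescues a failed secrets floor.
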
// trust

Built for a world where the author is an untrusted program.

Unattended merge is only defensible if the chain under it is stronger than a green checkmark.

→ Merge-queue only. No direct merge, ever. The queue tests the combined state before landing, so a slice that is green on its own but would break main cannot land. Speculative batching, auto-bisect on failure.
→ Attested builds. CI runs in TEEs; secrets release only to an attested workload image. A poisoned runner can't read a token out of memory, so provenance alone is no longer the trust boundary.
→ Keyless-signed, tamper-evident. Every commit, score, verdict and merge is signed (Sigstore, Rekor) and appended to a Merkle log. An auto-merge is cryptographically reconstructable after the fact.
→ Non-human identity, done right. Agents hold no long-lived secret: they get short-TTL, sender-bound tokens, down-scoped per operation, and every action is traceable to a model, prompt and run.
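
Why a Merkle log is tamper-evident, as a small sketch. The hashing follows the RFC 6962 tree scheme (a 0x00 prefix for leaves, 0x01 for interior nodes); whether cachet's log uses exactly this layout is an assumption, and the entry strings are made up.

```python
import hashlib


def _leaf_hash(data: bytes) -> bytes:
    # Leaves are prefixed with 0x00 so a leaf can never be read as a node.
    return hashlib.sha256(b"\x00" + data).digest()


def _node_hash(left: bytes, right: bytes) -> bytes:
    # Interior nodes are prefixed with 0x01.
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(entries: list[bytes]) -> bytes:
    """Root hash over an ordered list of log entries."""
    n = len(entries)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return _leaf_hash(entries[0])
    # Split at the largest power of two strictly less than n.
    k = 1
    while k * 2 < n:
        k *= 2
    return _node_hash(merkle_root(entries[:k]), merkle_root(entries[k:]))
```

Changing any logged entry, even by one byte, changes the root, so anyone holding an earlier signed root can detect a rewrite of history.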