Implementation spec for a document-trained, gap-aware, version-controlled credit decisioning platform. Grounds the existing v2 debate architecture in hierarchical regulatory knowledge (EU → BG → practice) with a closed-loop human feedback system and a regulator-ready audit trail.
The first draft of this spec described Phase 0 (knowledge layer) and Phase 1 (citations, gaps, maker/checker, manifest) as the MVP. Both shipped. Since then we've added two operator-facing features and tightened the agent runtime. Read this first — the rest of the spec still describes the original target architecture, with the additions noted in their respective sections.
Citation chips on the analyst page used to render
f7f7d0f0 · 87% — the first eight chars of the chunk
UUID and the relevance score. Useless to a human; actively
confusing in a video demo. Now they render
EBA-GL-2020-06 · §5.2 · 87%.
Where the metadata comes from. The retriever
already returns each chunk's sourceId,
section, and breadcrumb on the
RetrievedPassage shape. The agent's structured
JSON output only carries chunkId (plus optional
relevance + quote) — that's all the
agent needs to choose a citation. The orchestrator does the
last-mile enrichment: enrichCitations() walks the
agent's citation array, looks each chunk id up in the
passagesForFactor set that was retrieved for that
factor, and stitches in sourceId,
section, breadcrumb before BOTH the
SSE emit and the persisted factorDebates.turns
entry. Live viewers and refreshed-replay viewers see the same
breadcrumbs.
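In code terms the enrichment is a map lookup. A minimal sketch with simplified shapes — the real types live in packages/shared:

// Sketch of the last-mile enrichment; shapes are simplified stand-ins
// for the shared RetrievedPassage / citation types.
type RetrievedPassage = { chunkId: string; sourceId: string; section?: string; breadcrumb?: string };
type AgentCitation = { chunkId: string; relevance?: number; quote?: string };
type EnrichedCitation = AgentCitation & Partial<Pick<RetrievedPassage, "sourceId" | "section" | "breadcrumb">>;

export function enrichCitations(
  citations: AgentCitation[],
  passagesForFactor: RetrievedPassage[],
): EnrichedCitation[] {
  const byId = new Map(passagesForFactor.map((p) => [p.chunkId, p] as const));
  return citations.map((c) => {
    const hit = byId.get(c.chunkId);
    // A hallucinated chunk id misses the lookup and passes through
    // unenriched; the frontend falls back to chunkId.slice(0, 8).
    if (!hit) return { ...c };
    return { ...c, sourceId: hit.sourceId, section: hit.section, breadcrumb: hit.breadcrumb };
  });
}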
Hallucination guard. If the agent invents a
chunk id not in the retrieved set, the lookup misses and the
citation passes through with only the raw chunkId. The
frontend's CitationChip falls back to
chunkId.slice(0, 8) in that case AND switches the
chip tone to badge-warning so the operator can see
at a glance that the citation isn't grounded. Tooltip explains.
Display fallback chain. Each chip picks the most-specific label it has (sketched after this list):
- sourceId + section → EBA-GL-2020-06 · §5.2 (preferred)
- sourceId only → EBA-GL-2020-06 (chunker didn't capture section)
- breadcrumb only → Chapter 5 > Article 5.2 (older chunks)
- chunkId.slice(0, 8) → fallback for unenriched / hallucinated chunks; chip switches to warning tone
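The chain as a pure function — a sketch; prop and type names are illustrative, not the actual CitationChip API:

// Picks the most-specific label available and flags ungrounded citations.
type ChipCitation = { chunkId: string; sourceId?: string; section?: string; breadcrumb?: string };

export function chipLabel(c: ChipCitation): { label: string; tone: "neutral" | "warning" } {
  if (c.sourceId && c.section) return { label: `${c.sourceId} · ${c.section}`, tone: "neutral" };
  if (c.sourceId) return { label: c.sourceId, tone: "neutral" };
  if (c.breadcrumb) return { label: c.breadcrumb, tone: "neutral" };
  // Unenriched / hallucinated: short UUID prefix + warning tone.
  return { label: c.chunkId.slice(0, 8), tone: "warning" };
}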
Click-through behaviour is unchanged: the chip still opens the
existing CitationDialog with the source passage,
agent quote, relevance score, and a deep link to the source PDF
/ URL when present.
Each FactorBlock now renders the agents' debate as a
proper visual artifact — not a list of italic prose with
+ / – prefixes. The chronological turn list
is partitioned by role
(partitionFactorTurns()) into approver arguments,
rejector arguments, judge verdict, guidance, and per-agent
clarifications. The two argument cards then sit in a
grid grid-cols-1 md:grid-cols-2 gap-4 with the judge
verdict full-width below.
AgentArgumentCard is the hero. Approver gets a
success-tinted left border (border-l-4 border-success/70
bg-success/5) and Rejector gets the error tone — colour
alone tells the operator who's arguing what at a glance. The
header carries an SVG ConfidenceRing (44px,
stroke-dasharray driven from the agent's self-reported
confidence; the ring colour matches the agent's tone). The body
is the argument prose at text-sm leading-relaxed —
readable, not crammed. Footer surfaces the three-state grounding
badge and citation chips on a divider row, so the regulatory
grounding story is part of every argument the operator scans.
AgentArgumentSkeleton covers the initial-stream
case: the rejector's column shows a skeleton card with a pulsing
"thinking…" cue while the approver lands first, so the layout
doesn't reflow when the second argument arrives. The skeleton is
gated by the SSE phase event — only pulses when the
orchestrator is actually mid-call on that agent.
JudgeVerdictCard renders below the two columns once the per-factor judge has picked. Verdict tone (positive / negative / neutral) drives the wrapper colour so the page reads like a sequence of debates with clear outcomes. Judge runs after both arguments land, so this card never shows a skeleton — the slot stays empty until the verdict exists.
Multi-round clarifications. If an agent ran more
than once (initial low-confidence answer → operator clarification →
refined answer), only the LATEST argument is the hero card. A
small round X of Y chip in the card header signals
the argument was refined. The system / human clarification turns
sit chronologically below the card in the same column, keeping
the mini-thread under its parent agent rather than floating
between the two debate sides.
Timeline placement. The DebateTimeline moved
from a top banner above the factor stack into a sticky right
sidebar (w-72, lg:sticky lg:top-20).
The analyst page is now a two-column flex
(lg:flex-row lg:items-start): main column with
factor cards / final decision / panels, and an aside that
stays pinned to the topbar as the operator scrolls. On
screens narrower than lg the layout collapses
and the timeline drops above the main content via
flex-col-reverse. Outer container widened from
max-w-3xl to max-w-6xl to give the
sidebar room without squeezing the factor cards.
Solid badges. Every badge-soft
modifier across the app — Debates, Clarifications, Evals,
Corpus, the analyst page, the timeline, the inbox — was
stripped. The plain badge badge-{semantic}
classes now render as filled coloured pills with the theme's
--color-*-content as text (light text on the
coloured background; warning is the deliberate exception
because corporate's warning-content is dark, on yellow). The
previous tinted look inverted the corporate theme's intent.
New DebateTimeline component renders a vertical
daisyUI timeline (timeline timeline-vertical timeline-compact
timeline-snap-icon) above the factor stack on both the live
and replay analyst views. One row per factor plus a final
"Aggregation" step. Each step has three states: pending (hollow
base-300 dot), active (filled primary dot + animated
loading-dots sub-label like "Approver thinking…"), or
done (filled primary check icon). The connector segment between
two done steps colours primary; otherwise base-300. State derives
directly from the existing SSE phase event + the
per-factor verdict the SSE hook already tracks — no
new backend events.
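Sketched as a pure function over state the SSE hook already holds — names (PhaseEvent, the verdicts map) are assumptions:

// Derive one timeline step's state from the latest phase event + verdicts.
type PhaseEvent = { phase: "retrieving" | "thinking" | "aggregating" | "idle"; factor?: string };
type StepState = "pending" | "active" | "done";

export function stepState(
  factor: string,
  phase: PhaseEvent,
  verdicts: Record<string, string | undefined>,
): StepState {
  if (verdicts[factor]) return "done"; // judge verdict already landed
  if (phase.factor === factor && phase.phase !== "idle") return "active"; // orchestrator mid-call
  return "pending";
}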
Theme alignment sweep across the analyst chrome.
The Re-run button moved from a hand-rolled bordered pill to a
proper btn btn-outline btn-sm with an iconify rotate
icon and inline spinner — picks up the corporate theme's square
corners. Every text-gray-N, bg-emerald-50,
border-blue-200, bg-violet-50, etc., on
the analyst page and its companion components
(FactorDebateAccordion, ReviewPanel, FinalDecisionCard,
StreamingProgressBar, ClarificationInlineForm, CitationDialog,
GapsPanel, ManifestPanel, AgentCard) was bulk-converted to daisyUI
semantic tokens (text-base-content,
bg-success/10, border-info/30,
bg-secondary/10). Stray rounded-md /
rounded-lg / rounded-full utilities
stripped — the theme's --radius-*: 0rem tokens are
now load-bearing across the whole debate view. "High Risk" and
"Needs review" pills swapped to badge badge-error
badge-soft / badge badge-warning badge-soft.
/decision/new used to be a single JSON editor with six
preset buttons up top. It is now an operator-shaped inbox:
a daisyUI table where each row is a credit case with applicant
signals at a glance (income, credit score, DTI%, loan amount,
employment + purpose snippet) and per-row Run / Edit / Delete
actions. Last-run status renders as a click-through link to the
analyst replay so the operator can re-open whatever a case
produced last.
The case data layer (src/lib/cases.ts) seeds the six
original presets as immutable rows and stores any operator-
created drafts in localStorage. A second
last-run map (also localStorage) keeps the per-case
session id + timestamp so the inbox never has to call the
backend to render its status column.
The JSON editor moved to a secondary route at
/decision/new/case/[id] — three modes: "new"
creates a blank custom case, preset-… renders read-only
with a "Save as new case" button (presets stay pristine), and
custom-… is fully editable with Save / Run / Delete.
Structured form fields up top, raw JSON editor collapsed below
for paste-from-outside flows. Sidebar label updated to "Cases"
(lucide-inbox icon). Backend untouched — Run still POSTs to
/decision exactly as before.
Three control + visibility upgrades shipped together.
Cancel debate. New POST /decision/:id/cancel
endpoint sets a per-session flag in an in-process Set; the orchestrator
checks the flag at safe checkpoints (between factors, before each agent
call, before final aggregation) and throws a CancelledError.
The outer catch identifies it via the brand and takes a different path
from real failures — no rule-based fallback, just emit a
'cancelled' SSE event, mark the session
CANCELLED (new SessionStatus variant), drop
any pending clarifications so an awaitClarification()
currently blocking resolves immediately, write the manifest, exit.
Frontend Cancel button lives in the analyst topbar and unmounts the
moment the SSE acknowledgement lands.
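A minimal sketch of the cancellation primitive, following the description above — identifiers are illustrative:

// Per-session cancel flags in an in-process Set.
const cancelled = new Set<string>();

class CancelledError extends Error {
  readonly isCancelled = true; // the "brand" the outer catch checks
}

export function requestCancel(sessionId: string) {
  cancelled.add(sessionId);
}

// Called at safe checkpoints: between factors, before each agent call,
// before final aggregation.
export function checkpoint(sessionId: string) {
  if (cancelled.has(sessionId)) throw new CancelledError("debate cancelled");
}

export function isCancelled(err: unknown): err is CancelledError {
  return typeof err === "object" && err !== null && (err as CancelledError).isCancelled === true;
}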
Live activity indicator. The StreamingProgressBar
is gone. Replaced by a daisyUI loading-dots chip in the
analyst page header ("Approver thinking on Income vs Loan Amount…")
plus an inline placeholder turn that mounts inside the
FactorBlock matching the currently-active factor. Both
share state via a new 'phase' SSE event the orchestrator
emits at every transition (retrieving /
thinking / aggregating / idle).
Stateless — every emission overwrites the prior value; the SSE hook's
phase field drives both views off one source of truth so
they never desync.
DaisyUI corporate theme. The two custom themes (light
+ dark) were replaced by daisyUI's stock corporate
theme — neutral whites/grays, confident blue primary, square corners,
no shadows. Light-only for now; dark mode is a follow-up if a
customer asks. The legacy ThemeToggle remains in the UI
but is currently a no-op since only one theme is registered (cycling to
"dark" falls back to the default). Removing the toggle is a small
polish task.
The frontend has been rebuilt on top of a purchased
Scalo daisyUI/Tailwind 4 template (Denish Navadiya). Every
page now renders inside a consistent shell: a sidebar + topbar
AppShell for the authenticated views, a sticky-blur
LandingTopbar for the public site, daisyUI tokens for
every colour/badge/button/alert. New ConfigContext +
ThemeToggle drive light / dark / system theme via the
data-theme attribute on <html>;
daisyUI reads it and swaps colour tokens.
Stack changes: Tailwind 3 → 4 (CSS-based
@theme config, no JS config needed), added
daisyUI 5, @iconify/tailwind4 with the
Lucide icon set, tailwindcss-motion,
simplebar-react, swiper. PostCSS plugin
swapped to @tailwindcss/postcss. Next.js stays at 14,
React stays at 18 — the template's stack is forward-compatible and
the upgrades aren't required.
New routes: / is now the public
landing (Hero + Process + Features + Capabilities + Pricing + CTA);
the demo decision form moved to /decision/new. Existing
routes (/admin, /admin/login,
/analyst/decision/:id, /decision/:id) are
wrapped in the new layout but keep the same logic, hooks, and SSE
flows — no functional regressions. Inline-clarification flow,
citation chips, eval dashboard, maker/checker review, version
manifest viewer all still work; only chrome changed.
Anthropic does not ship an embedding API of its own and recommends Voyage. We default
to voyage-3-lite, 512-dim, with a 200M-tokens/month free tier — comfortably
more than enough for the entire current corpus. The knowledge_chunks.embedding
column is VECTOR(512) after migration 012_voyage_embeddings.sql.
Adapter is a 30-line REST shim in backend/src/knowledge/embedder.ts
— no SDK dependency.
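The shim reduces to one fetch call. A hedged sketch assuming Voyage's OpenAI-style /v1/embeddings contract — verify against the Voyage docs before reusing:

// Minimal REST shim for Voyage embeddings; no SDK dependency.
const VOYAGE_URL = "https://api.voyageai.com/v1/embeddings";

export async function embed(texts: string[], apiKey: string): Promise<number[][]> {
  const res = await fetch(VOYAGE_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify({ model: "voyage-3-lite", input: texts }),
  });
  if (!res.ok) throw new Error(`voyage embeddings failed: ${res.status}`);
  const json = (await res.json()) as { data: Array<{ embedding: number[] }> };
  return json.data.map((d) => d.embedding); // 512-dim vectors
}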
The original Phase 1 spec described a green Grounded / amber No grounding binary. Real debates exposed a third state: the retriever DID return passages but the agent declined to cite any. Conflating "no corpus material existed" with "agent chose not to lean on it" was misleading. The badge is now three-state: grounded (the agent cited retrieved passages), retrieved-but-uncited (passages came back above threshold but the agent cited none), and no grounding (nothing retrieved).
Backend emits retrievalCount on every factor_turn event so
the UI can compute the three states without an extra round-trip.
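The UI-side computation is two comparisons — a sketch; the state names are illustrative:

// Derive the three-state grounding badge from fields the factor_turn carries.
type GroundingState = "grounded" | "retrieved_uncited" | "no_grounding";

export function groundingState(retrievalCount: number, citationCount: number): GroundingState {
  if (citationCount > 0) return "grounded"; // agent cited corpus material
  if (retrievalCount > 0) return "retrieved_uncited"; // passages existed, agent declined
  return "no_grounding"; // nothing above threshold
}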
New in this iteration. When an Approver or Rejector returns confidence < 0.65
AND populates clarification_request: { question }, the orchestrator pauses
the debate, emits a clarification_request SSE event, and awaits a human
answer (or skip / 5-min timeout / cap). The answer is appended to the prompt and the
agent re-runs.
Cap is configurable (default 2 rounds per agent per factor). Both the question and the
answer are persisted as system / human turns in the factor's
timeline, alongside the agent arguments — full audit trail. See section 5
(Clarification Flow) below.
The pause prompt no longer pops up a backdrop dialog. The system question already lands
in the timeline as a speaker: 'system' turn; the answer form now mounts
right under it as part of that same blue mini-thread. Same submit/skip behaviour, same
409-on-late-answer handling, same SSE-driven unmount when
clarification_resolved fires — but the operator's eyes never leave the
debate. New ClarificationInlineForm component; the old
ClarificationModal is removed.
/admin — New in this iteration. Lives behind a shared bcrypt-hashed password + signed cookie. Two tabs: Corpus (upload / inspect / delete knowledge bundles) and Settings (live edits to the runtime_settings table — no redeploy needed). See section 6 (Operator Console) below.
New runtime_settings table (migration 013) stores tunable values keyed by
the same names as the env vars they replace. The retriever and orchestrator read through
a 60s in-memory cache; writes invalidate immediately on the writer's machine and
propagate via TTL to others. Static config.ts values become fallbacks for
unset rows. This is what the Settings tab edits.
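A minimal sketch of that read-through cache, assuming an injected query helper:

// 60s TTL read-through cache over runtime_settings; writes invalidate
// locally and converge elsewhere via TTL expiry.
const TTL_MS = 60_000;
const cache = new Map<string, { value: string; expires: number }>();

export async function getSetting(
  key: string,
  queryDb: (key: string) => Promise<string | null>,
  fallback: string, // static config.ts value for unset rows
): Promise<string> {
  const hit = cache.get(key);
  if (hit && hit.expires > Date.now()) return hit.value;
  const row = await queryDb(key);
  const value = row ?? fallback;
  cache.set(key, { value, expires: Date.now() + TTL_MS });
  return value;
}

export function invalidate(key: string) {
  cache.delete(key); // called on writes; other machines pick up on next miss
}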
Every citation chip in the analyst view is now a button. Clicking it opens a modal with
the full retrieved passage, source ID (linked to the original PDF if a
source_url was provided at ingest), tier badge, jurisdiction, version,
breadcrumb, and the agent's quoted snippet for comparison. Backed by
GET /knowledge/chunks/:id.
LLMs occasionally produce malformed structured output — overshoot the
quote length cap, wrap the JSON in ```json fences, or jam two
quoted strings together with " and ". The orchestrator now wraps each agent call
in a per-factor try/catch: if parse still fails, that one speaker degrades
to a stub with known=false, confidence=0 and the debate continues. The
sanitiser strips fences and collapses the multi-quote pattern before Zod sees it.
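A sketch of that sanitiser pass — the regexes are assumptions modelled on the failure modes above, not the shipped code:

// Pre-Zod cleanup of malformed structured output.
export function sanitiseAgentJson(raw: string): string {
  let s = raw.trim();
  // Strip ```json ... ``` fences the model sometimes wraps output in.
  s = s.replace(/^```(?:json)?\s*/i, "").replace(/\s*```$/, "");
  // Collapse the jammed multi-quote pattern: "..." and "..." → one string.
  s = s.replace(/"\s+and\s+"/g, " — ");
  return s;
}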
The original spec described an AWS deployment topology. For partner-testing we shipped
on Fly.io instead — two apps (frontend, backend) + Neon Postgres for pgvector. ~30
minutes to deploy, ~$0/mo on hobby tiers. Dockerfiles and fly.backend.toml
/ fly.frontend.toml live at the monorepo root. DEPLOY.md
contains the full runbook. AWS path remains the eventual target; Fly is the proving ground.
The factor list is the same five from v2: Income vs Loan Amount, Credit Score, Existing
Debt / DTI, Missed Payments, Employment Stability. The orchestrator's "skip remaining
factors after N negative verdicts" optimisation is now keyed off the
EARLY_EXIT_NEGATIVE_THRESHOLD runtime setting (default 3). Set it to a
value ≥ 5 in the admin panel to force a full sweep — useful for analyst review at the
cost of double the LLM tokens on rejections.
Every clarification interaction is recorded in a new
clarification_events table (migration 015): question, answer,
status (answered / skipped / timeout /
capped), confidence_before,
confidence_after backfilled when the agent re-runs, computed
confidence_delta as a generated column, and the wall-time the
operator took to respond. Reserved analyst_feedback column is
stubbed for the next phase.
A new admin Clarifications tab surfaces these events with per-factor / per-status filters and four roll-up stats (total, answered %, skipped+capped, average confidence Δ). Read-only for now; analyst feedback marks and similarity-based prompt injection (the actual "learning") are the next two steps. The data we collect now is what makes those work — building the substrate before the smarts.
The first Bondora eval (20 rows) revealed the system rejected 90% of applicants and never produced an APPROVE via the LLM debate — both approvals in the run came from the rule-based fallback. Defaulter recall was 92% (great); good-loan recall was 14% (commercially non-viable). Diagnosed three structural causes in the prompts and rebalanced the judge's aggregation rule to compensate:
3+ positives AND 0-1
negatives → APPROVE; 3+ negatives AND 0-1 positives
→ REJECT; everything else → REVIEW. Neutrals don't count
as negatives. Default-to-REJECT bias is explicitly called out
as a failure mode; REVIEW is the safe action under uncertainty.
Operational change to make alongside the prompt rebalance: set
EARLY_EXIT_NEGATIVE_THRESHOLD to 99 (or any value ≥
FACTORS.length) in the admin Settings tab. The default (3) was
bailing out before the positive-leaning factors at the END of the
list (Missed Payments, Employment Stability) ever got a vote.
Cost: doubles LLM tokens on rejections. Benefit: positive signals
get heard.
Re-run the Bondora eval after deploying these changes. Expectation: the LLM should now produce some APPROVE decisions on rows where credit + DTI + clean history clearly support it (e.g. row 11 from the first run — credit 720, DTI 9.9%, employed, 0 missed).
Bug fix surfaced via re-runs. Previously, the version manifest
was only written in the orchestrator's success path. If the
debate threw — for any reason — the rule-based fallback engine
produced a decision but no
decision_version_manifest row got written, leaving
an audit-trail hole for exactly the cases that need it most.
The settings snapshot (prompt-set hash, guardrail-set hash,
retrieval params, bundle IDs touched) is now hoisted out of
the inner try block and a single writeManifest()
helper is called from BOTH the success path AND the fallback
path. Best-effort either way — a manifest-write failure logs
but never affects the decision delivered to the user.
First version of the guidance-injection feature put the prior
Q&As into the prompt invisibly — the operator couldn't tell
whether the system had used institutional memory or just got
lucky. Now the orchestrator emits a new
factor_turn with clarificationKind: 'guidance'
+ a guidanceItems array right before the approver runs.
The analyst view renders it as a violet "📚 Prior operator guidance
· N items" row — collapsed by default, click to expand the actual
Q&A pairs the agents had access to. Persisted into
factorDebates.turns so refresh / replay shows the
same context.
First piece of the clarification learning loop's "actual learning"
phase. Before each factor runs, the orchestrator queries
clarification_events for recent ANSWERED Q&As on
the same factor (default lookback 30 days, top 5 by recency,
filtered to rows where the answer measurably moved confidence).
These are rendered into a new HISTORICAL OPERATOR GUIDANCE
block in both Approver and Rejector prompts. The agents are
instructed to treat the operator's standing answer as institutional
policy and stop re-asking what's already been resolved.
Tunable via three new runtime settings (migration 017):
GUIDANCE_INJECTION_ENABLED (default on),
GUIDANCE_LOOKBACK_DAYS (1–365, default 30),
GUIDANCE_MAX_ITEMS_PER_FACTOR (default 5). Set the
first to false for stateless A/B-test runs against
the original behaviour.
Where this fits in the wider learning roadmap: this is step 1 of 4 (recent guidance injection). Steps 2–4 — analyst feedback marks, question quality scoring, similarity retrieval — wait for more labelled data to be worth building.
Small UX fix. The RerunDebateButton component now
accepts a sourceRunning prop fed from the SSE hook
— true while there's no final_decision and no error
— and renders disabled with a tooltip until the original debate
settles. Prevents a class of double-debate confusion on the same
applicant.
Migration 016 adds application_payload JSONB to
debate_sessions. The POST /decision handler now
persists the full application JSON on the session row at creation time.
New POST /decisions/:id/rerun route reads it back, claims a
fresh session (deliberately skipping idempotency so each rerun is its
own row), and launches a new debate on the same applicant. The analyst
page's "Re-run Debate" button — previously a permanently disabled stub
across four hard-coded copies — is now a single
RerunDebateButton component wired to the new endpoint, with
a graceful 409 path for sessions that predate the migration.
Useful beyond the button: per-decision A/B testing (change a setting, re-run, compare), and the eval harness can now self-replay against new prompts without touching the source dataset.
New Debates tab in the admin console lists every
debate_sessions row newest-first, joined with the four-eyes
review status, with filters by session status and final decision.
Click-through opens the analyst route in replay mode
(/analyst/decision/:id?replay=true) which renders the
full timeline from the persisted judge_output — including
the system/human clarification turns we now save inline.
New npm run eval:bondora CLI reads the public Bondora P2P
loan-data CSV, maps each historical loan to a CreditApplication,
runs the orchestrator, and scores the agent's decision against the realised
outcome (Repaid vs Default). Outputs a confusion-matrix scoreboard with
accuracy, false-approval rate, expected loss per €1000 lent, and average
debate latency. First time the system has been measured against ground
truth instead of vibes. See section 5.4 for the metrics list and a data-leakage caveat.
The eval CLI has graduated to an admin-panel feature. New
Evals tab at /admin with three blocks: a
dataset list (CSV uploads with a per-dataset cursor showing X/Y rows
processed), a run launcher (pick dataset + row limit, fire), and a runs
history. Click any run to see its confusion matrix and every per-row
result, with a one-click link into the live debate replay for the
underlying session. New tables: eval_datasets,
eval_runs, eval_run_rows (migration 018).
Cursor-based, no duplicates. Each run consumes the next N
mappable rows past dataset.processed_count, advances the
cursor at the end, and records the slice (cursor_start →
cursor_start + row_limit) on the run row. Hitting "Start" five
times in a row scores rows 0-19, 20-39, 40-59, 60-79, 80-99 — never the
same row twice. Failed and unmappable rows still consume the cursor (the
LLM cost was paid; replaying won't help). Re-uploading the same CSV is a
no-op via SHA-256 dedupe, so the cursor isn't lost.
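The bookkeeping is small enough to show whole — a sketch with simplified shapes:

// Plan the slice for a new run: skip rows already consumed.
type Dataset = { processedCount: number };

export function planRun(dataset: Dataset, rowLimit: number): { cursorStart: number; cursorEnd: number } {
  const cursorStart = dataset.processedCount;
  return { cursorStart, cursorEnd: cursorStart + rowLimit };
}

// Advance by rows CONSUMED (mappable or not): failed and unmappable rows
// already cost their LLM tokens, so they are never retried.
export function advanceCursor(dataset: Dataset, rowsConsumed: number): Dataset {
  return { processedCount: dataset.processedCount + rowsConsumed };
}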
The runner is the same code path as the existing CLI (extracted into
backend/src/eval/{csv,datasets,runner}.ts), kicked off
async via setImmediate so the POST returns the run id
immediately. The UI polls every 4 s while any run is queued/running.
Backend stdout now uses ANSI-colored category tags ([db], [llm],
[embed], [route], [gaps], [retrieval],
[orch], [manifest], [ingest], [citation],
[admin], [eval]) plus magnitude-graded duration colors (green <100 ms, yellow
<1 s, red ≥1 s). Auto-disables on non-TTY pipes; force on with FORCE_COLOR=1.
A junior analyst asks senior staff when stuck. The Approver and Rejector now do the same. Each agent's structured output gained an optional fourth field:
{
argument: string,
known: boolean,
confidence: number (0..1),
citations: Citation[],
clarification_request: { question: string } | null // NEW
}
The agent populates clarification_request when it would otherwise have to
guess — specifically, when there's a focused, answerable thing a senior credit/risk
officer could tell it that would change the argument. The system prompts include
examples of good vs bad questions ("Does our policy treat 6-month gig income as stable?"
vs "Is the credit score good?").
When the orchestrator sees confidence < CLARIFICATION_THRESHOLD (default
0.65, runtime-tunable) AND a non-null clarification_request, it:
1. Emits a factor_turn with speaker: 'system' + the question (timeline persistence), and a clarification_request event with round metadata (which the analyst view's useDebateStream hook stores keyed by ${factor}::${speaker}).
2. Calls awaitClarification(), which registers a Promise resolver in an in-memory map keyed by (sessionId, factor, speaker), with a 5-minute timeout.
3. The frontend mounts ClarificationInlineForm directly underneath the system-question turn — no modal, no backdrop. The form lives inline in the blue mini-thread. Human types an answer or hits Skip.
4. The form POSTs /decision/:id/clarify with { factor, speaker, answer, reason } (JSON) or, when the operator attaches evidence, the same fields plus an attachment file part (multipart). The route handler calls resolveClarification() which fires the Promise.
5. The orchestrator emits a clarification_resolved event (the hook removes the pending entry, which unmounts the inline form) plus a factor_turn with speaker: 'human' + the answer (timeline persistence), then re-runs the same agent with the Q&A appended to the prompt.
6. Repeats until the per-factor cap (MAX_CLARIFICATIONS_PER_FACTOR, default 2) is hit. A sketch of the coordination primitive follows this list.
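A minimal sketch of the in-memory coordination primitive — the key is flattened to a string here for brevity; the real map is keyed by (sessionId, factor, speaker):

// Promise-resolver map bridging the orchestrator and the /clarify route.
type ClarKey = `${string}::${string}::${string}`; // sessionId::factor::speaker
type Answer = { answer: string | null; status: "answered" | "skipped" | "timeout" };

const pending = new Map<ClarKey, (a: Answer) => void>();

export function awaitClarification(key: ClarKey, timeoutMs = 5 * 60_000): Promise<Answer> {
  return new Promise((resolve) => {
    const timer = setTimeout(() => {
      pending.delete(key);
      resolve({ answer: null, status: "timeout" });
    }, timeoutMs);
    pending.set(key, (a) => {
      clearTimeout(timer);
      pending.delete(key);
      resolve(a);
    });
  });
}

// Called by the POST /decision/:id/clarify route handler.
export function resolveClarification(key: ClarKey, a: Answer): boolean {
  const resolver = pending.get(key);
  if (!resolver) return false; // late answer → route returns 409
  resolver(a);
  return true;
}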
Per-decision timeline: question and answer are first-class turns
in factor.turns[], with speaker: 'system' | 'human' and a
clarificationKind: 'request' | 'response' tag. They render as a
blue-bordered mini-thread in the analyst view ("✋ System asked" / "🗣 Operator
answered" + reason badge); while a request is still pending, the
ClarificationInlineForm mounts inside that same mini-thread directly
under the question, so the operator answers in-context. Turns persist into
judge_output.factorDebates so a refresh shows the same chronological
story. The version manifest is unaffected — clarifications are operator interventions,
not corpus material.
Cross-decision dataset: every clarification also writes a row
to clarification_events (migration 015) capturing
session_id, factor, speaker,
round, question_text, answer_text,
answer_status, confidence_before,
confidence_after (backfilled when the agent re-runs),
confidence_delta (generated column), and
time_to_answer_ms. The admin Clarifications tab reads from this
table; the next phase of the learning loop uses it for analyst feedback marks
and per-factor "recent guidance" prompt injection. Writes are best-effort —
failure logs but never blocks the debate.
The clarification form also accepts an optional file alongside the
typed answer — used when the borrower has supplied evidence (payslip,
bank statement, ID photo, screenshot of HMRC tax record). One file
per answer, capped at 10 MB, MIME restricted to
application/pdf, image/jpeg,
image/png, and image/webp. Both client and
server enforce the cap; oversized uploads return 413,
wrong types return 415.
Files land on the backend's persistent volume (CLARIFICATION_ATTACHMENTS_DIR,
e.g. /data/clarification-attachments/{sessionId}/{slug}),
and migration 020_clarification_attachment.sql adds four
nullable columns to clarification_events:
attachment_path, attachment_mime,
attachment_size_bytes, attachment_original_name.
A CHECK constraint enforces all-or-nothing — a half-populated row
can't slip through.
How the LLM consumes the file: when the orchestrator
re-prompts the agent on the next round, an ATTACHED EVIDENCE
block is synthesised into the operator's answer. For PDFs the file
is run through pdf-parse and the extracted text is
inlined (capped at ~12k chars, with a "…[truncated]" marker), so
the agent reads the actual document content. For images the agent
sees only a labelled note (filename + size + "the operator attached
visual evidence; rely on the typed answer") — vision-aware re-prompts
are a future upgrade. PDF parse failures fall back to the
image/note path so a corrupted upload never kills the debate.
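A sketch of that synthesis step, assuming pdf-parse's default export and the caps described above:

// Build the ATTACHED EVIDENCE block injected into the re-prompt.
import pdfParse from "pdf-parse";
import { readFile } from "node:fs/promises";

const PDF_TEXT_CAP = 12_000; // ~12k chars before truncation

export async function evidenceBlock(path: string, mime: string, name: string, size: number): Promise<string> {
  if (mime === "application/pdf") {
    try {
      const { text } = await pdfParse(await readFile(path));
      const body = text.length > PDF_TEXT_CAP ? text.slice(0, PDF_TEXT_CAP) + "\n…[truncated]" : text;
      return `ATTACHED EVIDENCE (${name}, ${size} bytes):\n${body}`;
    } catch {
      // Corrupted PDF → fall through to the image/note path below.
    }
  }
  return `ATTACHED EVIDENCE: ${name} (${size} bytes) — the operator attached visual evidence; rely on the typed answer.`;
}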
Audit surfaces: the human-response timeline turn
gains an attachment field ({eventId, mimeType,
sizeBytes, originalName}) so the analyst page renders an
inline link to the file alongside the typed answer. The admin
Clarifications tab renders the same link on each event card. The
backend exposes
GET /decision/:id/clarifications/:eventId/attachment
which streams the file with Content-Disposition: inline
so PDFs and images open in a new tab. The agent prompt + the
auditor's view stay decoupled — the LLM never sees image bytes;
humans always see the original file.
In-memory only. Works because Fly's request affinity keeps the SSE stream and the
/clarify POST on the same machine. If we ever shard runDebate
across machines, this graduates to Redis pub/sub or a DB-backed coordination primitive.
For two-operator testing on a single Fly machine, the in-memory map is fine. Attachment
storage carries the same single-machine assumption — when we shard, the volume
mount becomes S3/R2.
Lives at /admin on the frontend. Five tabs (Corpus, Debates,
Clarifications, Evals, Settings), shared bcrypt-hashed password, signed httpOnly
cookie session. Replaces the prior workflow where every corpus tweak required SSH
+ CLI + restart.
Auth model:
- npm run admin:hash -- 'password' locally → bcrypt hash.
- The hash lives in the ADMIN_PASSWORD_HASH env on Fly. Plaintext never touches the server.
- The session cookie is signed with COOKIE_SECRET (generated by openssl rand -hex 32).
- Every /admin route runs behind a requireAdmin Fastify preHandler.
- Rotating COOKIE_SECRET invalidates every existing session.
Table of every knowledge_bundle row joined with chunk metadata
(representative tier / jurisdiction / source_id / source_url / version pulled via
MIN(c.<col>) across the bundle's chunks). Source ID is linked to the
external source_url when present. Inline confirm-delete cascades to chunks
via FK.
Upload form: PDF picker, tier dropdown with inline explainer ("Tier 1 = regulator,
Tier 4 = academic"), jurisdiction text, source-id text, optional version + source-url.
Submits multipart to POST /knowledge/upload, which streams the file to
/tmp, runs the existing ingestDocument() pipeline (loader →
chunker → embedder → DB writer), and returns the standard IngestReport.
Authenticated callers only — an attacker can't fill our DB or burn our embedding tokens
via curl.
The tab now ships in three sections, all reading from the same
runtime_settings table:
1. Typed setting widgets — widget chosen by type: toggle for boolean, slider for known-bounded numbers, text input otherwise, jurisdiction dropdown for the namesake key.
2. Model picker — LLM_MODEL renders a whitelist dropdown of pricing-table-known Claude models (Haiku 4.5, Sonnet 4.x, Opus 4.x) with per-token cost displayed inline so the operator can compare. Empty value means "use the env-derived default." llmClient.ts caches the built ChatAnthropic / ChatOpenAI client by (provider, model) tuple and rebuilds when the operator's edit lands — no redeploy.
3. Prompt overrides — one editor per agent system prompt (APPROVER_SYSTEM_PROMPT etc., seeded by migration 023). Each shows a default / custom badge, character count, and a hidden-by-default disclosure showing the in-code default text (fetched from GET /admin/prompts/defaults). "Copy default into editor" forks the canonical text; "Clear override" empties the textarea so saving re-engages the default. The orchestrator's computePromptSetHash() reads from the same source so every edit bumps the manifest's prompt_set_hash.
Every row shows the plain-English description seeded by migration 013 / 014. Save
button activates only when the value differs from the saved one. Updates go to
PUT /admin/settings/:key; the response returns the new
updated_at + updated_by fields, displayed under the widget.
Currently exposed knobs:
RETRIEVAL_ENABLED,
RETRIEVAL_TOP_K,
RETRIEVAL_MIN_SIMILARITY,
DEFAULT_JURISDICTION,
CLARIFICATION_THRESHOLD,
MAX_CLARIFICATIONS_PER_FACTOR,
EARLY_EXIT_NEGATIVE_THRESHOLD,
GUIDANCE_INJECTION_ENABLED,
GUIDANCE_LOOKBACK_DAYS,
GUIDANCE_MAX_ITEMS_PER_FACTOR,
LOG_SQL.
Newest-first listing of every debate session, joined with
decision_reviews for the four-eyes status. Filters: by
session status (Completed / Failed / Running) and by final decision
(APPROVE / REJECT / REVIEW). An "Include eval runs"
checkbox surfaces sessions produced by eval batches; off by default
so live operator activity isn't drowned by them. Backed by
GET /admin/debates?status=&decision=&includeEval=&limit=.
Roll-up strip at the top: total / approved / rejected / review /
fallback-used / cost-loaded counts over the loaded set. Each row
shows session UUID prefix, application id, status pill, decision
pill (with a fallback tag if the rule engine had to
step in, plus a from eval tag when surfaced via the
toggle), the first few decision tags, four-eyes review status,
duration, USD cost (with input/output token breakdown on hover),
and created-at. View action is an eye icon in the rightmost
column — opens replay (or live, for in-flight debates).
Eval-vs-live is tracked via the
debate_sessions.eval_run_id column added in migration
021 — eval runner sets it; the live /decision route
leaves it NULL. Cost columns are populated by migration 022 +
withUsageTracking in the orchestrator: each LLM call
records token usage to an AsyncLocalStorage scope, and the finally
block sums + persists at debate end (success, fallback, or
cancellation) using the per-model pricing table in
agents/usage.ts.
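A sketch of the scope mechanics with simplified shapes — not the exact agents/usage.ts API:

// AsyncLocalStorage scope that accumulates token usage across a debate.
import { AsyncLocalStorage } from "node:async_hooks";

type Usage = { inputTokens: number; outputTokens: number };
const scope = new AsyncLocalStorage<Usage>();

export async function withUsageTracking<T>(fn: () => Promise<T>): Promise<{ result: T; usage: Usage }> {
  const usage: Usage = { inputTokens: 0, outputTokens: 0 };
  const result = await scope.run(usage, fn);
  return { result, usage }; // the orchestrator's finally block persists this
}

// Each LLM call records its token counts into the active scope.
export function recordUsage(inputTokens: number, outputTokens: number) {
  const u = scope.getStore();
  if (!u) return; // outside a tracked debate — ignore
  u.inputTokens += inputTokens;
  u.outputTokens += outputTokens;
}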
Failed debates render a translated banner instead of the raw
ERROR_TRACE when one of the well-known LLM provider
errors fires (out of credit, rate limit, invalid API key). The
friendlyLlmFailure() helper in the orchestrator
tags these into the fallback reasoning; the analyst page's
FinalDecisionCard renders a coloured alert with the
operator-readable message and tucks any unrecognised stack
trace behind a "Show debug trace" disclosure.
Click-through navigates to the existing analyst route:
/analyst/decision/:id?replay=true for completed/failed
debates (loads from persisted judge_output), or the live
streaming view for in-flight ones. No new view code — replay was already
built; this tab just makes it discoverable.
Read-only feed of every clarification event written to
clarification_events, newest first. Filters: factor (dropdown of the
five debate factors), status (answered / skipped /
timeout / capped), and an
"Include eval runs" toggle (off by default). Eval batches
run with maxRounds=0 and produce a flood of capped
rows when an agent reports low confidence; hiding them by default keeps
the operator's own clarification history readable. The toggle is implemented
as a LEFT JOIN debate_sessions ON s.id = e.session_id in
listEvents with a default WHERE s.eval_run_id IS NULL.
Cards from eval runs render a from eval badge when surfaced.
Backed by GET /admin/clarifications?factor=&status=&includeEval=&limit=.
Header strip shows four roll-up stats over the loaded set: total events, answered count + percentage, skipped+capped count, and average confidence Δ. The avg-Δ figure is the closest thing to "are these questions actually useful?" you can read at a glance — green if ≥+5%, red if negative.
Each event renders as a card: status pill + factor + speaker + round, the
question and answer text in a blue mini-thread, and a footer row with
confidence_before, confidence_after, the delta, and
the operator's response latency. Future enhancements (analyst feedback marks,
similarity-based prompt injection) build on this surface.
Three-block layout. Datasets at the top: list of uploaded CSVs
(one row per eval_datasets entry), each showing
processed_count / total_rows as a progress bar so you can see at a
glance how much of the dataset is left. Inline upload button — multipart POST
to /admin/eval/datasets, files written to EVAL_DATA_DIR
(a Fly volume in prod, OS tmp in dev), SHA-256 deduped so re-uploads no-op.
Run launcher: dataset dropdown + row-limit input + a single Start
button. Submits POST /admin/eval/runs with
{ datasetId, rowLimit }. The backend creates an
eval_runs row in queued status, fires the worker on
setImmediate, and returns the row id. The runner skips
cursor_start CSV rows (= dataset.processed_count at POST
time), then collects up to row_limit mappable rows and runs each
through runDebate(...,{ noClarifications: true }) sequentially. Per-row
results land in eval_run_rows as they complete; aggregate metrics
(TP/TN/FP/FN, accuracy, expected loss, net €) are computed and written to
eval_runs at the end. The cursor advances by the number of CSV rows
consumed (not just successful) so failed rows aren't retried on the next run.
Runs history: newest-first table of all runs (optionally filtered
by clicking a dataset above). Each row shows started timestamp, status pill, slice
indices, the four confusion-matrix counts, accuracy, and net €. Clicking a run opens
the run detail view: confusion-matrix grid up top (TP/TN/FP/FN +
REVIEW + failed + accuracy + net €) and a per-row table below with the agent's
decision, the realised outcome, the bucket, and a one-click link to the live debate
replay (/analyst/decision/:session_id?replay=true) so you can drill into
why any individual call went the way it did. While the run is queued/running the UI
polls every 4 s.
What's NOT in the MVP: agent-written interpretation of the results (deferred — we want analysts to look at the per-row data first), file attachments on clarifications, and any kind of automatic dataset selection. The cursor is per-dataset only — there is no "split into train/eval" mode.
Reads use a 60s in-memory map. Writes invalidate the map immediately on the writer's
Fly machine; other machines pick up changes on next cache miss. So a setting saved at
T+0 takes effect on the writer's debates immediately, on other machines
within ≤60s. This was the right trade-off for two-person testing — no Redis dependency,
negligible read cost.
Challenger v2 is a static multi-agent debate system (Approver → Rejector → Judge) that produces explainable credit decisions — but the intelligence lives entirely in the LLM's pretraining. There is no regulatory grounding, no memory of prior decisions, no way to improve from feedback, and no audit trail a European bank regulator would accept.
v3 keeps the debate harness and wraps it in three new layers: a hierarchical knowledge store (EU → BG → practice), a grounded reasoning layer where every argument carries citations and a self-reported confidence, and a self-improvement layer where the agent logs what it did not know, humans resolve gaps, and the system is version-pinned so every historical decision can be reproduced.
Not "smarter credit decisions." That market is commoditising. Regulator-ready AI audit trails for credit origination in the EU. Four-eyes (Maker/Checker) aligned with EBA and the EU AI Act high-risk-system obligations. The moat is the BG+CEE practice corpus, the immutable citation trail, and the gap log.
Every decision is pinned to a version_manifest tuple (model, prompt_set, knowledge_bundle, guardrail_set). Every detected unknown emits a KNOWLEDGE_GAP event that a human can resolve in the UI.

| Layer | Responsibility | Owns |
|---|---|---|
| Knowledge | What the agent knows | Ingestion · vector DB · retriever |
| Reasoning | How it decides | Maker · Checker · Judge · guardrails · gap detector |
| Self-Improvement | How it gets better | Gap queue · feedback store · evals · version manifest |
| Platform | Delivery | Fastify API · SSE · Next.js UI · auth |
Chunking is section-aware (split on §, Article, Paragraph) — never fixed-window. The corpus spans English and Bulgarian, so use bge-m3 embeddings which handle both; avoid per-language pipelines.
Every ingestion batch produces a knowledge_bundle row with a content hash.
Decisions reference bundle_id. Bundles are append-only; a "new version" of
EBA guidelines creates a new bundle, never mutates the old one.
knowledge_bundles
├── id UUID, primary key
├── label VARCHAR — e.g. "kb-2026-04-20-eu+bg"
├── content_hash VARCHAR — SHA256 of sorted chunk_ids
├── source_manifest JSONB — [{source_id, version, effective_from, sha256}]
├── chunk_count INTEGER
├── created_at TIMESTAMPTZ
└── created_by VARCHAR
knowledge_chunks
├── id UUID, primary key
├── bundle_id UUID → knowledge_bundles.id
├── source_id VARCHAR — e.g. "EBA-GL-2020-06"
├── source_url TEXT
├── tier SMALLINT — 1=EU, 2=national, 3=practice, 4=literature
├── jurisdiction VARCHAR — "EU" | "BG" | "DE" | ...
├── version VARCHAR — "2020/06"
├── effective_from DATE
├── section VARCHAR — "Article 4 §2"
├── breadcrumb TEXT — "Title II > Chapter 3 > Article 4 > §2"
├── language VARCHAR — "en" | "bg"
├── text TEXT
├── embedding VECTOR(1024) — pgvector
├── token_count INTEGER
└── chunk_hash VARCHAR — SHA256 of text
INDEX ivfflat (embedding vector_cosine_ops)
INDEX (bundle_id, tier, jurisdiction)
Dense (pgvector cosine) + sparse (tsvector BM25-style over text).
Regulatory language is keyword-heavy (PD, LGD, EAD, Article numbers) — embeddings alone
fumble these.
Retrieval merges tier 1 + tier 2 + tier 3 with configurable weights. BG-specific queries bias tier 2; fallback to tier 1 when tier 2 returns nothing above threshold.
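A sketch of the weighted fusion — the alpha and tier weights are placeholders, not the shipped defaults:

// Fuse dense (cosine) and sparse (BM25-style) scores, biased by tier.
type Scored = { chunkId: string; tier: 1 | 2 | 3 | 4; dense: number; sparse: number };

const TIER_WEIGHT: Record<number, number> = { 1: 1.0, 2: 1.1, 3: 0.9, 4: 0.7 };

export function fuse(candidates: Scored[], alpha = 0.7): Array<Scored & { score: number }> {
  return candidates
    .map((c) => ({ ...c, score: (alpha * c.dense + (1 - alpha) * c.sparse) * TIER_WEIGHT[c.tier] }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 20); // top-20 feed the cross-encoder reranker
}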
Top-20 → cross-encoder reranker → top-5. Latency cost (~200ms) is worth it for
compliance. Local dev: bge-reranker-v2-m3 via HuggingFace inference.
AWS: SageMaker endpoint or Bedrock if available.
Retriever returns {passages[], max_similarity, tier_coverage, gaps[]}.
gaps[] = jurisdictions or tiers the query expected but didn't
find above threshold — fed directly into the gap detector.
// packages/shared/src/retrieval.ts
import type { FactorName } from "./types"; // shared types module

export type RetrievalQuery = {
text: string;
factor: FactorName; // "Credit Score" | ...
jurisdiction: "EU" | "BG"; // application's jurisdiction
tiers?: Array<1 | 2 | 3 | 4>; // default: [1,2,3,4]
topK?: number; // default: 5 (after reranking)
minSimilarity?: number; // default: 0.72
};
export type RetrievedPassage = {
chunkId: string;
sourceId: string;
sourceUrl: string;
breadcrumb: string;
tier: 1 | 2 | 3 | 4;
jurisdiction: string;
version: string;
similarity: number;
rerankScore: number;
text: string;
};
export type RetrievalResult = {
passages: RetrievedPassage[];
maxSimilarity: number;
tierCoverage: Record<string, number>; // tier → count above threshold
gaps: Array<{
expectedTier: 1 | 2 | 3 | 4;
expectedJurisdiction: string;
reason: "no_match_above_threshold" | "jurisdiction_missing";
}>;
};
| Area | v2 | v3 |
|---|---|---|
| Agent inputs | application, factor | application, factor, retrieved_passages |
| Agent outputs | Free-text argument | {claim, citation_ids[], confidence, known} |
| Rejector role | Counter-argument only | Keep Rejector for UX explainability; add Checker for compliance validation |
| Knowledge | LLM pretraining only | Retrieval-grounded; every claim cites a chunk |
| Gap handling | Silent fallback to LLM priors | Emits KNOWLEDGE_GAP event; surfaces in Analyst UI |
| Version pinning | Implicit (code commit) | Explicit version_manifest tuple per session |
// packages/shared/src/agents-v3.ts
import { z } from "zod";

export const MakerOutputSchema = z.object({
claim: z.string(),
citation_ids: z.array(z.string()).min(1), // must cite ≥1 passage
confidence: z.number().min(0).max(1),
known: z.boolean(),
missing_knowledge: z.string().optional(), // populated when known=false
});
export const CheckerOutputSchema = z.object({
status: z.enum(["OK", "CONFLICT", "INSUFFICIENT_EVIDENCE"]),
conflicting_citation_id: z.string().optional(),
conflict_explanation: z.string().optional(),
});
export const JudgeFactorOutputSchema = z.object({
verdict: z.enum(["positive", "negative", "neutral"]),
summary: z.string(),
cited_passages: z.array(z.string()),
maker_confidence: z.number(),
checker_status: z.enum(["OK", "CONFLICT", "INSUFFICIENT_EVIDENCE"]),
});
SYSTEM
You are a credit origination analyst for a European bank.
You MUST:
- Base your claim on the RETRIEVED PASSAGES only.
- Cite every claim with passage IDs from the list below.
- Set `known: false` and fill `missing_knowledge` when no passage supports
the claim at the given jurisdiction and tier.
- Never invent regulation names, article numbers, or dates.
HUMAN
FACTOR: {factor_name}
JURISDICTION: {jurisdiction}
APPLICATION: {application_json}
RETRIEVED PASSAGES:
{passages_with_ids}
FORMAT:
{format_instructions}
Gap-detector triggers (a sketch follows the dedup note below):
- Retrieval returns maxSimilarity < 0.72 for the factor query.
- Maker sets known: false.
- Maker confidence < 0.6.
- Checker returns INSUFFICIENT_EVIDENCE.
Each trigger creates one knowledge_gaps row. Duplicates on
(factor, missing_topic_hash) are collapsed — we only alert operators
once per novel gap per 24h window.
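A sketch of the trigger evaluation, mirroring the trigger_signal enum — shapes simplified:

// Evaluate all gap triggers for one factor; each hit becomes a
// knowledge_gaps row (deduped per 24h window).
type TriggerSignal = "low_retrieval" | "maker_unknown" | "low_confidence" | "checker_insufficient";

export function gapTriggers(
  maxSimilarity: number,
  maker: { known: boolean; confidence: number },
  checkerStatus: "OK" | "CONFLICT" | "INSUFFICIENT_EVIDENCE",
): TriggerSignal[] {
  const out: TriggerSignal[] = [];
  if (maxSimilarity < 0.72) out.push("low_retrieval");
  if (!maker.known) out.push("maker_unknown");
  if (maker.confidence < 0.6) out.push("low_confidence");
  if (checkerStatus === "INSUFFICIENT_EVIDENCE") out.push("checker_insufficient");
  return out;
}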
| Type | Trigger | Storage | Reuse |
|---|---|---|---|
| Correction | Human overrides decision | feedback_corrections | Few-shot retrieval at inference |
| Knowledge | Human answers a gap | knowledge_chunks (new row) + feedback_knowledge audit | Live in future retrievals |
| Calibration | Realised loan outcome known | feedback_outcomes | Eval harness accuracy metric |
version_manifests
├── id UUID, primary key
├── label VARCHAR — "v3.4.1"
├── model VARCHAR — "claude-sonnet-4-6"
├── prompt_set_id VARCHAR — "ps-2026-04-17"
├── knowledge_bundle UUID → knowledge_bundles.id
├── guardrail_set_id VARCHAR — "gr-v2"
├── eval_score NUMERIC — 0.87
├── eval_run_id UUID
├── status ENUM: DRAFT | ACTIVE | DEPRECATED
├── released_at TIMESTAMPTZ
└── released_by VARCHAR
-- debate_sessions gains a column
ALTER TABLE debate_sessions
ADD COLUMN version_manifest_id UUID NOT NULL
REFERENCES version_manifests(id);
Exactly one manifest is ACTIVE at a time.
Rollback is a single UPDATE away. Every historical decision is reproducible because
the bundle, prompt set, and guardrails are all content-addressed.
- npm run eval -- --version v3.4 --against v3.3 → emits a diff report (decision flips, citation precision, gap-rate delta, latency p95 / p50).
- The golden set includes ambiguous cases where the correct answer is to REVIEW, not pick a side. Tracked separately.
A first-pass eval harness is in the codebase: npm run eval:bondora -- --csv path/to/LoanData_Bondora.csv --limit 50. It reads the public Bondora P2P
lending dataset, maps each historical loan to a CreditApplication,
runs it through runDebate(), and scores the agent's APPROVE / REJECT /
REVIEW against the realised outcome (Repaid vs Default).
Output is a confusion-matrix scoreboard: accuracy, false-approval rate, expected loss per €1000 lent, and average debate latency.
Mapping is documented inline in the script: rating letter → synthetic FICO, employment status code → enum, missed-payments proxy from NewCreditCustomer + PreviousEarlyRepaymentsCountBeforeLoan. Rows
with no terminal outcome (Status=Current) or missing required fields are skipped.
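A sketch of the row mapping with illustrative anchor values — the script's exact numbers and column handling may differ:

// Map one Bondora CSV row to the fields the application needs; null = skip.
const RATING_TO_FICO: Record<string, number> = {
  AA: 780, A: 740, B: 700, C: 660, D: 620, E: 580, F: 540, HR: 500,
};

export function mapRow(row: Record<string, string>): { creditScore: number; missedPayments: number } | null {
  if (row.Status === "Current") return null; // no terminal outcome — skip
  const creditScore = RATING_TO_FICO[row.Rating];
  if (!creditScore) return null; // missing required field — skip
  // Missed-payments proxy from customer-history columns.
  const missedPayments =
    row.NewCreditCustomer === "True" ? 0 : Number(row.PreviousEarlyRepaymentsCountBeforeLoan ?? 0);
  return { creditScore, missedPayments };
}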
Eval calls bypass the clarification flow. The script
passes { noClarifications: true } to runDebate(),
which forces maxClarifications = 0 for that call only —
the cap fires immediately on any clarification request, no
awaitClarification pause, no 5-minute timeout per row.
This is the only honest way to measure the system's autonomous
decision-making against ground truth: feeding placeholder answers
would measure how the agents react to fake operator input, not
what they decide on their own. Production debates running
concurrently with the eval are unaffected.
Domain mismatch caveat: Bondora is unsecured P2P consumer lending in EE/ES/FI, mostly pre-2020. Findings are a useful sanity check, not a precise predictor of EU regulated bank lending performance. The script is designed for "is the direction right" answers, not "is the FP rate exactly 4.7%" answers.
knowledge_gaps
├── id UUID, primary key
├── session_id UUID → debate_sessions.id
├── factor VARCHAR
├── trigger_signal ENUM: low_retrieval | maker_unknown | low_confidence |
│ checker_insufficient | jurisdiction_missing
├── missing_topic TEXT — agent's own description
├── suggested_sources TEXT[] — agent-proposed resources
├── tier_needed SMALLINT
├── jurisdiction_needed VARCHAR
├── topic_hash VARCHAR — for deduplication
├── status ENUM: OPEN | ANSWERED | INGESTED | DISMISSED
├── resolution_note TEXT, nullable
├── resolved_by VARCHAR, nullable
├── resolved_at TIMESTAMPTZ, nullable
├── resulting_chunk_id UUID, nullable → knowledge_chunks.id
├── created_at TIMESTAMPTZ
UNIQUE (topic_hash, status) WHERE status = 'OPEN'
feedback_corrections
├── id UUID, primary key
├── session_id UUID → debate_sessions.id
├── field VARCHAR — "final_decision" | "factor.verdict.Credit Score" | ...
├── agent_value JSONB
├── human_value JSONB
├── reason TEXT
├── embedding VECTOR(1024) — embedded(application + reason) for few-shot retrieval
├── created_by VARCHAR
├── created_at TIMESTAMPTZ
| Method | Path | Purpose |
|---|---|---|
| POST | /decision | Create session, start grounded debate. Returns sessionId + version_manifest_id. |
| GET | /decision/:id/stream | SSE — agent tokens, factor turns, gap events, final decision. |
| GET | /decision/:id | Applicant view (filtered). |
| GET | /analyst/decision/:id | Analyst view (full, incl. citations + gaps). |
| POST | /knowledge/ingest | Upload doc → enqueues ingestion job → returns bundle_id when done. |
| GET | /knowledge/bundles | List bundles with metadata. |
| GET | /knowledge/search | Debug retrieval — returns passages for a query. |
| GET | /gaps | List open knowledge gaps (operator queue). |
| POST | /gaps/:id/resolve | Resolve gap — accepts uploaded doc OR text note OR dismiss reason. |
| POST | /feedback/correction | Submit human override for a session. |
| POST | /feedback/outcome | Record realised loan outcome. |
| GET | /versions | List version manifests. |
| POST | /versions/:id/activate | Promote manifest to ACTIVE (atomic). |
| POST | /eval/run | Run golden set against a version; returns eval_run_id. |
| GET | /eval/:id/diff/:other | Diff two eval runs. |
| Event | When | Payload |
|---|---|---|
| retrieval_done | After retriever returns for a factor | factor, passage_count, max_similarity, tier_coverage |
| maker_done | Maker structured output parsed | factor, claim, citation_ids, confidence, known |
| checker_done | Checker completes | factor, status, conflict_citation_id? |
| gap_detected | Gap detector trips | gap_id, factor, trigger_signal, missing_topic |
| final_decision | Judge + guardrails complete | Full payload incl. version_manifest_id |
One codebase, two deployment targets. Adapters behind interfaces let us swap infra without touching business logic. local = laptop dev, aws = production.
| Component | Local | AWS | Swap via |
|---|---|---|---|
| Relational DB | Postgres in Docker + pgvector | RDS Postgres 16 + pgvector extension | DATABASE_URL |
| Vector store | pgvector (same DB) | pgvector on RDS · or OpenSearch Serverless at scale | Adapter interface |
| Object storage (raw docs) | Local FS ./data/raw/ | S3 bucket · SSE-KMS | @aws-sdk/client-s3 behind BlobStore |
| LLM | Anthropic API direct | Bedrock anthropic.claude-sonnet-4 | LangChain provider env var |
| Embeddings | OpenAI text-embedding-3-large · or local bge-m3 | Bedrock amazon.titan-embed-text-v2 OR SageMaker bge-m3 | Embedder interface |
| Reranker | HuggingFace inference (bge-reranker-v2-m3) | SageMaker endpoint | Reranker interface |
| Ingestion workers | Node worker in same process (dev only) | SQS → Lambda OR ECS Fargate task | Queue adapter |
| API server | npm run dev on :3001 | ECS Fargate behind ALB · or Lambda + API Gateway for lower traffic | None — 12-factor app |
| Frontend | Next.js dev on :3000 | Amplify Hosting OR Vercel (prefer Vercel — faster iteration) | None |
| Auth | Dev token in header | Cognito → JWT at ALB | Fastify auth plugin |
| Secrets | .env | AWS Secrets Manager · loaded at startup | Config loader |
| Observability | pino → stdout | CloudWatch Logs · OpenTelemetry traces → X-Ray | pino transport |
| Eval runs | CLI in repo | Scheduled ECS task · results to S3 + Postgres | None |
# docker-compose.dev.yml — run: docker compose up
services:
postgres:
image: pgvector/pgvector:pg16
environment:
POSTGRES_PASSWORD: postgres
POSTGRES_DB: challenger_dev
ports: ["5432:5432"]
volumes: ["pgdata:/var/lib/postgresql/data"]
minio: # S3-compatible local storage
image: minio/minio
command: server /data --console-address ":9001"
ports: ["9000:9000", "9001:9001"]
environment:
MINIO_ROOT_USER: dev
MINIO_ROOT_PASSWORD: devpassword
# Reranker served via HuggingFace's TEI for local parity with SageMaker
reranker:
image: ghcr.io/huggingface/text-embeddings-inference:latest
command: ["--model-id", "BAAI/bge-reranker-v2-m3"]
ports: ["8080:80"]
volumes: { pgdata: {} }
Embeddings default to EMBED_PROVIDER=voyage +
voyage-3-lite (512 dims). Tunable runtime parameters
(RETRIEVAL_*, DEFAULT_JURISDICTION,
CLARIFICATION_*, etc.) have moved out of env vars and into the
runtime_settings table — change them from the admin Settings tab without
a redeploy. Env vars below are the static infrastructure config that genuinely belongs
at boot time.
# Shared
NODE_ENV=development
LLM_PROVIDER=anthropic # anthropic | openai
LLM_MODEL=claude-haiku-4-5-20251001
LLM_API_KEY=sk-ant-...
# Embeddings — Voyage is Anthropic-recommended; voyage-3-lite has a
# generous free tier and produces 512-dim vectors.
EMBED_PROVIDER=voyage # voyage | openai | bedrock
EMBED_MODEL=voyage-3-lite
EMBED_API_KEY=pa-...
EMBED_DIMENSIONS=512
# Database — local Postgres + pgvector for dev, Neon for Fly deploy.
DATABASE_URL=postgresql://postgres:postgres@localhost:5433/debate_db
# Static fallbacks for the runtime_settings rows (used only if a row
# is missing from the table — normally the table wins).
RETRIEVAL_ENABLED=true
RETRIEVAL_TOP_K=5
RETRIEVAL_MIN_SIMILARITY=0.55
DEFAULT_JURISDICTION=EU
# Admin panel
ADMIN_PASSWORD_HASH=$2b$12$... # bcrypt; generate with: npm run admin:hash -- 'pwd'
COOKIE_SECRET=... # 32+ random hex chars; openssl rand -hex 32
# CORS — comma-separated allowlist. Wildcards in dev only.
CORS_ORIGIN=https://your-frontend.fly.dev
# Observability
LOG_SQL=1 # SQL preview + timing in stdout
LOG_COLOR=1 # ANSI category tags in stdout
FORCE_COLOR=1 # force color when piped (e.g. fly logs)
# AWS-only (when we eventually flip from Fly to AWS)
AWS_REGION=eu-central-1
BEDROCK_LLM_MODEL=anthropic.claude-sonnet-4-v1:0
BEDROCK_EMBED_MODEL=amazon.titan-embed-text-v2:0
COGNITO_USER_POOL_ID=...
COGNITO_CLIENT_ID=...
S3_BUCKET_RAW=challenger-raw-docs
Region: eu-central-1 (Frankfurt) for EU data residency. Bedrock has Claude
available there. RDS and ECS are native. This matters for both compliance positioning
and EBA outsourcing guidelines.
challenger/
├── backend/
│ ├── src/
│ │ ├── agents/
│ │ │ ├── maker.ts # NEW
│ │ │ ├── checker.ts # NEW (replaces rejector in compliance mode)
│ │ │ ├── rejector.ts # kept for UX explainability
│ │ │ ├── judge.ts
│ │ │ ├── prompts/
│ │ │ │ ├── maker.ts
│ │ │ │ ├── checker.ts
│ │ │ │ ├── judge.ts
│ │ │ │ └── index.ts # exports prompt_set_id
│ │ │ └── llmClient.ts # provider switch (Anthropic | Bedrock)
│ │ ├── knowledge/
│ │ │ ├── ingest/
│ │ │ │ ├── loader.ts # PDF | HTML
│ │ │ │ ├── chunker.ts # section-aware
│ │ │ │ ├── embedder.ts # provider switch
│ │ │ │ └── worker.ts # SQS consumer in prod
│ │ │ ├── retriever.ts # hybrid + rerank
│ │ │ ├── bundles.ts
│ │ │ └── index.ts
│ │ ├── engines/
│ │ │ ├── guardrail.ts
│ │ │ ├── fallback.ts
│ │ │ └── gapDetector.ts # NEW
│ │ ├── orchestrator/
│ │ │ ├── index.ts # v3 orchestrator
│ │ │ └── factorLoop.ts
│ │ ├── feedback/
│ │ │ ├── corrections.ts
│ │ │ ├── outcomes.ts
│ │ │ └── fewShot.ts # retrieves top-k corrections
│ │ ├── versioning/
│ │ │ ├── manifest.ts
│ │ │ └── activate.ts
│ │ ├── eval/
│ │ │ ├── goldenSet/ # JSON cases
│ │ │ ├── runner.ts
│ │ │ └── diff.ts # CLI-callable
│ │ ├── platform/
│ │ │ ├── blobStore.ts # local FS | S3 adapter
│ │ │ ├── queue.ts # in-proc | SQS adapter
│ │ │ ├── auth.ts
│ │ │ ├── config.ts
│ │ │ └── logger.ts
│ │ ├── routes/
│ │ │ ├── decision.ts
│ │ │ ├── knowledge.ts
│ │ │ ├── gaps.ts
│ │ │ ├── feedback.ts
│ │ │ ├── versions.ts
│ │ │ └── eval.ts
│ │ ├── db/
│ │ │ ├── migrations/
│ │ │ │ ├── 001_init.sql
│ │ │ │ ├── 010_knowledge.sql
│ │ │ │ ├── 011_gaps.sql
│ │ │ │ ├── 012_feedback.sql
│ │ │ │ └── 013_versioning.sql
│ │ │ ├── pool.ts
│ │ │ ├── sessions.ts
│ │ │ ├── chunks.ts
│ │ │ ├── gaps.ts
│ │ │ └── versions.ts
│ │ ├── buffer.ts # SSE buffer (unchanged)
│ │ └── index.ts
│ └── package.json
├── frontend/
│ └── src/
│ ├── app/
│ │ ├── decision/[id]/
│ │ ├── analyst/decision/[id]/
│ │ ├── operator/
│ │ │ ├── gaps/ # NEW — gap queue
│ │ │ ├── knowledge/ # NEW — ingestion + bundles
│ │ │ └── versions/ # NEW — manifest admin
│ ├── components/
│ │ ├── FactorDebate/
│ │ ├── CitationChip/ # NEW
│ │ ├── GapBanner/ # NEW
│ │ ├── ResolveGapDialog/ # NEW
│ │ └── VersionBadge/ # NEW
│ └── hooks/
│ ├── useDebateStream.ts
│ └── useGapStream.ts # NEW
├── packages/
│ └── shared/
│ └── src/
│ ├── types.ts
│ ├── schemas/
│ │ ├── maker.ts
│ │ ├── checker.ts
│ │ ├── judge.ts
│ │ ├── gap.ts
│ │ └── feedback.ts
│ └── retrieval.ts
├── infra/
│ ├── docker-compose.dev.yml
│ └── aws/
│ ├── cdk/ # CDK app (TS) — one stack per concern
│ │ ├── bin/challenger.ts
│ │ ├── lib/
│ │ │ ├── network-stack.ts # VPC + subnets
│ │ │ ├── data-stack.ts # RDS · S3 · SQS
│ │ │ ├── compute-stack.ts # ECS services · ALB
│ │ │ ├── ai-stack.ts # Bedrock IAM · SageMaker endpoints
│ │ │ └── observability-stack.ts
│ │ └── cdk.json
│ └── README.md
├── docs/
│ ├── Architecture_Recommendation_v3_Self_Improving_Agent.md
│ ├── Core_Flows_—_Factor-Based_Debate_(v2).md
│ └── Tech_Plan_—_Multi-Agent_Credit_Decision_System.md
├── challenger-spec-v3.html # this file
└── README.md
| Phase | Goal | Deliverables | Weeks |
|---|---|---|---|
| Phase 0 | Knowledge foundation | Ingest 3 top-tier docs (EBA/GL/2020/06, ECB Guide, OeNB) · pgvector table · hybrid retriever · /knowledge/ingest + /knowledge/search endpoints · Maker/Rejector prompts accept retrieved_passages. No UI changes. | 1–2 |
| Phase 1 | Grounded debate + gap detection | New agent output schemas with citation_ids · Checker agent · gap detector · knowledge_gaps table · SSE gap_detected event · Analyst UI: citation chips + gap banner. | 3–4 |
| Phase 2 | Feedback capture — first shippable demo | Operator UI: gap queue + resolve dialog · correction dialog on Analyst view · few-shot retriever injects corrections into Maker prompt · single-tenant auth (Cognito dev pool). Show to 3 compliance officers. | 5–6 |
| Phase 3 | Eval + versioning | 200-case golden set · version_manifests table · eval runner CLI · diff report · version_manifest_id persisted on every session · version badge in UI. | 7–10 |
| Phase 4 | Self-improvement automation | Scheduled re-run of failing eval cases after gap resolution · "what improved" weekly report · prompt-evolution helper · jurisdiction-aware retrieval routing · multi-bank tenant isolation (if pilot demands). | 11–14 |
| Deferred | Fine-tuning · multi-tenancy · on-prem · classical-ML scoring model | Revisit after 3 paying pilots and ≥5k labelled corrections. Fine-tuning via Bedrock model customisation only when the business case is measurable. | — |
Checkpoints:
- Phase 0 exit: pgvector extension running locally via docker-compose; SELECT count(*) FROM knowledge_chunks ≥ 500; GET /knowledge/search?q=missed+payments&jurisdiction=EU returns 5 passages in < 300ms.
- Phase 2 gate: 3 compliance officers test it. Each runs 10 cases. We capture: did they trust the citations? Did the gap queue capture real unknowns? Did they use the correction UI? Without their validation, Phase 3 is premature.
Operational guardrails:
- Log every LLM call as {session_id, agent, version_manifest_id, model, prompt_hash, input_tokens, output_tokens, latency_ms, cost_usd}.
- Propagate a trace_id into SSE so the UI can link to the backend trace.
- Pseudonymise applicant_id; redact loan amount and income ranges by default in non-prod logs.
- Any change to prompts/* or agents/* → CI runs a 20-case smoke eval against the ACTIVE manifest.
- Keep all inference and storage in-region (eu-central-1) — no data egress to US endpoints.
- Record operator actions in an audit_log table.

| Decision | Choice | Trade-off accepted |
|---|---|---|
| Vector store | pgvector in same RDS | Scales to ~10M chunks comfortably; swap to OpenSearch later. Saves one service. |
| LLM provider | Anthropic dev → Bedrock prod | Region-locked, compliance-friendly. Costs ~15% more; acceptable for regulated buyers. |
| Debate vs single-pass | Debate kept, with cheap-path escape | High-confidence cases (Maker confidence ≥ 0.9 AND Checker OK) skip Rejector. Cost control. |
| Fine-tuning | Deferred indefinitely | RAG + few-shot corrections capture ~90% of value for ~10% of cost. |
| Rejector vs Checker | Both, different roles | Rejector = UX explainability. Checker = compliance validation. Judge reads both. |
| Multi-tenancy | Single-tenant in MVP | Postpone row-level security + per-bank knowledge isolation until paying pilot demands it. |
| Ingestion worker | In-process locally, SQS+ECS in prod | Queue adapter keeps code identical. |
| Auth | Cognito | Standard AWS pathway; OIDC for future bank SSO. |
| Frontend hosting | Vercel (Amplify as alternative) | Faster iteration than Amplify. Move to Amplify only if data-residency becomes blocking. |
| Classical ML scoring model | Out of scope | If ever needed, Checker calls it as a tool — do not entangle with agent pipeline. |
1. docker compose -f infra/docker-compose.dev.yml up — Postgres+pgvector, MinIO, TEI reranker.
2. Migration 010_knowledge.sql creates knowledge_bundles and knowledge_chunks with ivfflat index.
3. backend/src/knowledge/ingest/ — PDF loader (via pdf-parse) · section-aware chunker (regex on Article \d+ / §, sketched after this list) · embedder (OpenAI for dev) · insert chunks with bundle_id. Target: process EBA/GL/2020/06 end-to-end.
4. backend/src/knowledge/retriever.ts — hybrid (dense + tsvector) · tier-weighted fusion · reranker POST to http://localhost:8080. Expose as GET /knowledge/search?q=&jurisdiction=&factor= for debug.
5. backend/src/agents/prompts.ts — Approver and Rejector templates accept retrieved_passages. Orchestrator calls retriever before each factor.
6. Ship, smoke-test, verify existing eval still passes.
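A minimal sketch of the section-aware split from step 3 — the regex and shapes are illustrative, not the shipped chunker:

// Split regulatory text on section boundaries, never fixed windows.
const SECTION_BOUNDARY = /(?=^(?:Article \d+|§\s*\d+)\b)/m;

export function chunkBySection(text: string): Array<{ section: string; text: string }> {
  return text
    .split(SECTION_BOUNDARY)
    .map((t) => t.trim())
    .filter(Boolean)
    .map((t) => ({
      // First line carries the heading, e.g. "Article 4" — kept as metadata.
      section: t.split("\n", 1)[0].slice(0, 80),
      text: t,
    }));
}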
Defer the new agent output schema (citation_ids and known) — land it as a separate PR after Phase 0 is green.