Admin operations — health, recovery, and rails

Platform admin operations live under /admin. This topic is the operator playbook for keeping the rails healthy and recovering from the failure modes we've seen in prod.

The cron map

30+ pg_cron jobs drive the platform. Grouped by purpose:

Daily-feed rail — discovery, ingest, finalize, row reaper, staged recovery, manifest reconciler, verification watchdog.
Stream consumer — every 2 min, drains the Companies House (CH) stream API into evt_stream_events.
Profile refresh — every minute, claims a batch of due-for-refresh companies and pulls fresh /company/{n} data.
Notifications — events, deliveries, sequences, escalations, triggers, digests, plus the CH-specific notification bridge and email sender.
Lifecycle — bounded refresh, scores recompute, signals generation, signal expiry.
CH bulk — monthly bulk-seed (manual trigger), bulk people sync, daily-feed digest.

Reading rail health

Two health tables sit behind the dashboards:

ingest_run_health_checks — every 2 min, snapshot of every running daily-feed run with current processed-rows, rows-delta-since-last-check, and stall-seconds.
ingest_run_health_alerts — open alerts. Each row has severity (slow/stalled/severe/warning/critical), opened_at, last_seen_at, and resolution. Cleared alerts get a resolution string (completed = run finished, progressed / recovered = stall ended, auto: … = a rule-based watchdog cleared it).
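When triaging a batch of alerts it can help to bucket them by how they were cleared. The sketch below maps the resolution strings listed above to a coarse outcome. The outcome labels are hypothetical, and it assumes an alert that is still open has a NULL or empty resolution, which this document does not state.

```python
from typing import Optional


def classify_resolution(resolution: Optional[str]) -> str:
    """Map an alert's resolution string to a coarse outcome label.

    Assumption: an open alert has no resolution yet (None or empty).
    The returned labels are illustrative, not values stored in the table.
    """
    if resolution is None or resolution.strip() == "":
        return "open"
    value = resolution.strip()
    if value == "completed":
        return "run_finished"
    if value in ("progressed", "recovered"):
        return "stall_ended"
    if value.startswith("auto:"):
        return "watchdog_cleared"
    # Anything else is a string this playbook does not document.
    return "unknown"
```

Anything that lands in "unknown" is worth a manual look, since it means a resolution value not covered by this playbook.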

Recovery actions

For days that won't auto-finalise:

From the day card, click Run again — re-dispatches the ingest. Most recoverable issues clear here.
If the day is set-not-equal: open the residual list, decide if the missing companies are recoverable (re-fetch) or permanent gaps (mark in ch_known_gaps).
If a run is stuck in verification_pending with set_equal=false for more than 60 minutes and the data is in place, manually finalise via fn_finalize_day_verification.
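The manual-finalise rule can be written down as a single predicate, which is handy as a pre-flight check before calling fn_finalize_day_verification. This is a minimal sketch: the parameter names (pending_since, data_in_place) are assumptions, and "data is in place" remains an operator judgement passed in as a flag.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Threshold from the playbook: stuck for more than 60 minutes.
VERIFICATION_GRACE = timedelta(minutes=60)


def eligible_for_manual_finalise(
    status: str,
    set_equal: bool,
    pending_since: datetime,
    data_in_place: bool,
    now: Optional[datetime] = None,
) -> bool:
    """Return True only when every condition for manual finalisation holds."""
    current = now if now is not None else datetime.now(timezone.utc)
    return bool(
        status == "verification_pending"
        and set_equal is False
        and data_in_place
        and current - pending_since > VERIFICATION_GRACE
    )
```

A run that has been pending for exactly 60 minutes is not yet eligible; the comparison is strict to match "more than 60 minutes".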

Bulk seed

The monthly Companies House BasicCompanyData snapshot is ~5M companies. The seeder runs from a Railway worker. Key gotchas: must connect via the IPv4 session pooler (port 5432), and the merge runs as a batched loop (one batch per RPC call) to avoid OOM.
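The batched merge loop can be sketched as a driver that issues one RPC per batch and stops when a call reports nothing left to merge. Everything named here is hypothetical: merge_batch stands in for whatever RPC the seeder actually calls, and the batch size and safety cap are illustrative values, not the production settings.

```python
from typing import Callable


def run_batched_merge(
    merge_batch: Callable[[int], int],
    batch_size: int = 10_000,
    max_batches: int = 100_000,
) -> int:
    """Call merge_batch(batch_size) until it reports zero rows merged.

    merge_batch is a hypothetical callable wrapping one RPC call; it must
    return the number of rows it merged. Returns the total rows merged.
    """
    total = 0
    for _ in range(max_batches):
        merged = merge_batch(batch_size)
        if merged == 0:
            return total
        total += merged
    # Safety cap so a misbehaving RPC cannot loop forever.
    raise RuntimeError("merge did not finish within max_batches")


if __name__ == "__main__":
    # In-memory stand-in for the staging table, for demonstration only.
    remaining = [25]

    def fake_merge(limit: int) -> int:
        taken = min(limit, remaining[0])
        remaining[0] -= taken
        return taken

    print(run_batched_merge(fake_merge, batch_size=10))
```

Keeping each batch in its own call bounds the memory any single call needs, which is the point of the loop.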

Profile refresh — performance notes

The claim function uses a partial index on ch_companies(next_refresh_due_at) WHERE next_refresh_due_at IS NOT NULL. Design throughput is 30/min via 500ms pacing. Watch the CH 429 rate — if it climbs above ~5%, drop the limit by a third.
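As a worked example of the 429 rule: with a limit of 30, dropping by a third gives 20. The sketch below assumes "the limit" means the per-minute claim limit and that the 429 rate is throttled responses divided by total requests over the observation window; both function names are hypothetical.

```python
def rate_429(total_requests: int, throttled: int) -> float:
    """Fraction of requests that came back as HTTP 429."""
    if total_requests == 0:
        return 0.0
    return throttled / total_requests


def next_claim_limit(
    current_limit: int,
    total_requests: int,
    throttled: int,
    threshold: float = 0.05,
) -> int:
    """Drop the limit by a third when the 429 rate exceeds the threshold."""
    if rate_429(total_requests, throttled) > threshold:
        # Integer arithmetic; never go below 1 so refresh keeps moving.
        return max(1, current_limit - current_limit // 3)
    return current_limit
```

For instance, 6 throttled responses out of 100 requests is a 6% rate, so a limit of 30 becomes 20; at exactly 5% the limit is left unchanged.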