InsightBase ingests the Home Office Register of Licensed Sponsors monthly and matches each licence to a canonical company. About 73% match cleanly via Phase-1 exact matching; the remainder go through Phase-2 fuzzy matching with operator review.
Phase 1 — exact match
On ingest, every licence row is canonicalised: company name is normalised through fn_normalize_company_name (lowercase, strip ltd|limited|plc|llp suffix, strip non-alphanumeric) and looked up against ch_companies.normalized_company_name. Matches set matched_company_id directly.
Phase 1 is fast (~1–2 min for 130k licences) and high-confidence. It currently matches about 96k of 131k licences (73%).
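The normalisation and lookup steps above can be sketched in Python as follows. This is an illustration only, not the actual fn_normalize_company_name. It assumes the suffix is stripped only when it trails the name, and that an optional full stop after it is also dropped. The helper names normalize_company_name and exact_match, and the dictionary standing in for ch_companies, are hypothetical.

```python
import re

# Legal-form suffixes named in the text. Matching them only at the end of the
# name (with an optional trailing full stop) is an assumption of this sketch.
_SUFFIX_RE = re.compile(r"\b(ltd|limited|plc|llp)\.?\s*$")
_NON_ALNUM_RE = re.compile(r"[^a-z0-9]")


def normalize_company_name(name):
    """Lowercase, drop one trailing legal suffix, then strip non-alphanumerics."""
    lowered = name.lower().strip()
    without_suffix = _SUFFIX_RE.sub("", lowered)
    return _NON_ALNUM_RE.sub("", without_suffix)


def exact_match(licence_names, companies_by_normalized_name):
    """Map each licence name to a company id (or None) by exact normalised key.

    companies_by_normalized_name is a hypothetical stand-in for a lookup on
    ch_companies.normalized_company_name.
    """
    return {
        name: companies_by_normalized_name.get(normalize_company_name(name))
        for name in licence_names
    }
```

Under these assumptions, "Acme Widgets Ltd." and "ACME WIDGETS LIMITED" reduce to the same key, so both resolve to the same company id. A name with no matching key comes back as None and falls through to Phase 2.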
Phase 2 — fuzzy match
Anything Phase 1 didn't catch becomes a candidate for Phase 2. A DuckDB worker computes trigram similarity against the canonical normalized name set and emits candidate pairs above a threshold (currently 0.85) into sponsor_licence_match_candidates.
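To make the thresholding concrete, here is a minimal pure-Python sketch of trigram similarity. The worker's exact similarity definition is not stated here, so the sketch assumes Jaccard similarity over padded character trigrams, in the style of PostgreSQL's pg_trgm. The function names are hypothetical, and the brute-force pairing is for illustration only. A real worker would need blocking or an index to avoid comparing every pair.

```python
def trigrams(text):
    """Return the set of character trigrams of text, padded pg_trgm-style."""
    if not text:
        return set()
    padded = "  " + text + " "  # two leading spaces, one trailing space
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity of the two trigram sets, in the range 0.0 to 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def candidate_pairs(unmatched_names, canonical, threshold=0.85):
    """Emit (licence_name, company_id, score) for pairs at or above threshold.

    unmatched_names: normalised names that Phase 1 did not match.
    canonical: dict of normalised company name -> company id (hypothetical shape).
    Brute force: compares every unmatched name with every canonical name.
    """
    pairs = []
    for name in unmatched_names:
        for canonical_name, company_id in canonical.items():
            score = trigram_similarity(name, canonical_name)
            if score >= threshold:
                pairs.append((name, company_id, score))
    return pairs
```

With a Jaccard-style measure, a threshold of 0.85 is strict: a single changed character in a short name can push the score below it. This fits the point that some typos and name changes are not covered.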
Operators review candidates and decide accept / reject / defer. The decision is recorded in sponsor_licence_match_candidates.status and, on accept, the parent licence's matched_company_id is set.
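The review step can be sketched as a single transaction that records the decision and, on accept only, updates the parent licence. The sketch uses an in-memory SQLite database from Python's standard library purely for illustration. The parent table name sponsor_licences, the columns id, licence_id and candidate_company_id, and the stored status values are assumptions, not the real schema.

```python
import sqlite3

# Hypothetical stored values for the three operator decisions.
STATUS_BY_DECISION = {"accept": "accepted", "reject": "rejected", "defer": "deferred"}


def create_demo_schema(conn):
    """Create a minimal, hypothetical version of the two tables involved."""
    conn.execute(
        "CREATE TABLE sponsor_licences (id INTEGER PRIMARY KEY, matched_company_id INTEGER)"
    )
    conn.execute(
        "CREATE TABLE sponsor_licence_match_candidates ("
        " id INTEGER PRIMARY KEY, licence_id INTEGER,"
        " candidate_company_id INTEGER, status TEXT)"
    )


def record_decision(conn, candidate_id, decision):
    """Record an operator decision; on accept, set the licence's matched_company_id."""
    if decision not in STATUS_BY_DECISION:
        raise ValueError(f"unknown decision: {decision!r}")
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE sponsor_licence_match_candidates SET status = ? WHERE id = ?",
            (STATUS_BY_DECISION[decision], candidate_id),
        )
        if cur.rowcount != 1:
            raise LookupError(f"no candidate with id {candidate_id}")
        if decision == "accept":
            conn.execute(
                "UPDATE sponsor_licences SET matched_company_id = ("
                " SELECT candidate_company_id FROM sponsor_licence_match_candidates"
                " WHERE id = ?)"
                " WHERE id = ("
                " SELECT licence_id FROM sponsor_licence_match_candidates WHERE id = ?)",
                (candidate_id, candidate_id),
            )
```

Keeping both updates in one transaction means a candidate cannot be marked accepted while its licence is left unmatched. Reject and defer touch only the candidate row.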
Why some licences never match
- Holding-company structures (licence held by parent, but data lists trading subsidiary)
- Licences with typos or name changes the fuzzy threshold doesn't cover
- Recently-incorporated companies absent from older snapshots
- Sole traders, partnerships, or charities outside the CH corpus
Filter use cases
Once matched, sponsor data is available as filters and joinable in the Intelligence Studio:
- Has sponsor licence — boolean
- Sponsor licence rating — A, A-rated (premium), B
- Sponsor licence type — Worker, Temporary worker, etc.
- Route — Skilled worker, Health and care worker, etc.
- Sponsor licence expiry — date filter
Data refresh cadence
- Home Office publishes the register monthly (typically 1st of the month)
- Phase-1 ingest runs nightly until every licence is either matched or classified
- Phase-2 fuzzy match runs once per refresh, then on demand