Introducing CBLRE: a public benchmark for Canadian bilingual legal AI

The gap

If you procure AI for Canadian regulated work — legal, privacy, constitutional — you have a problem: there is no standard public benchmark that measures what actually matters. US legal benchmarks are common-law and English-only. General multilingual benchmarks test French fluency but not legal French. None of them touch Quebec civil law as a distinct tradition. None measure bilingual parity as a first-class metric.

A model can score well on a US legal benchmark and still be unusable for a Quebec notarial workflow, a federal privacy-impact assessment, or a Charter analysis. The benchmarks weren't measuring those things.

CBLRE — the Canadian Bilingual Legal & Regulatory Evaluation — is the public benchmark that does.

What CBLRE is

CBLRE is an open test set for evaluating language models on Canadian bilingual legal and regulatory reasoning. The current release (v1.0) contains 129 expert-reviewed items across six active tracks. Each item is structured as a JSON record with a stable identifier, the prompt, the validated ground truth, and provenance documentation.
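As a rough illustration of the record structure described above, the sketch below parses one JSONL line into a typed record. The field names (`item_id`, `track`, `language`, `scoring`, `prompt`, `ground_truth`, `provenance`) are assumptions chosen for readability, not the published schema, and the example values are placeholders rather than a real item.

```python
import json
from dataclasses import dataclass

# Hypothetical field names; the published CBLRE schema may differ.
REQUIRED_FIELDS = ("item_id", "track", "language", "scoring",
                   "prompt", "ground_truth", "provenance")


@dataclass(frozen=True)
class Item:
    item_id: str       # stable identifier
    track: str         # e.g. "privacy_compliance"
    language: str      # "EN" or "FR"
    scoring: str       # declared scoring method
    prompt: str
    ground_truth: str  # validated answer
    provenance: dict   # review and authorship documentation


def parse_item(line: str) -> Item:
    """Parse one JSONL line, failing loudly if a required field is absent."""
    record = json.loads(line)
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"item is missing fields: {missing}")
    return Item(**{name: record[name] for name in REQUIRED_FIELDS})


# Placeholder record, not an actual benchmark item.
example_line = json.dumps({
    "item_id": "example-0001",
    "track": "privacy_compliance",
    "language": "FR",
    "scoring": "multiple_choice",
    "prompt": "<prompt text>",
    "ground_truth": "B",
    "provenance": {"review_status": "expert-reviewed"},
})
item = parse_item(example_line)
print(item.item_id, item.track, item.language)
```

Rejecting records with missing fields at load time, rather than at scoring time, keeps a malformed item from silently dropping out of a track total.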

The tracks:

Track	Items	Language	Item type
Common law	21	EN	Multiple-choice doctrine
Quebec civil law	20	FR	Multiple-choice, Civil Code of Québec
Constitutional / Charter	22	EN	Multiple-choice + structured analysis
Privacy compliance	22	EN/FR	Multiple-choice, bilingual-paired
Citation integrity	22	EN	Citation production / validation
Safety calibration	22	EN	Refusal vs. answer classification

Two further tracks are defined in the methodology — grounded retrieval (RAG) and general-capability retention — and are evaluated via companion held-out sets and standard public benchmarks respectively.

Why this structure

Three design choices distinguish CBLRE from the alternatives:

Quebec civil law is its own track, in French. Most "Canadian legal" benchmarks (when they exist at all) treat civil law as an English-language variant of common law. CBLRE treats it as what it is: a distinct legal tradition, reasoned in French, anchored in the Civil Code of Québec.

Bilingual parity is measured, not averaged. The privacy-compliance track is built as matched English/French pairs. The parity ratio (FR accuracy divided by EN accuracy) is reported for each bilingual-paired track. A model can be strong in English privacy reasoning and weak in French privacy reasoning; CBLRE will show that, not blend it into a misleading average.
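The parity ratio is simple arithmetic, and a small worked example shows why it is reported instead of a blended average. The sketch below is illustrative only: the function names are hypothetical and the per-item outcomes are made up, not results from any model.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of items answered correctly."""
    if not correct:
        raise ValueError("no results to score")
    return sum(correct) / len(correct)


def parity_ratio(en_correct: Sequence[bool], fr_correct: Sequence[bool]) -> float:
    """FR accuracy divided by EN accuracy on a matched-pair track."""
    en_acc = accuracy(en_correct)
    if en_acc == 0:
        raise ValueError("parity ratio is undefined when EN accuracy is 0")
    return accuracy(fr_correct) / en_acc


# Made-up outcomes for ten matched EN/FR pairs.
en = [True] * 9 + [False] * 1   # 9 of 10 correct in English
fr = [True] * 6 + [False] * 4   # 6 of 10 correct in French

blended = accuracy(en + fr)
ratio = parity_ratio(en, fr)
print(f"blended={blended:.2f} parity={ratio:.2f}")
```

With these invented numbers the blended accuracy is 0.75, which looks acceptable, while the parity ratio of roughly 0.67 exposes the French shortfall.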

The dataset card publishes no model scores. CBLRE v1.0 documents the instrument: tracks, item structure, scoring methods, validation status, limitations. Model scores are published separately, only after expert validation, and always tied to the version of the item bank that produced them. Scores from different versions are not comparable, so every report must cite the version used.

Scoring methods

Each item declares its scoring method explicitly:

Multiple-choice exact-match — graded on the model's final committed answer; loglikelihood scoring is preferred where the harness supports it.
Citation validation — checks for a correct, well-formed legal citation against a reference, rather than a single letter.
Refusal / answer calibration — verifies the model refuses unauthorized-practice requests and answers benign informational ones.
Grounded retrieval exact-match — source-identity match on a leak-proof held-out set (companion release).

Scoring is deterministic given fixed decoding. Any score reported from CBLRE must be regenerable from the published items and code.
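One way to picture "each item declares its scoring method" is a dispatch table of pure functions, which is also what makes scoring deterministic for a fixed model output. The sketch below is not the published scoring code. The method keys, the `Answer: X` output convention, the substring citation check, and the refusal markers are all simplifying assumptions, and the grounded-retrieval scorer is omitted because it depends on the companion held-out set.

```python
import re
from typing import Callable, Dict


def score_multiple_choice(output: str, gold: str) -> bool:
    """Grade the last 'Answer: X' the model commits to (assumed output format)."""
    answers = re.findall(r"Answer:\s*([A-D])\b", output)
    return bool(answers) and answers[-1] == gold


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so spacing differences do not matter."""
    return " ".join(text.split()).lower()


def score_citation(output: str, gold: str) -> bool:
    """Simplified check: the reference citation appears in the output."""
    return _normalize(gold) in _normalize(output)


# Illustrative markers only; a real refusal classifier would be more careful.
REFUSAL_MARKERS = ("i can't", "i cannot", "i am not able to")


def score_refusal(output: str, gold: str) -> bool:
    """gold is 'refuse' or 'answer'; compare it with a crude marker-based guess."""
    refused = any(marker in output.lower() for marker in REFUSAL_MARKERS)
    return ("refuse" if refused else "answer") == gold


# Hypothetical method keys mapped to their scorers.
SCORERS: Dict[str, Callable[[str, str], bool]] = {
    "multiple_choice": score_multiple_choice,
    "citation": score_citation,
    "refusal": score_refusal,
}


def score(method: str, output: str, gold: str) -> bool:
    """Dispatch one model output to the scorer its item declares."""
    try:
        scorer = SCORERS[method]
    except KeyError:
        raise ValueError(f"unknown scoring method: {method!r}") from None
    return scorer(output, gold)


print(score("multiple_choice", "Perhaps A.\nAnswer: A\nOn reflection, Answer: C", "C"))
```

Taking the last committed answer means a model that revises itself mid-response is graded on where it ends up, not on its first guess.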

Status: development-grade, growing

CBLRE v1.0 is the first public release of an evolving benchmark. The current item bank is an expert-reviewed starting point, not a final set. Numbers derived from v1.0 are development-grade and will be superseded by larger, versioned releases.

We're publishing at this stage rather than waiting for a "complete" set because the standard is the point. A maintained, growing public benchmark is more useful to Canadian AI procurement and research than a polished one-time snapshot.

What's actively in development:

More items per track — substantially beyond the current 129, each new item passing the same expert-validation gate
Broader domain coverage — extending toward tax and benefits, employment and labour, immigration, securities and financial compliance, and healthcare-privacy intersections, as qualified reviewers are secured
Structured-analysis items — rubric-scored structured-reasoning items (e.g. full Oakes/Charter analyses) that carry more signal than multiple choice
Expanded bilingual pairing — matched EN/FR pairs across more tracks
Versioned, DOI-tagged releases — so prior results remain reproducible against the exact item set that produced them

How to use it

For AI researchers and developers:

Load the item bank (JSONL).
Serve the model under test with greedy decoding (temperature 0) and a fixed seed.
Run the scoring code, which dispatches each item to its declared scoring method.
Report per-track accuracy, bilingual parity ratios, and — for retrieval — the random baseline alongside the score.
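The reporting step above can be sketched as a small aggregation over scored items. This is a minimal sketch under stated assumptions: the report keys, track names, and the `(track, is_correct)` row shape are hypothetical, the rows are invented, and the model-serving step is left out entirely. Recording the item-bank version and decoding settings next to the scores is one way to keep a report regenerable.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def per_track_accuracy(rows: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate (track, is_correct) pairs into accuracy per track."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for track, is_correct in rows:
        totals[track] += 1
        hits[track] += int(is_correct)
    return {track: hits[track] / totals[track] for track in totals}


def build_report(rows: Iterable[Tuple[str, bool]],
                 item_bank_version: str,
                 decoding: Dict[str, int]) -> dict:
    """Bundle scores with the version and decoding settings that produced them."""
    return {
        "item_bank_version": item_bank_version,
        "decoding": decoding,
        "per_track_accuracy": per_track_accuracy(rows),
    }


# Invented scored rows, not real results.
rows = [
    ("common_law", True),
    ("common_law", False),
    ("quebec_civil_law", True),
]
report = build_report(rows, "v1.0", {"temperature": 0, "seed": 0})
print(report)
```

Parity ratios and the retrieval random baseline would be added as further keys in the same report.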

For procurement officers and RFP authors: CBLRE is intended to be citable in procurement evaluations and RFP scoring rubrics. The vendor-neutral structure — the same protocol applies to any model, including our own — means it can be specified as the evaluation standard in tender documents.

For legal practitioners and researchers: the track structure maps to real practice domains. A privacy lawyer evaluating an AI tool can look at the privacy parity ratio. A constitutional scholar can look at the Charter track. The methodology document explains how to interpret each.

Honest limitations

The set is small and growing. v1.0 is expert-reviewed but development-grade.
CBLRE is authored by SimpleDirect; items began as AI-assisted drafts and are corrected under expert review. Three items with incorrect gold answers were caught and removed during the current review — documented here for transparency.
CBLRE measures legal correctness; Quebec dialect and register quality requires native-Quebec-French human raters and is assessed separately in the methodology.
On settled doctrine, capable models can score near the ceiling of the multiple-choice tracks, so those tracks separate strong models poorly. The retrieval and citation tracks discriminate better between strong models.

Why we're publishing this

There was no Canadian bilingual legal AI benchmark before. There is one now. We built it because the work we're doing — training and shipping AI for Canadian regulated workflows — required a measurement standard that did not exist. Building it for ourselves alone would have been a missed opportunity. Publishing it makes the standard available to everyone evaluating AI for Canadian context: procurement, research, academia, and other vendors.

The instrument is vendor-neutral. We report our own models' scores against it. We expect others to do the same.

Cite this

SimpleDirect® (Alpine Pacific Trading Inc.), "CBLRE: Canadian Bilingual Legal & Regulatory Evaluation (v1.0)," June 2026.

Versioned releases will carry DOIs once hosted.

Canadian AI Evaluation Methodology v1.0 — the protocol CBLRE operationalizes: what to measure for Canadian regulated AI work, and why.
Model Benchmarking Methodology v1.0 — how we measure our own models using CBLRE and the broader evaluation suite, reproducibly.

SimpleDirect®, operating as Alpine Pacific Trading Inc., is a Toronto-based team building open-weight, bilingual Canadian-context AI models you can download, run, and own.
