
Introducing CBLRE: a public benchmark for Canadian bilingual legal AI

The Canadian Bilingual Legal & Regulatory Evaluation is open — six tracks, bilingual ground truth, reproducible scoring. No model scores in the dataset card; those come separately, after expert validation.

By the SimpleDirect team · Toronto · June 8, 2026 · 6 min read

The gap

If you procure AI for Canadian regulated work — legal, privacy, constitutional — you have a problem: there is no standard public benchmark that measures what actually matters. US legal benchmarks are common-law and English-only. General multilingual benchmarks test French fluency but not legal French. None of them touch Quebec civil law as a distinct tradition. None measure bilingual parity as a first-class metric.

A model can score well on a US legal benchmark and still be unusable for a Quebec notarial workflow, a federal privacy-impact assessment, or a Charter analysis. The benchmarks weren't measuring those things.

CBLRE — the Canadian Bilingual Legal & Regulatory Evaluation — is the public benchmark that does.

What CBLRE is

CBLRE is an open test set for evaluating language models on Canadian bilingual legal and regulatory reasoning. The current release (v1.0) contains 129 expert-reviewed items across six active tracks. Each item is structured as a JSON record with a stable identifier, the prompt, the validated ground truth, and provenance documentation.
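To make the record shape concrete, here is a minimal Python sketch that parses one JSONL line into a typed record. The key names (`id`, `prompt`, `ground_truth`, `provenance`) and the sample values are assumptions made for illustration; the published schema may use different keys and carry more fields.

```python
import json
from dataclasses import dataclass

# Hypothetical key names: the article says each record has an identifier,
# a prompt, a ground truth, and provenance, but does not give the exact keys.
REQUIRED_KEYS = ("id", "prompt", "ground_truth", "provenance")


@dataclass(frozen=True)
class Item:
    id: str
    prompt: str
    ground_truth: str
    provenance: dict


def parse_item(line: str) -> Item:
    """Parse one JSONL line and fail loudly if a required key is missing."""
    record = json.loads(line)
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise ValueError(f"item is missing keys: {missing}")
    return Item(
        id=record["id"],
        prompt=record["prompt"],
        ground_truth=record["ground_truth"],
        provenance=record["provenance"],
    )


# Placeholder record for illustration only, not an actual CBLRE item.
sample_line = json.dumps({
    "id": "example-0001",
    "prompt": "Placeholder question text",
    "ground_truth": "B",
    "provenance": {"reviewed": True},
})
item = parse_item(sample_line)
print(item.id, item.ground_truth)
```

Rejecting incomplete records at load time keeps a malformed item from silently counting as a wrong answer later in scoring.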

The tracks:

| Track | Items | Language | Item type |
|---|---|---|---|
| Common law | 21 | EN | Multiple-choice doctrine |
| Quebec civil law | 20 | FR | Multiple-choice, Civil Code of Québec |
| Constitutional / Charter | 22 | EN | Multiple-choice + structured analysis |
| Privacy compliance | 22 | EN/FR | Multiple-choice, bilingual-paired |
| Citation integrity | 22 | EN | Citation production / validation |
| Safety calibration | 22 | EN | Refusal vs. answer classification |

Two further tracks are defined in the methodology — grounded retrieval (RAG) and general-capability retention — and are evaluated via companion held-out sets and standard public benchmarks respectively.

Why this structure

Three design choices distinguish CBLRE from the alternatives:

Quebec civil law is its own track, in French. Most "Canadian legal" benchmarks (when they exist at all) treat civil law as an English-language variant of common law. CBLRE treats it as what it is: a distinct legal tradition, reasoned in French, anchored in the Civil Code of Québec.

Bilingual parity is measured, not averaged. The privacy-compliance track is built as matched English/French pairs. The parity ratio (FR accuracy divided by EN accuracy) is reported for each track that has matched pairs. A model can be strong in English privacy reasoning and weak in French privacy reasoning; CBLRE shows that gap instead of blending it into a misleading average.
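The parity ratio is simple arithmetic, and a short Python sketch shows how it differs from a blended average. The function name, the `(track, language, is_correct)` tuple layout, and the results below are made up for illustration; they are not CBLRE data or CBLRE code.

```python
from collections import defaultdict


def parity_ratios(results):
    """Return {track: FR accuracy / EN accuracy} for tracks scored in both languages.

    `results` is an iterable of (track, language, is_correct) tuples.
    Tracks where the ratio is undefined (no items in one language, or
    zero EN accuracy) are left out of the returned dict.
    """
    counts = defaultdict(lambda: {"EN": [0, 0], "FR": [0, 0]})  # [correct, total]
    for track, language, is_correct in results:
        if language not in ("EN", "FR"):
            continue
        bucket = counts[track][language]
        bucket[0] += int(is_correct)
        bucket[1] += 1

    ratios = {}
    for track, by_language in counts.items():
        en_correct, en_total = by_language["EN"]
        fr_correct, fr_total = by_language["FR"]
        if en_total == 0 or fr_total == 0 or en_correct == 0:
            continue
        ratios[track] = (fr_correct / fr_total) / (en_correct / en_total)
    return ratios


# Made-up results: 4 of 4 correct in English, 3 of 4 correct in French.
made_up = (
    [("privacy", "EN", True)] * 4
    + [("privacy", "FR", True)] * 3
    + [("privacy", "FR", False)]
)
ratios = parity_ratios(made_up)
print(ratios)  # {'privacy': 0.75}
```

In this made-up case a blended average would report 7 of 8 correct (87.5%), which hides that French accuracy is only three quarters of English accuracy.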

The dataset card publishes no model scores. CBLRE v1.0 documents the instrument: tracks, item structure, scoring methods, validation status, limitations. Model scores are published separately, only after expert validation, and each is tied to the version of the item bank that produced it. Scores from different versions are not comparable, so every report must cite the version used.

Scoring methods

Each item declares its scoring method explicitly:

  • Multiple-choice exact-match — graded on the model's final committed answer; loglikelihood scoring is preferred where the harness supports it.
  • Citation validation — checks for a correct, well-formed legal citation against a reference, rather than a single letter.
  • Refusal / answer calibration — verifies the model refuses unauthorized-practice requests and answers benign informational ones.
  • Grounded retrieval exact-match — source-identity match on a leak-proof held-out set (companion release).

Scoring is deterministic given fixed decoding. Any score reported from CBLRE must be regenerable from the published items and code.
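A rough Python sketch of what dispatching on a declared scoring method could look like follows. It is not the published scoring code: the `scoring_method` and `ground_truth` keys, the method names, the `Answer: X` output convention, the refusal markers, and the citation normalization are all assumptions chosen to keep the example short.

```python
import re

# Hypothetical refusal openers; real refusal detection would need to be richer.
REFUSAL_MARKERS = ("i can't", "i cannot", "i am unable")


def score_multiple_choice(output, gold):
    """Grade the last committed answer, assuming an 'Answer: X' convention."""
    committed = re.findall(r"Answer:\s*([A-D])\b", output)
    return bool(committed) and committed[-1] == gold


def score_citation(output, gold):
    """Compare citations after collapsing whitespace and ignoring case."""
    def normalize(text):
        return " ".join(text.split()).casefold()
    return normalize(output) == normalize(gold)


def score_refusal(output, gold):
    """`gold` is 'refuse' or 'answer'; the model must behave accordingly."""
    refused = output.strip().lower().startswith(REFUSAL_MARKERS)
    return refused == (gold == "refuse")


def score_retrieval(output, gold):
    """Source-identity exact match."""
    return output.strip() == gold


# Hypothetical method names mapped to their scorers.
SCORERS = {
    "multiple_choice": score_multiple_choice,
    "citation": score_citation,
    "refusal": score_refusal,
    "retrieval": score_retrieval,
}


def score_item(item, output):
    """Dispatch to the scorer the item declares; unknown methods are an error."""
    method = item["scoring_method"]
    if method not in SCORERS:
        raise ValueError(f"unknown scoring method: {method!r}")
    return SCORERS[method](output, item["ground_truth"])


mc_item = {"scoring_method": "multiple_choice", "ground_truth": "B"}
print(score_item(mc_item, "Answer: A. On reflection, Answer: B"))
```

Every scorer here is a pure function of the model output and the gold answer, so the same outputs always regenerate the same score.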

Status: development-grade, growing

CBLRE v1.0 is the first public release of an evolving benchmark. The current item bank is an expert-reviewed starting point, not a final set. Numbers derived from v1.0 are development-grade and will be superseded by larger, versioned releases.

We're publishing at this stage rather than waiting for a "complete" set because the standard is the point. A maintained, growing public benchmark is more useful to Canadian AI procurement and research than a polished one-time snapshot.

What's actively in development:

  • More items per track — substantially beyond the current 129, each new item passing the same expert-validation gate
  • Broader domain coverage — extending toward tax and benefits, employment and labour, immigration, securities and financial compliance, and healthcare-privacy intersections, as qualified reviewers are secured
  • Structured-analysis items — rubric-scored structured-reasoning items (e.g. full Oakes/Charter analyses) that carry more signal than multiple choice
  • Expanded bilingual pairing — matched EN/FR pairs across more tracks
  • Versioned, DOI-tagged releases — so prior results remain reproducible against the exact item set that produced them

How to use it

For AI researchers and developers:

  1. Load the item bank (JSONL).
  2. Serve the model under test with greedy decoding (temperature 0) and a fixed seed.
  3. Run the scoring code, which dispatches each item to its declared scoring method.
  4. Report per-track accuracy, bilingual parity ratios, and — for retrieval — the random baseline alongside the score.
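The four steps above can be sketched end to end in Python. Everything named here is an assumption: the `track`, `prompt`, and `ground_truth` keys, the three placeholder items, and the stub that stands in for a model served with temperature 0 and a fixed seed. A real run would replace the stub and use the published scoring code.

```python
import io
import json
from collections import defaultdict


def load_items(stream):
    """Read one JSON record per non-empty line (JSONL)."""
    return [json.loads(line) for line in stream if line.strip()]


def evaluate(items, generate, score):
    """Run each item through `generate`, score it, and return per-track accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        output = generate(item["prompt"])
        correct[item["track"]] += int(score(item, output))
        total[item["track"]] += 1
    return {track: correct[track] / total[track] for track in total}


def random_baseline(num_candidates):
    """Expected accuracy of uniform random guessing among the candidates."""
    return 1.0 / num_candidates


# Placeholder item bank held in memory; a real run would open the JSONL file.
jsonl = io.StringIO(
    '{"id": "x-1", "track": "common_law", "prompt": "Q1", "ground_truth": "A"}\n'
    '{"id": "x-2", "track": "common_law", "prompt": "Q2", "ground_truth": "B"}\n'
    '{"id": "x-3", "track": "retrieval", "prompt": "Q3", "ground_truth": "doc-7"}\n'
)
items = load_items(jsonl)


def stub_generate(prompt):
    """Stand-in for the model under test; returns fixed, made-up outputs."""
    return {"Q1": "A", "Q2": "C", "Q3": "doc-7"}[prompt]


def exact_match(item, output):
    return output.strip() == item["ground_truth"]


report = evaluate(items, stub_generate, exact_match)
print(report)               # {'common_law': 0.5, 'retrieval': 1.0}
print(random_baseline(10))  # 0.1
```

Reporting the random baseline next to a retrieval score matters because the chance level depends on how many candidate sources each item offers, so the raw accuracy alone does not say how far above chance a model is.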

For procurement officers and RFP authors: CBLRE is intended to be citable in procurement evaluations and RFP scoring rubrics. The vendor-neutral structure — the same protocol applies to any model, including our own — means it can be specified as the evaluation standard in tender documents.

For legal practitioners and researchers: the track structure maps to real practice domains. A privacy lawyer evaluating an AI tool can look at the privacy parity ratio. A constitutional scholar can look at the Charter track. The methodology document explains how to interpret each.

Honest limitations

  • The set is small and growing. v1.0 is expert-reviewed but development-grade.
  • CBLRE is authored by SimpleDirect; items began as AI-assisted drafts and are corrected under expert review. Three items with incorrect gold answers were caught and removed during the current review — documented here for transparency.
  • CBLRE measures legal correctness; Quebec dialect and register quality requires native-Quebec-French human raters and is assessed separately in the methodology.
  • On settled doctrine, capable models can score near the top of multiple-choice tracks. The retrieval and citation tracks carry more signal between strong models.

Why we're publishing this

There was no Canadian bilingual legal AI benchmark before. There is one now. We built it because the work we're doing — training and shipping AI for Canadian regulated workflows — required a measurement standard that did not exist. Building it for ourselves alone would have been a missed opportunity. Publishing it makes the standard available to everyone evaluating AI for Canadian context: procurement, research, academia, and other vendors.

The instrument is vendor-neutral. We report our own models' scores against it. We expect others to do the same.

Cite this

SimpleDirect® (Alpine Pacific Trading Inc.), "CBLRE: Canadian Bilingual Legal & Regulatory Evaluation (v1.0)," June 2026.

Versioned releases will carry DOIs once hosted.

SimpleDirect®, operating as Alpine Pacific Trading Inc., is a Toronto-based team building open-weight, bilingual Canadian-context AI models you can download, run, and own.
