
Evaluating AI for Canadian regulated work: a methodology

A vendor-neutral protocol for measuring how AI models perform on Canadian legal, privacy, and constitutional reasoning — bilingual, reproducible, and citable in procurement.

By the SimpleDirect team · Toronto · June 8, 2026 · 7 min read

Why this exists

Most public evaluation suites for language models are US-centric, common-law, and English-only. They do not measure the things that determine whether a model is fit for Canadian regulated work.

A model can score 90% on MMLU professional law and still:

  • Misapply the Civil Code of Québec because it was reasoning under common-law assumptions
  • Fabricate a citation that looks plausible but does not exist
  • Reason about law in French at a fraction of its English accuracy
  • Refuse benign informational queries while answering unauthorized-practice ones

None of these failure modes are visible on the benchmarks that procurement officers, RFP authors, and AI buyers currently rely on. So this methodology defines what to measure and how to measure it, for the things that matter when AI meets Canadian regulated work.

It is vendor-neutral: the same protocol applies to our own models and to any third-party model. It is intended to be citable in procurement evaluations, RFP scoring rubrics, and academic work on Canadian-context AI.

Governing principles

Five principles govern every choice in this methodology.

Reproducibility over headline numbers. Every score must be regenerable from published items, scoring code, and fixed decoding settings. A number that cannot be reproduced is not reported. This sounds obvious. It is not common practice.

Robust scoring over generation parsing. Where possible, use loglikelihood-based multiple-choice scoring rather than parsing a letter out of free-form text. Generation parsing is fragile with verbose, reasoning-style models — the same model can score differently on the same items just by emitting more chain-of-thought. The methodology treats this as a defect to be fixed before any number is published.
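A minimal sketch of loglikelihood-based multiple-choice scoring, assuming the evaluation harness already supplies one log-likelihood per answer option (the function names and the input shape here are illustrative, not part of the methodology). No text is generated or parsed, so response verbosity cannot affect the result.

```python
from typing import Sequence


def pick_option(loglikelihoods: Sequence[float]) -> int:
    """Return the index of the option with the highest log-likelihood."""
    if not loglikelihoods:
        raise ValueError("at least one option is required")
    return max(range(len(loglikelihoods)), key=lambda i: loglikelihoods[i])


def mcq_accuracy(items: Sequence[tuple[Sequence[float], int]]) -> float:
    """Score (per-option log-likelihoods, gold index) pairs as plain accuracy."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(1 for lls, gold in items if pick_option(lls) == gold)
    return correct / len(items)
```

How the per-option log-likelihoods are obtained (and whether they are length-normalized) is left to the harness and should be stated with the decoding settings.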

Leak-proofing by construction. Held-out evaluation items must be built so the answer cannot be recovered by string-matching the prompt, and so that no training document trivially contains the answer.

Expert validation before publication. Machine-generated items are drafts, never ground truth. Each published item must be reviewed by a person qualified in the relevant area of Canadian law, and for French items, in legal French.

Bilingual parity is a first-class metric. A model that is strong in English and weak in French has failed a Canadian bilingual requirement, even if its average looks acceptable.

Eight tracks

The methodology defines eight evaluation tracks. Each is scored independently; there is no single blended score, because a procurement officer cares about the specific competency relevant to their workflow.

Track | What it measures | Scoring method
Common law | Doctrine across Canada's common-law jurisdictions | MCQ, loglikelihood or final-answer extraction
Quebec civil law | Civil Code of Québec reasoning, in French | MCQ, loglikelihood or final-answer extraction
Constitutional / Charter | Charter rights, s.1 proportionality, division of powers | MCQ + structured-analysis rubric
Privacy compliance | PIPEDA and provincial privacy reasoning, EN/FR | MCQ, reported with bilingual parity ratio
Citation integrity | Production of correct, verifiable legal citations | Citation-pattern validation against a reference
Safety calibration | Refuse unauthorized legal advice; answer benign queries | Refusal/answer classification
Grounded retrieval (RAG) | Correct source attribution in a retrieval setting | Exact-match on document identity (leak-proof set)
General-capability retention | No catastrophic forgetting from specialization | Standard public benchmarks (MMLU, etc.)

The first six are operationalized in the CBLRE Evaluation Suite public release. The retrieval track uses a separate leak-proof companion set. The general-capability track uses standard public benchmarks to verify that legal specialization has not destroyed general competence.
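As an illustration of what "citation-pattern validation against a reference" could look like for the citation-integrity track, here is a hedged sketch. It assumes citations follow the Canadian neutral-citation shape (year, court identifier, decision number) and that a reference set of known-good citations is available; the regular expression, function names, and the reference set are assumptions for illustration, not the suite's actual validator.

```python
import re

# Assumed shape: "<year> <court identifier> <decision number>", e.g. "2019 SCC 65".
NEUTRAL_CITATION = re.compile(r"\b((?:19|20)\d{2})\s+([A-Z]{2,6})\s+(\d{1,5})\b")


def find_citations(text: str) -> list[str]:
    """Return every neutral-citation-shaped string found in the text."""
    return [" ".join(parts) for parts in NEUTRAL_CITATION.findall(text)]


def check_citations(text: str, reference: set[str]) -> dict[str, bool]:
    """Map each cited string to whether it exists in the reference set."""
    return {c: c in reference for c in find_citations(text)}
```

A citation that matches the pattern but is absent from the reference is the "plausible but does not exist" failure mode; a pattern match alone is never evidence that a citation is real.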

Bilingual parity, measured properly

Bilingual competence is measured as a parity ratio, not two unrelated scores. For a track available in both languages, matched item pairs test the same competency in English and in Canadian French. Each language is scored separately, and a parity ratio (FR accuracy / EN accuracy) is reported per track.

A ratio near 1.0 indicates balanced bilingual competence. A ratio well below 1.0 indicates the model is materially weaker in French and is not fit for a bilingual Canadian requirement, regardless of how its English score looks.

Parity is reported per track. A model may show parity on privacy reasoning but not on civil-law reasoning. These are distinct findings and must not be averaged away.
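The parity ratio can be sketched as follows, assuming matched item pairs so that both languages share the same item count. The function names and the counts in the usage note are made up for illustration.

```python
def parity_ratio(fr_correct: int, en_correct: int, n_pairs: int) -> float:
    """FR accuracy divided by EN accuracy over matched item pairs."""
    if n_pairs <= 0:
        raise ValueError("n_pairs must be positive")
    en_acc = en_correct / n_pairs
    fr_acc = fr_correct / n_pairs
    if en_acc == 0:
        raise ValueError("parity is undefined when EN accuracy is zero")
    return fr_acc / en_acc


def parity_by_track(counts: dict[str, tuple[int, int, int]]) -> dict[str, float]:
    """Report one ratio per track; tracks are never averaged together."""
    return {track: parity_ratio(*c) for track, c in counts.items()}
```

With hypothetical counts of 81 correct in French and 90 in English over 100 matched pairs, the ratio is roughly 0.90; 45 against 90 gives roughly 0.50. The two stay separate findings in the output.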

Quebec French requires its own treatment

Quebec legal French is not interchangeable with Metropolitan (France) French. The methodology distinguishes two separable questions:

  • Legal correctness — is the substantive answer right under Quebec civil law? Scored programmatically against validated ground truth.
  • Register and terminology — does the model use correct Quebec civil-law vocabulary and professional register, as opposed to France-French or anglicism-laden phrasing? This requires native-Quebec-French human raters and is assessed separately from correctness.

Public French benchmarks (multilingual MMLU, BeleBele) measure general French comprehension. They do not certify Quebec dialect, register, or civil-law terminology. The methodology states this limitation explicitly rather than letting a general-French score stand in for Quebec-French competence.

Leak-proof retrieval

The grounded-retrieval track is the strongest test of genuine Canadian-context capability because it cannot be satisfied by memorized doctrine. Its construction rules:

  • Source documents are drawn from a corpus held out by date — e.g. annual statutes from a year excluded from training, while training used the consolidated base acts
  • Each item presents several candidate source passages and asks which one concerns a named topic
  • The topic label is taken from the passage's marginal note and then stripped from the displayed text, so the answer cannot be recovered by string-matching the prompt
  • Distractors include passages from the same statute, so the act title alone does not solve the item
  • Scored by exact match on the source identity; the random baseline is reported alongside the score

A high retrieval score under this construction reflects genuine topic-to-source attribution, not recall of training text.
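The construction rules above can be sketched as a small item builder. Everything here is an assumption for illustration: the `Passage` fields, the `build_item` name, and the choice to guarantee exactly one same-statute distractor before filling the rest from other statutes.

```python
import random
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    doc_id: str         # hypothetical source identity, e.g. statute plus section
    statute: str
    marginal_note: str  # the topic label
    text: str


def strip_label(text: str, label: str) -> str:
    """Remove the topic label from displayed text so string-matching cannot solve the item."""
    return re.sub(re.escape(label), "", text, flags=re.IGNORECASE).strip()


def build_item(target: Passage, pool: list[Passage], n_options: int,
               rng: random.Random) -> dict:
    same = [p for p in pool if p.statute == target.statute and p.doc_id != target.doc_id]
    other = [p for p in pool if p.statute != target.statute]
    if not same:
        raise ValueError("need at least one same-statute distractor")
    rng.shuffle(same)
    rng.shuffle(other)
    # One same-statute distractor first, so the act title alone cannot solve the item.
    distractors = (same[:1] + other + same[1:])[: n_options - 1]
    if len(distractors) < n_options - 1:
        raise ValueError("not enough distractors in the pool")
    options = [target] + distractors
    rng.shuffle(options)
    return {
        "topic": target.marginal_note,
        "options": [(p.doc_id, strip_label(p.text, target.marginal_note)) for p in options],
        "gold": target.doc_id,
        "random_baseline": 1 / len(options),
    }
```

Scoring is then an exact comparison of the predicted `doc_id` with `gold`, reported next to `random_baseline`. Holding the corpus out by date happens upstream of this function and is not shown.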

Scoring robustness for reasoning models

Modern models often emit extended chain-of-thought before committing to an answer. Naïve answer extraction can capture a letter from the reasoning rather than the final conclusion, producing scores that change with token budget even though the model and items are fixed.

The methodology requires that multiple-choice extraction (where loglikelihood scoring is not used) take the model's final committed answer — the last answer-commitment in the response — and that the extractor be validated by confirming that scores are stable across token budgets. A scorer whose output changes with response length is treated as a defect to be fixed before any numbers are reported.
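A hedged sketch of final-answer extraction and a per-item stability check. It assumes the prompt asks the model to commit with a phrase such as "Answer: B" or "Final answer: (B)" and that options are lettered A to D; the pattern and function names are illustrative and would need adapting to the actual prompt template.

```python
import re
from typing import Optional

# Assumed commitment format. The cue is case-insensitive; the letter must be uppercase
# so that phrases like "the answer is a matter of..." are not read as option A.
_COMMIT = re.compile(r"(?i:(?:final\s+)?answer\s*(?:is|:))\s*\(?([A-D])\)?(?![A-Za-z])")


def extract_final_answer(response: str) -> Optional[str]:
    """Return the LAST committed answer letter, or None if there is none."""
    matches = _COMMIT.findall(response)
    return matches[-1] if matches else None


def is_stable(responses_by_budget: dict[int, str]) -> bool:
    """True if every token budget yields the same, non-missing extracted answer."""
    answers = {extract_final_answer(r) for r in responses_by_budget.values()}
    return len(answers) == 1 and None not in answers
```

Taking the last match is what separates the committed answer from letters mentioned mid-reasoning. A response truncated before any commitment yields `None`, which `is_stable` treats as instability rather than silently scoring it wrong.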

The expert-validation gate

No track score is publishable until its items pass expert validation:

  • Each item is reviewed by a person qualified in the relevant area of Canadian law
  • French and Quebec civil-law items are additionally reviewed by a reviewer competent in legal French
  • Items with incorrect gold answers, ambiguous phrasing, or fabricated citations are corrected or removed before release
  • Until this review is complete, results are released as a clearly-labelled preview, with the validation status stated on every reported number

Reporting requirements

Any result reported under this methodology must state:

  • The exact model and checkpoint evaluated
  • Per-track scores (never a single blended number)
  • Bilingual parity ratios where applicable
  • The random baseline for retrieval tracks
  • Few-shot counts and decoding settings
  • Validation status of the items used

Regressions must be reported with the same prominence as gains.
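One way to enforce these requirements mechanically is a completeness check run before any result is published. The sketch below assumes a report is a plain dictionary; the key names and the function are hypothetical, chosen only to mirror the list above.

```python
# Hypothetical key names mirroring the required items.
REQUIRED_KEYS = ("model", "checkpoint", "track_scores", "few_shot",
                 "decoding", "validation_status")


def missing_report_fields(report: dict, bilingual_tracks: set[str],
                          retrieval_tracks: set[str]) -> list[str]:
    """List every required item that the report fails to state."""
    missing = [k for k in REQUIRED_KEYS if k not in report]
    scores = report.get("track_scores", {})
    for track in sorted(scores):
        if track in bilingual_tracks and track not in report.get("parity_ratios", {}):
            missing.append(f"parity_ratios[{track}]")
        if track in retrieval_tracks and track not in report.get("random_baselines", {}):
            missing.append(f"random_baselines[{track}]")
    return missing
```

An empty list means the report states everything required; it says nothing about whether the stated values are correct.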

Why we built this

Building the standard for ourselves alone would have been a missed opportunity. The Canadian AI procurement landscape — federal, provincial, professional services, regulated enterprise — needs a measurement instrument that is vendor-neutral, reproducible, and specific to Canadian context. There was no such instrument. So we built one and made it public.

We expect to be judged against it ourselves.

Cite this

SimpleDirect® (Alpine Pacific Trading Inc.), "Canadian Regulated-Workflow Evaluation Methodology (v1.0)," June 2026.

SimpleDirect®, operating as Alpine Pacific Trading Inc., is a Toronto-based team building open-weight, bilingual Canadian-context AI models you can download, run, and own.
