How we measure our models: the benchmarking protocol

Why this document exists

Most AI model launches publish numbers. Few publish the protocol that produced them.

This document specifies the full benchmarking protocol applied to SimpleDirect models — the complete evaluation suite, harnesses, serving configuration, scoring rules, and reporting standards. It is the measurement companion to the Canadian AI Evaluation Methodology, which defines the Canadian-legal tracks specifically.

Its purpose is simple: make every published number reproducible and every comparison fair.

If you cannot regenerate our scores from our published configuration, we did not benchmark; we marketed. We have tried to make sure you can regenerate them.

Five principles

Identical conditions for comparison. A fine-tuned model and its base are evaluated with the same prompts, few-shot counts, scoring code, decoding settings, and serving stack. Differences in results are then attributable to the model, not the test setup.

Reproducibility. Every score is regenerable from published items/tasks, the harness version, and fixed decoding (greedy, temperature 0, fixed seed).

Robust scoring over fragile parsing. Loglikelihood multiple-choice scoring is preferred. Where generation scoring is required, the extractor takes the model's final committed answer and is validated for stability across token budgets.

Per-capability reporting, no blended score. Results are reported per benchmark and per track. Regressions are reported with the same prominence as gains.

Leak-proofing for retrieval. Retrieval evaluation uses date-held-out sources constructed so answers cannot be string-matched from the prompt.
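
The leak-proofing principle can be illustrated with a small check. This is a minimal sketch, not the SimpleDirect construction procedure: the function names, the normalization rules, and the reading of "date-held-out" as "published after a cutoff date" are all assumptions made for illustration.

```python
import re
import unicodedata
from datetime import date


def normalize(text: str) -> str:
    """Lowercase, strip accents, and collapse punctuation/whitespace to single spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", unaccented.lower()).strip()


def leaks_answer(prompt: str, gold_source: str) -> bool:
    """True if the gold source name appears in the prompt as a whole-word match."""
    needle = normalize(gold_source)
    if not needle:
        return False
    # Pad with spaces so "act" does not match inside "actual".
    return f" {needle} " in f" {normalize(prompt)} "


def is_held_out(source_date: date, cutoff: date) -> bool:
    """True if the source is dated strictly after the (hypothetical) cutoff."""
    return source_date > cutoff
```

An item would be kept only if `leaks_answer` is false and `is_held_out` is true. Normalizing accents matters for bilingual items, where the same source may appear with or without diacritics.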

The evaluation suite

Every SimpleDirect model is measured across seven families, chosen so that specialization gains and any capability regressions are both visible.

Family	What it covers	Harness / source
General capability	MMLU, ARC-Challenge, HellaSwag, TruthfulQA (MC1/MC2), GSM8K, BBH	lm-evaluation-harness
French general	Belebele FR, MGSM FR, ARC FR, HellaSwag FR	lm-evaluation-harness
Legal knowledge	MMLU professional / international law / jurisprudence; Global-MMLU FR equivalents	lm-evaluation-harness
Instruction-following	IFEval (prompt-strict)	lm-evaluation-harness
Canadian legal (CBLRE)	6 tracks: common law, Quebec civil law, Charter, privacy, citation, safety	SimpleDirect CBLRE + scorer
Retrieval (RAG)	Leak-proof held-out Canadian source attribution	SimpleDirect held-out set
Function-calling	BFCL v4 (single- and multi-turn)	Berkeley Function-Calling Leaderboard

The suite is run in full on every model. No subset is cherry-picked. Specialization tradeoffs — including any regressions from base — are visible because the full set is run and reported.

Serving configuration

Models are served identically for the model under test and the base comparison:

Setting	Value
Engine	vLLM (OpenAI-compatible endpoint)
Precision	bf16
Decoding	Greedy, temperature 0, fixed seed
Context length	lm-eval tasks: 4,096 tokens; function-calling: 32,768 tokens
Flags	trust-remote-code; deterministic sampler
Isolation	Each model served from its own pinned GPU; no cross-contention during parallel runs

These settings are published not because they are exotic, but because they are the exact configuration under which our numbers were produced. Anyone with access to the model, the items, and our scoring code should be able to reproduce our scores within the noise floor of greedy decoding.
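
One way to enforce identical decoding for the model under test and its base is to build both requests from a single shared settings object. The sketch below only constructs request bodies and sends nothing. vLLM's OpenAI-compatible server accepts the standard completion fields `model`, `prompt`, `temperature`, `max_tokens`, and `seed`; the model names, seed value, and token budget here are hypothetical placeholders, not published SimpleDirect values.

```python
# Shared decoding settings: greedy (temperature 0) with a fixed seed.
# The seed and max_tokens values are illustrative assumptions.
DECODING = {"temperature": 0.0, "seed": 1234, "max_tokens": 256}


def build_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style completion request body with the shared decoding settings."""
    return {"model": model, "prompt": prompt, **DECODING}


def differing_keys(a: dict, b: dict) -> set:
    """Return the keys whose values differ between two request bodies."""
    return {k for k in a.keys() | b.keys() if a.get(k) != b.get(k)}


# Hypothetical model names, for illustration only.
tuned = build_request("example-finetuned-model", "Question: ...")
base = build_request("example-base-model", "Question: ...")

# A fair comparison differs in the model name and nothing else.
assert differing_keys(tuned, base) == {"model"}
```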

Few-shot settings

Few-shot counts are fixed per task and held identical between the model under test and its base:

Task	Few-shot
MMLU / legal-MMLU / Global-MMLU FR	5-shot
ARC-Challenge / ARC FR	25-shot
HellaSwag / HellaSwag FR	10-shot
TruthfulQA (MC1/MC2)	0-shot
GSM8K	5-shot
BBH	3-shot
Belebele FR	5-shot
MGSM FR	8-shot
IFEval	0-shot
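
The table above can be held as a single mapping and checked against each run before any scores are accepted. This is a sketch under assumptions: the keys are this document's task labels rather than harness task identifiers, and `mismatched_tasks` is a hypothetical helper, not part of lm-evaluation-harness.

```python
# Few-shot counts from the table above, keyed by this document's task labels.
FEW_SHOT = {
    "MMLU": 5, "legal-MMLU": 5, "Global-MMLU FR": 5,
    "ARC-Challenge": 25, "ARC FR": 25,
    "HellaSwag": 10, "HellaSwag FR": 10,
    "TruthfulQA (MC1/MC2)": 0,
    "GSM8K": 5, "BBH": 3,
    "Belebele FR": 5, "MGSM FR": 8,
    "IFEval": 0,
}


def mismatched_tasks(run_counts: dict) -> list:
    """Return tasks whose few-shot count is missing or differs from the protocol."""
    return sorted(task for task, n in FEW_SHOT.items() if run_counts.get(task) != n)
```

Running the same check on the model under test and on its base, and requiring an empty list from both, is one simple way to guarantee that the two were evaluated with identical few-shot counts.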

Scoring rules

Multiple-choice. Loglikelihood scoring is used wherever the harness supports it: the highest-probability option is selected directly, with no parsing of generated text. Where generation scoring is unavoidable, the extractor takes the final committed answer and is validated to produce stable scores across token budgets; a scorer whose output changes with response length is treated as a defect.
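
Both multiple-choice scoring modes can be sketched in a few lines. This is an illustration, not the harness or SimpleDirect scorer: the answer pattern, the function names, and the example log-probabilities are assumptions, and a production extractor would need to handle more answer formats (including French phrasing).

```python
import re
from typing import Optional


def pick_option(option_logprobs: dict) -> str:
    """Loglikelihood scoring: choose the option with the highest log-probability."""
    return max(option_logprobs, key=option_logprobs.get)


# Matches phrasings such as "answer is B", "Answer: (C)". Letters must be uppercase A-D.
ANSWER_RE = re.compile(r"(?i:answer)\s*(?i:is)?\s*:?\s*\(?([A-D])\b")


def extract_final_answer(text: str) -> Optional[str]:
    """Generation scoring: take the last answer the model commits to, if any."""
    matches = ANSWER_RE.findall(text)
    return matches[-1] if matches else None


def stable_across_budgets(responses_by_budget: dict) -> bool:
    """True if every token budget yields the same, successfully parsed answer."""
    answers = {extract_final_answer(r) for r in responses_by_budget.values()}
    return len(answers) == 1 and None not in answers


print(pick_option({"A": -2.3, "B": -0.4, "C": -5.1, "D": -3.0}))  # B
print(extract_final_answer("At first the answer is A. On reflection, the answer is (C)."))  # C
```

Taking the last match rather than the first is what makes the extractor follow a model that revises its answer mid-response. The stability check flags items where a short budget cuts the response off before any answer is committed.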

Citation. Citation-track scoring validates a correct, well-formed legal citation against a reference (pattern + identity), not a single letter — so it measures citation production, not multiple-choice luck.
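
The "pattern + identity" idea can be sketched as a two-step check: first confirm the output contains a well-formed citation, then confirm it identifies the same decision as the reference. The sketch below covers only the neutral-citation form (year, court code, decision number, as in "2019 SCC 65"). The regular expression and function names are assumptions for illustration and do not describe the actual CBLRE scorer, which may accept other citation formats.

```python
import re
from typing import Optional, Tuple

# Neutral citation form: four-digit year, uppercase court code, decision number.
NEUTRAL_CITATION = re.compile(r"\b(\d{4})\s+([A-Z]{2,6})\s+(\d{1,5})\b")


def parse_citation(text: str) -> Optional[Tuple[int, str, int]]:
    """Pattern check: return (year, court, number) for the first well-formed citation."""
    m = NEUTRAL_CITATION.search(text)
    if m is None:
        return None
    return int(m.group(1)), m.group(2), int(m.group(3))


def score_citation(output: str, reference: str) -> bool:
    """Identity check: the output's citation must equal the reference citation."""
    ref = parse_citation(reference)
    if ref is None:
        raise ValueError("reference citation is not well-formed")
    return parse_citation(output) == ref
```

Because a bare option letter never parses as a citation, a model cannot score on this track by guessing among choices.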

Retrieval. Exact match on source identity over a leak-proof held-out set; the random baseline is reported alongside the score. Parse rate (the share of outputs yielding a scorable answer) is reported separately from accuracy, since the two are distinct quality signals.
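
The three retrieval figures can be computed together from one pass over the outputs. This sketch makes two assumptions that the protocol text does not state: unparsed outputs count as incorrect in the accuracy denominator, and the random baseline is one over the number of candidate sources shown per item. The names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RetrievalReport:
    accuracy: float         # exact-match accuracy over all items
    parse_rate: float       # share of outputs that yielded a scorable answer
    random_baseline: float  # expected accuracy of uniform guessing


def score_retrieval(
    predictions: Sequence[Optional[str]],  # None marks an unparseable output
    gold: Sequence[str],
    n_candidates: int,
) -> RetrievalReport:
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal in length")
    if n_candidates < 1:
        raise ValueError("n_candidates must be at least 1")
    total = len(gold)
    parsed = sum(p is not None for p in predictions)
    correct = sum(p == g for p, g in zip(predictions, gold))
    return RetrievalReport(correct / total, parsed / total, 1 / n_candidates)


# Illustrative data only: 4 items, 4 candidate sources each.
report = score_retrieval(["s1", None, "s3", "s2"], ["s1", "s2", "s3", "s4"], 4)
print(report)  # accuracy=0.5, parse_rate=0.75, random_baseline=0.25
```

Keeping parse rate separate shows whether a low score comes from wrong attributions or from outputs that never produced an answer at all.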

Function-calling. BFCL v4 standard scoring, reported as overall plus sub-scores (non-live, live, multi-turn). Multi-turn is reported explicitly because it is the most demanding and most discriminating sub-category.

Bilingual parity. For bilingual-paired tracks, EN and FR accuracy are reported separately along with the parity ratio (FR/EN). Parity is never averaged away into a single number.
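
The parity ratio is simple arithmetic, shown here with hypothetical accuracies. The function name and the figures are illustrative assumptions, not reported results.

```python
def parity_report(en_accuracy: float, fr_accuracy: float) -> dict:
    """Report EN and FR accuracy separately, plus the FR/EN parity ratio."""
    if en_accuracy <= 0:
        raise ValueError("EN accuracy must be positive to compute a parity ratio")
    return {
        "en": en_accuracy,
        "fr": fr_accuracy,
        "parity_fr_over_en": fr_accuracy / en_accuracy,
    }


# Hypothetical figures: EN 0.80, FR 0.72 gives a parity ratio of about 0.90.
print(parity_report(0.80, 0.72))
```

With these example figures, a blended average of 0.76 would hide the fact that French accuracy trails English by roughly ten percent, which is why the three values are reported side by side.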

The reporting standard

Every published result states:

The exact model checkpoint and its base
Per-benchmark and per-track scores (no blended headline)
Few-shot counts and decoding
Random baselines for retrieval
Bilingual parity ratios where applicable
Validation status for any preview-grade items
Gains and regressions with equal prominence

A result that cannot be reproduced from the published configuration is not published. That is the rule.
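
The rule above can be enforced mechanically with a pre-publication check on each result record. This is a hypothetical sketch: the field names and record layout are assumptions chosen to mirror the list above, not a published SimpleDirect schema.

```python
# Fields every published result must carry (hypothetical names).
REQUIRED = ("checkpoint", "base_checkpoint", "scores", "few_shot", "decoding")


def missing_fields(result: dict) -> list:
    """Return required fields that are absent or empty in a result record."""
    missing = [name for name in REQUIRED if not result.get(name)]
    # Retrieval scores must be accompanied by their random baseline.
    scores = result.get("scores") or {}
    if "retrieval" in scores and "random_baseline" not in result:
        missing.append("random_baseline")
    return missing


def publishable(result: dict) -> bool:
    """A record is publishable only when nothing required is missing."""
    return not missing_fields(result)
```

A record that fails `publishable` would be held back rather than published with gaps.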

The acceptance flow

A model reaches benchmarking only after passing the build-stage verification gates: weight-movement audit, key integrity, generation smoke test, and multimodal check. It is then served under the configuration above and evaluated across the full suite, and the complete result set, including any regressions, is compiled into the benchmark report.

No subset is cherry-picked. The suite is run in full so that specialization tradeoffs are visible.

What this means for you

If you are a procurement officer: you can specify this methodology as the evaluation standard in your RFP. The protocol is vendor-neutral and citable. When we report model scores, you can verify them against the published items, code, and configuration.

If you are an AI researcher or developer: use this protocol as a template for evaluating your own Canadian-context models. Report your scores under the same conditions to make comparisons fair.

If you are a buyer or commissioner of regulated AI work: a vendor that will not publish its full evaluation configuration is publishing marketing, not measurement.

Why this much rigor

The Canadian regulated AI market has been small enough that vendors could ship loose numbers without scrutiny. As Canadian AI procurement scales — federal, provincial, professional services, regulated enterprise — that era is ending. Procurement officers and RFP authors will increasingly demand reproducible measurement standards. We are publishing this protocol because we think the standards should be public and vendor-neutral, including against ourselves.

We will be judged against this. We expect to be.

Cite this

SimpleDirect® (Alpine Pacific Trading Inc.), "Model Benchmarking Methodology (v1.0)," June 2026.

Canadian AI Evaluation Methodology v1.0 — the underlying standard for what to measure on Canadian regulated work, and why.
CBLRE Evaluation Suite (Preview) — the public test set that operationalizes the Canadian-legal tracks in this protocol.

SimpleDirect®, operating as Alpine Pacific Trading Inc., is a Toronto-based team building open-weight, bilingual Canadian-context AI models you can download, run, and own.