Why this document exists
Most AI model launches publish numbers. Few publish the protocol that produced them.
This document specifies the full benchmarking protocol applied to SimpleDirect models — the complete evaluation suite, harnesses, serving configuration, scoring rules, and reporting standards. It is the measurement companion to the Canadian AI Evaluation Methodology, which defines the Canadian-legal tracks specifically.
Its purpose is simple: make every published number reproducible and every comparison fair.
If you cannot regenerate our scores from our published configuration, we did not benchmark; we marketed. We have tried to make sure you can regenerate them.
Five principles
Identical conditions for comparison. A fine-tuned model and its base are evaluated with the same prompts, few-shot counts, scoring code, decoding settings, and serving stack. Differences in results are then attributable to the model, not the test setup.
Reproducibility. Every score is regenerable from published items/tasks, the harness version, and fixed decoding (greedy, temperature 0, fixed seed).
Robust scoring over fragile parsing. Loglikelihood multiple-choice scoring is preferred. Where generation scoring is required, the extractor takes the model's final committed answer and is validated for stability across token budgets.
Per-capability reporting, no blended score. Results are reported per benchmark and per track. Regressions are reported with the same prominence as gains.
Leak-proofing for retrieval. Retrieval evaluation uses date-held-out sources constructed so answers cannot be string-matched from the prompt.
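The leak-proofing principle lends itself to a mechanical check. The sketch below is illustrative only: it assumes a hypothetical item format with `prompt` and `answer` fields and flags any item whose gold answer can be found in the prompt after case and whitespace normalization. The field names and sample items are invented for the example, and the construction of the actual held-out set is not described here.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting cannot hide a leak."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def leaked_items(items: list[dict]) -> list[int]:
    """Return indices of items whose gold answer can be string-matched from the prompt."""
    leaked = []
    for index, item in enumerate(items):
        if normalize(item["answer"]) in normalize(item["prompt"]):
            leaked.append(index)
    return leaked

# Hypothetical items, invented for illustration.
items = [
    {"prompt": "Which source supports the claim? (A) Doc 17 (B) Doc 42",
     "answer": "Source-9931"},
    {"prompt": "According to  Source-1204, which source is cited?",
     "answer": "source-1204"},
]

print(leaked_items(items))  # only the second item leaks its answer
```

A check like this catches only literal leaks. Paraphrased or partially quoted answers would need a stricter test, which is one reason date-held-out sources are used in addition.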
The evaluation suite
Every SimpleDirect model is measured across seven families, chosen so that specialization gains and any capability regressions are both visible.
| Family | What it covers | Harness / source |
|---|---|---|
| General capability | MMLU, ARC-Challenge, HellaSwag, TruthfulQA (MC1/MC2), GSM8K, BBH | lm-evaluation-harness |
| French general | Belebele FR, MGSM FR, ARC FR, HellaSwag FR | lm-evaluation-harness |
| Legal knowledge | MMLU professional / international law / jurisprudence; Global-MMLU FR equivalents | lm-evaluation-harness |
| Instruction-following | IFEval (prompt-strict) | lm-evaluation-harness |
| Canadian legal (CBLRE) | 6 tracks: common law, Quebec civil law, Charter, privacy, citation, safety | SimpleDirect CBLRE + scorer |
| Retrieval (RAG) | Leak-proof held-out Canadian source attribution | SimpleDirect held-out set |
| Function-calling | BFCL v4 (single- and multi-turn) | Berkeley Function-Calling Leaderboard |
The suite is run in full on every model. No subset is cherry-picked. Specialization tradeoffs — including any regressions from base — are visible because the full set is run and reported.
Serving configuration
The model under test and its base are served under identical settings:
| Setting | Value |
|---|---|
| Engine | vLLM (OpenAI-compatible endpoint) |
| Precision | bf16 |
| Decoding | Greedy, temperature 0, fixed seed |
| Context length | lm-eval tasks: 4,096; function-calling: 32,768 |
| Flags | trust-remote-code; deterministic sampler |
| Isolation | Each model served from its own pinned GPU; no cross-contention during parallel runs |
These settings are published not because they are exotic, but because they are the exact configuration under which our numbers were produced. Anyone with access to the model, the items, and our scoring code should be able to reproduce our scores to within the small run-to-run variation that remains under greedy decoding (for example, floating-point differences across hardware or batch sizes).
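As a rough illustration of how the table could translate into a launch command, the sketch below builds a `vllm serve` argument list for each model and checks that the two differ only in model name and port. The model identifiers, ports, and seed value are placeholders, and pinning a GPU through `CUDA_VISIBLE_DEVICES` is an assumption about how isolation might be achieved. The table's "deterministic sampler" setting is not mapped to a flag because the source does not name one.

```python
import os

def serve_command(model: str, max_model_len: int, port: int, seed: int = 0) -> list[str]:
    """Build a vLLM launch command with bf16 weights, fixed context length, and fixed seed."""
    return [
        "vllm", "serve", model,
        "--dtype", "bfloat16",
        "--max-model-len", str(max_model_len),
        "--seed", str(seed),          # placeholder value; the protocol only says "fixed seed"
        "--trust-remote-code",
        "--port", str(port),
    ]

def serve_env(gpu_index: int) -> dict[str, str]:
    """Environment that exposes a single GPU to the server process (assumed isolation method)."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env

def settings_only(cmd: list[str]) -> list[str]:
    """Drop the model name and the port value so two commands can be compared."""
    return cmd[:2] + cmd[3:-1]

# Placeholder model identifiers, not real checkpoints.
tuned_cmd = serve_command("example-org/tuned-model", 4096, 8000)
base_cmd = serve_command("example-org/base-model", 4096, 8001)
function_calling_cmd = serve_command("example-org/tuned-model", 32768, 8002)

assert settings_only(tuned_cmd) == settings_only(base_cmd)
```

Each command could then be started with `subprocess.Popen(cmd, env=serve_env(gpu_index))`, one GPU per model.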
Few-shot settings
Few-shot counts are fixed per task and held identical between the model under test and its base:
| Task | Few-shot |
|---|---|
| MMLU / legal-MMLU / Global-MMLU FR | 5-shot |
| ARC-Challenge / ARC FR | 25-shot |
| HellaSwag / HellaSwag FR | 10-shot |
| TruthfulQA (MC1/MC2) | 0-shot |
| GSM8K | 5-shot |
| BBH | 3-shot |
| Belebele FR | 5-shot |
| MGSM FR | 8-shot |
| IFEval | 0-shot |
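One way to keep the counts identical between runs is to hold them in a single mapping and validate every run manifest against it. The sketch below does this under the assumption that a run is described by a plain dictionary of task label to few-shot count; the labels are the ones from the table above, not harness task identifiers.

```python
# Few-shot counts copied from the table above, keyed by the table's task labels.
FEWSHOT = {
    "MMLU": 5, "legal-MMLU": 5, "Global-MMLU FR": 5,
    "ARC-Challenge": 25, "ARC FR": 25,
    "HellaSwag": 10, "HellaSwag FR": 10,
    "TruthfulQA MC1": 0, "TruthfulQA MC2": 0,
    "GSM8K": 5, "BBH": 3, "Belebele FR": 5, "MGSM FR": 8, "IFEval": 0,
}

def fewshot_mismatches(run: dict[str, int]) -> dict[str, tuple]:
    """Return {task: (expected, actual)} for every task that is missing or off-protocol."""
    return {
        task: (expected, run.get(task))
        for task, expected in FEWSHOT.items()
        if run.get(task) != expected
    }

# Hypothetical run manifests: the base follows the protocol, the tuned run has one error.
base_run = dict(FEWSHOT)
tuned_run = {**FEWSHOT, "ARC-Challenge": 0}

print(fewshot_mismatches(base_run))
print(fewshot_mismatches(tuned_run))
```

A comparison would proceed only when both manifests return an empty mapping.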
Scoring rules
Multiple-choice. Loglikelihood scoring is used wherever the harness supports it: the option to which the model assigns the highest log-probability is selected directly, so no output parsing is involved. Where generation scoring is unavoidable, the extractor takes the final committed answer and is validated to produce stable scores across token budgets. A scorer whose output changes with response length is treated as a defect.
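Both scoring modes can be sketched in a few lines. The example below is a simplified illustration, not the production scorer: the answer pattern, the sample responses, and the log-probabilities are all assumptions made for the example.

```python
import re
from typing import Optional

def pick_by_loglikelihood(option_logprobs: dict[str, float]) -> str:
    """Select the option with the highest log-probability; no text parsing involved."""
    return max(option_logprobs, key=option_logprobs.get)

# Assumed answer format: "answer is C", "Answer: (B)", and similar.
ANSWER_RE = re.compile(r"[Aa]nswer(?:\s+is)?\s*:?\s*\(?([A-D])\b")

def final_answer(response: str) -> Optional[str]:
    """Return the last committed answer letter, or None if nothing can be extracted."""
    matches = ANSWER_RE.findall(response)
    return matches[-1] if matches else None

def stable_across_budgets(responses_by_budget: dict[int, str]) -> bool:
    """True when every token budget yields the same, non-empty extracted answer."""
    answers = {final_answer(text) for text in responses_by_budget.values()}
    return len(answers) == 1 and None not in answers

# Hypothetical values for illustration.
logprobs = {"A": -2.3, "B": -0.7, "C": -1.9, "D": -4.1}
revised = "I think the answer is A. Wait, on reflection the answer is C."
truncated = "I think the answer is A. Wait, on"

print(pick_by_loglikelihood(logprobs))
print(final_answer(revised))
print(stable_across_budgets({64: truncated, 512: revised}))
```

The last call shows the failure the protocol guards against: a response cut off at a small budget yields a different extracted answer than the complete response, so that item would be flagged rather than scored.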
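Citation. Citation-track scoring checks that the model produced a well-formed legal citation (pattern) and that it identifies the same authority as the reference (identity). Because the model must produce the citation rather than pick a letter, the track measures citation production, not multiple-choice luck.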
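A minimal sketch of pattern-plus-identity scoring follows. It assumes, for illustration, that only neutral citations of the form "year, court code, decision number" are in scope; the regular expression and the result keys are hypothetical, and the real scorer may handle reporter citations and other forms not covered here.

```python
import re
from typing import Optional

# Assumed pattern for a neutral citation such as "2019 SCC 65".
NEUTRAL_CITATION = re.compile(r"\b(\d{4})\s+([A-Z]{2,8})\s+(\d+)\b")

def parse_citation(text: str) -> Optional[tuple[int, str, int]]:
    """Extract (year, court, number) from the first neutral citation in the text."""
    match = NEUTRAL_CITATION.search(text)
    if match is None:
        return None
    return (int(match.group(1)), match.group(2), int(match.group(3)))

def score_citation(output: str, reference: str) -> dict[str, bool]:
    """Pattern check (well_formed) plus identity check against the reference (correct)."""
    predicted = parse_citation(output)
    expected = parse_citation(reference)
    well_formed = predicted is not None
    return {"well_formed": well_formed, "correct": well_formed and predicted == expected}

reference = "2019 SCC 65"
good = "See Canada (Minister of Citizenship and Immigration) v Vavilov, 2019 SCC 65 at para 23."
wrong_number = "See Vavilov, 2019 SCC 66."
letter_only = "B"

print(score_citation(good, reference))
print(score_citation(wrong_number, reference))
print(score_citation(letter_only, reference))
```

Separating `well_formed` from `correct` distinguishes an output that is a citation but the wrong one from an output that is not a citation at all.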
Retrieval. Exact-match on source identity over a leak-proof held-out set; the random baseline is reported alongside the score. Parse-rate (share of outputs yielding a scorable answer) is reported separately from accuracy, since they are distinct quality signals.
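The three retrieval figures can be computed side by side, as in the sketch below. Two assumptions are made for the example: an unparseable output counts as incorrect, so accuracy is taken over all items, and the random baseline is the mean of one over the number of candidate sources per item. The sample data is invented.

```python
from typing import Optional

def retrieval_report(
    predictions: list[Optional[str]],
    gold: list[str],
    candidates_per_item: list[int],
) -> dict[str, float]:
    """Exact-match accuracy, parse rate, and random baseline for a retrieval run."""
    if not (len(predictions) == len(gold) == len(candidates_per_item)):
        raise ValueError("all inputs must have one entry per item")
    n = len(gold)
    parsed = sum(p is not None for p in predictions)          # outputs yielding a scorable answer
    correct = sum(p == g for p, g in zip(predictions, gold))  # None never equals a gold id
    baseline = sum(1 / k for k in candidates_per_item) / n    # expected accuracy of a uniform guess
    return {"accuracy": correct / n, "parse_rate": parsed / n, "random_baseline": baseline}

# Hypothetical run: four items, four candidate sources each, one unparseable output.
predictions = ["S3", None, "S1", "S2"]
gold = ["S3", "S2", "S4", "S2"]
report = retrieval_report(predictions, gold, [4, 4, 4, 4])
print(report)
```

Reporting the baseline next to the score makes it clear how far the result sits above chance, and the separate parse rate shows whether errors come from wrong answers or from unusable outputs.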
Function-calling. BFCL v4 standard scoring, reported as overall plus sub-scores (non-live, live, multi-turn). Multi-turn is reported explicitly because it is the most demanding and most discriminating sub-category.
Bilingual parity. For bilingual-paired tracks, EN and FR accuracy are reported separately along with the parity ratio (FR/EN). Parity is never averaged away into a single number.
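A short worked example shows why parity is reported rather than averaged. The accuracies below are invented for illustration: both tracks have the same mean accuracy, yet their parity ratios differ sharply.

```python
def parity_report(en_accuracy: float, fr_accuracy: float) -> dict[str, float]:
    """Report EN and FR accuracy separately, plus the FR/EN parity ratio."""
    if en_accuracy <= 0:
        raise ValueError("EN accuracy must be positive to compute a ratio")
    return {
        "en": en_accuracy,
        "fr": fr_accuracy,
        "parity_fr_over_en": fr_accuracy / en_accuracy,
    }

# Hypothetical tracks with identical mean accuracy (0.70).
uneven = parity_report(en_accuracy=0.80, fr_accuracy=0.60)
even = parity_report(en_accuracy=0.70, fr_accuracy=0.70)

print(uneven)
print(even)
```

A single blended figure would present these two tracks as equivalent, while the parity ratio (roughly 0.75 versus 1.0) exposes the French shortfall in the first.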
The reporting standard
Every published result states:
- The exact model checkpoint and its base
- Per-benchmark and per-track scores (no blended headline)
- Few-shot counts and decoding
- Random baselines for retrieval
- Bilingual parity ratios where applicable
- Validation status for any preview-grade items
- Gains and regressions with equal prominence
A result that cannot be reproduced from the published configuration is not published. That is the rule.
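The "no blended headline" and "equal prominence" requirements can be enforced structurally by emitting one row per benchmark, with the delta included whether it is positive or negative. The sketch below assumes scores are held in plain dictionaries keyed by benchmark name; the task names and scores are placeholders, not published results.

```python
def comparison_rows(
    tuned: dict[str, float], base: dict[str, float]
) -> list[tuple[str, float, float, float]]:
    """One (benchmark, base, tuned, delta) row per benchmark; nothing filtered or averaged."""
    if tuned.keys() != base.keys():
        raise ValueError("tuned and base must cover exactly the same benchmarks")
    return [
        (name, base[name], tuned[name], round(tuned[name] - base[name], 4))
        for name in sorted(tuned)
    ]

# Placeholder scores, invented for illustration.
base_scores = {"task_a": 0.70, "task_b": 0.60}
tuned_scores = {"task_a": 0.68, "task_b": 0.65}

rows = comparison_rows(tuned_scores, base_scores)
for name, base_score, tuned_score, delta in rows:
    print(f"{name}: base={base_score:.2f} tuned={tuned_score:.2f} delta={delta:+.2f}")
```

Refusing to build the table when the two score sets cover different benchmarks prevents a regression from disappearing through omission.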
The acceptance flow
A model reaches benchmarking only after passing the build-stage verification gates: weight-movement audit, key integrity, generation smoke test, multimodal check. It is then served under the configuration above, evaluated across the full suite, and the complete result set — including any regressions — is compiled into the benchmark report.
No subset is cherry-picked. The suite is run in full so that specialization tradeoffs are visible.
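The gating step can be pictured as a simple all-or-nothing check. In the sketch below the gate names are taken from the text above, while the result format (a mapping from gate name to pass or fail) is an assumption made for the example.

```python
# Build-stage verification gates named in the acceptance flow.
GATES = (
    "weight-movement audit",
    "key integrity",
    "generation smoke test",
    "multimodal check",
)

def ready_for_benchmarking(results: dict[str, bool]) -> tuple[bool, list[str]]:
    """Return (ready, failed_gates); a gate with no recorded result counts as failed."""
    failed = [gate for gate in GATES if not results.get(gate, False)]
    return (not failed, failed)

# Hypothetical gate results.
all_passed = {gate: True for gate in GATES}
one_missing = {gate: True for gate in GATES if gate != "key integrity"}

print(ready_for_benchmarking(all_passed))
print(ready_for_benchmarking(one_missing))
```

Treating a missing result as a failure means a model cannot reach the benchmark stage because a check was skipped.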
What this means for you
If you are a procurement officer: you can specify this methodology as the evaluation standard in your RFP. The protocol is vendor-neutral and citable. When we report model scores, you can verify them against the published items, code, and configuration.
If you are an AI researcher or developer: use this protocol as a template for evaluating your own Canadian-context models. Report your scores under the same conditions to make comparisons fair.
If you are a buyer or commissioner of regulated AI work: a vendor that will not publish its full evaluation configuration is publishing marketing, not measurement.
Why this much rigor
The Canadian regulated AI market has been small enough that vendors could ship loose numbers without scrutiny. As Canadian AI procurement scales — federal, provincial, professional services, regulated enterprise — that era is ending. Procurement officers and RFP authors will increasingly demand reproducible measurement standards. We are publishing this protocol because we think the standards should be public and vendor-neutral, including against ourselves.
We will be judged against this. We expect to be.
Cite this
SimpleDirect® (Alpine Pacific Trading Inc.), "Model Benchmarking Methodology (v1.0)," June 2026.
Read more
- Canadian AI Evaluation Methodology v1.0 — the underlying standard for what to measure on Canadian regulated work, and why.
- CBLRE Evaluation Suite (Preview) — the public test set that operationalizes the Canadian-legal tracks in this protocol.