# Model Cards
Bilingual model cards for Arabic/English workloads with reproducible benchmark and cost methodology.
## What this page is for
Use this page to publish DCP model cards in a comparison-ready format. Each card should explain quality, latency, cost, and VRAM requirements for Arabic-first and bilingual workloads.
## Source of truth
- Registry feed: `GET /api/models`
- Benchmark feed: `GET /api/models/benchmarks`
- Bilingual cards feed: `GET /api/models/cards`
- Benchmark suite label: `saudi-arabic-v1`
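The three feeds above can be pulled with any HTTP client. As a minimal sketch, assuming `https://dcp.sa` is the public base URL and that each feed returns JSON (confirm both against your deployment):

```python
import json
from urllib.request import urlopen

BASE = "https://dcp.sa"  # assumed public base URL; confirm for your deployment

# Paths taken from the source-of-truth list above.
FEEDS = {
    "registry": "/api/models",
    "benchmarks": "/api/models/benchmarks",
    "cards": "/api/models/cards",
}

def feed_url(name: str) -> str:
    """Resolve a feed name to a full URL."""
    return BASE + FEEDS[name]

def fetch(name: str):
    """Download one feed and parse it as JSON (response format is assumed)."""
    with urlopen(feed_url(name)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    cards = fetch("cards")
    print(len(cards))
```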
## Public comparison matrix (Section 3B)
Rows are ordered by P95 latency; the best value in each column is shown in **bold**.

| Model | Arabic MMLU (%) | ArabicaQA (%) | P95 latency (ms) | Cost / 1K tokens (SAR) | VRAM (GB) | Cold start (ms) |
|---|---|---|---|---|---|---|
| Phi-3 Mini | 42.9 | 51.2 | **650** | **0.62** | **6** | **4100** |
| Mistral 7B Instruct | 54.2 | 62.4 | 860 | 0.95 | 16 | 6800 |
| Qwen2 7B Instruct | 61.4 | 69.8 | 890 | 1.02 | 16 | 7200 |
| Falcon H1 7B Instruct | 64.8 | 72.1 | 930 | 1.18 | 24 | 8700 |
| Llama 3 8B Instruct | 58.7 | 66.1 | 960 | 1.08 | 16 | 7500 |
| ALLaM 7B Instruct | 67.2 | 74.8 | 990 | 1.32 | 24 | 9100 |
| DeepSeek R1 7B | 63.1 | 71.5 | 1100 | 1.24 | 16 | 8900 |
| JAIS 13B Chat | **70.4** | **78.6** | 1260 | 1.54 | 24 | 11600 |
## How to interpret the metrics
- `Arabic MMLU` and `ArabicaQA`: higher is better for Arabic reasoning and Q&A.
- `P95 latency`: lower is better for interactive chat UX.
- `Cost / 1K tokens`: lower is better for sustained high-volume workloads.
- `VRAM`: minimum practical GPU memory for stable deployment.
- `Cold start`: startup delay before the first token when the container is not warm.
## Recommended reading order for buyers
1) Filter by minimum Arabic quality targets (`Arabic MMLU` and `ArabicaQA`).
2) Remove models that exceed the latency SLO at `P95`.
3) Compare the remaining models by `Cost / 1K tokens`.
4) Confirm the selected model fits the provider's GPU VRAM and prewarm policy.
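The reading order above can be sketched as a shortlist function. Field names and the sample rows are illustrative (values copied from the comparison matrix), not the feed schema:

```python
# Sample rows copied from the public comparison matrix; keys are illustrative.
MODELS = [
    {"name": "JAIS 13B Chat",     "mmlu": 70.4, "arabicaqa": 78.6, "p95_ms": 1260, "sar_per_1k": 1.54, "vram_gb": 24},
    {"name": "ALLaM 7B Instruct", "mmlu": 67.2, "arabicaqa": 74.8, "p95_ms": 990,  "sar_per_1k": 1.32, "vram_gb": 24},
    {"name": "Qwen2 7B Instruct", "mmlu": 61.4, "arabicaqa": 69.8, "p95_ms": 890,  "sar_per_1k": 1.02, "vram_gb": 16},
    {"name": "Phi-3 Mini",        "mmlu": 42.9, "arabicaqa": 51.2, "p95_ms": 650,  "sar_per_1k": 0.62, "vram_gb": 6},
]

def shortlist(models, min_mmlu, min_arabicaqa, p95_slo_ms, max_vram_gb):
    """Apply quality floors, latency SLO, and VRAM fit, then sort by cost."""
    kept = [
        m for m in models
        if m["mmlu"] >= min_mmlu
        and m["arabicaqa"] >= min_arabicaqa
        and m["p95_ms"] <= p95_slo_ms
        and m["vram_gb"] <= max_vram_gb
    ]
    return sorted(kept, key=lambda m: m["sar_per_1k"])

# Example targets (hypothetical): quality floor 60/68, 1000 ms P95 SLO, 24 GB GPU.
picks = shortlist(MODELS, min_mmlu=60, min_arabicaqa=68, p95_slo_ms=1000, max_vram_gb=24)
```

With these example targets, JAIS 13B Chat is excluded by the P95 SLO and Phi-3 Mini by the quality floor, leaving Qwen2 7B Instruct and ALLaM 7B Instruct ordered by cost.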
## Reproducible publication workflow
1) Capture a fresh benchmark feed:

```shell
curl -s https://dcp.sa/api/dc1/models/benchmarks
```

2) Capture the bilingual summaries:

```shell
curl -s https://dcp.sa/api/dc1/models/cards
```

3) Verify required fields before publishing.
Each published card must include:
- `model_id`, `display_name`, `family`
- `metrics.latency_ms.p50/p95/p99`
- `metrics.arabic_quality.arabic_mmlu_score`
- `metrics.arabic_quality.arabicaqa_score`
- `metrics.cost_per_1k_tokens_halala` and `metrics.cost_per_1k_tokens_sar`
- `metrics.vram_required_gb`
- `metrics.cold_start_ms`
- `summary.en` and `summary.ar`
- `benchmark_suite` and `measured_at`
- `tier`, `launch_priority`, and `prewarm_class` for launch ordering
- `readiness.launch_ready` and readiness targets for Section 7 go/no-go checks
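The required-field check above can be automated with a dotted-path walk. The card is assumed to be a nested JSON object whose keys mirror those paths (a sketch, not the feed's authoritative schema):

```python
# Dotted paths copied from the required-fields list above.
REQUIRED_PATHS = [
    "model_id", "display_name", "family",
    "metrics.latency_ms.p50", "metrics.latency_ms.p95", "metrics.latency_ms.p99",
    "metrics.arabic_quality.arabic_mmlu_score",
    "metrics.arabic_quality.arabicaqa_score",
    "metrics.cost_per_1k_tokens_halala", "metrics.cost_per_1k_tokens_sar",
    "metrics.vram_required_gb", "metrics.cold_start_ms",
    "summary.en", "summary.ar",
    "benchmark_suite", "measured_at",
    "tier", "launch_priority", "prewarm_class",
    "readiness.launch_ready",
]

def missing_fields(card: dict) -> list[str]:
    """Return the dotted paths absent from a card; empty means publishable."""
    missing = []
    for path in REQUIRED_PATHS:
        node = card
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(path)
                break
            node = node[key]
    return missing
```

A card with any entry in `missing_fields(card)` should be held back from the release window until the feed is complete.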
## Card template (EN/AR)
### <Display Name> (`<model_id>`)
- **Arabic quality**: MMLU <x>% · ArabicaQA <y>%
- **Latency**: P50 <x> ms · P95 <y> ms · P99 <z> ms
- **Cost**: <halala> halala / 1K tokens (<sar> SAR)
- **Deployment**: VRAM <x> GB · Cold start <y> ms
- **Best use**: <one sentence>
- **Summary (EN)**: <summary.en>
- **الملخص (AR)**: <summary.ar>

## Publication policy notes
- Publish EN and AR cards in the same release window.
- Keep `benchmark_suite` visible so readers know which benchmark generation produced the numbers.
- If methodology changes, publish a new suite label (for example `saudi-arabic-v2`) instead of mixing metrics across suites.
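The no-mixing rule can be enforced with a one-line check before publishing. The sample cards here are hypothetical; only the `benchmark_suite` key is taken from the required-fields list:

```python
def single_suite(cards: list[dict]) -> bool:
    """True when every card in the release carries the same suite label."""
    return len({c.get("benchmark_suite") for c in cards}) == 1

# Hypothetical release window: both cards on the same suite, so it passes.
release = [
    {"model_id": "model-a", "benchmark_suite": "saudi-arabic-v1"},
    {"model_id": "model-b", "benchmark_suite": "saudi-arabic-v1"},
]
ok = single_suite(release)
```

A release mixing `saudi-arabic-v1` and `saudi-arabic-v2` cards would fail this check and should be split into separate publications.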