# Model Cards
Bilingual model cards for Arabic/English workloads with reproducible benchmark and cost methodology.
## What this page is for
Use this page to publish DCP model cards in a comparison-ready format. Each card should explain quality, latency, cost, and VRAM requirements for Arabic-first and bilingual workloads.
## Source of truth
- Registry feed: `GET /api/models`
- Benchmark feed: `GET /api/models/benchmarks`
- Bilingual cards feed: `GET /api/models/cards`
- Benchmark suite label: `saudi-arabic-v1`
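The three feeds above can be pulled with any HTTP client. As a minimal sketch, assuming `https://dcp.sa` is the public base URL and that each feed returns JSON (confirm both against your deployment):

```python
import json
from urllib.request import urlopen

BASE = "https://dcp.sa"  # assumed public base URL; confirm for your deployment

# Paths taken from the source-of-truth list above.
FEEDS = {
    "registry": "/api/models",
    "benchmarks": "/api/models/benchmarks",
    "cards": "/api/models/cards",
}

def feed_url(name: str) -> str:
    """Resolve a feed name to a full URL."""
    return BASE + FEEDS[name]

def fetch(name: str):
    """Download one feed and parse it as JSON (response format is assumed)."""
    with urlopen(feed_url(name)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    cards = fetch("cards")
    print(len(cards))
```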
## Public comparison matrix (Section 3B)
Rows are ordered by P95 latency; the best value in each column is shown in **bold**.

| Model | Arabic MMLU (%) | ArabicaQA (%) | P95 latency (ms) | Cost / 1K tokens (SAR) | VRAM (GB) | Cold start (ms) |
|---|---|---|---|---|---|---|
| Phi-3 Mini | 42.9 | 51.2 | **650** | **0.62** | **6** | **4100** |
| Mistral 7B Instruct | 54.2 | 62.4 | 860 | 0.95 | 16 | 6800 |
| Qwen2 7B Instruct | 61.4 | 69.8 | 890 | 1.02 | 16 | 7200 |
| Falcon H1 7B Instruct | 64.8 | 72.1 | 930 | 1.18 | 24 | 8700 |
| Llama 3 8B Instruct | 58.7 | 66.1 | 960 | 1.08 | 16 | 7500 |
| ALLaM 7B Instruct | 67.2 | 74.8 | 990 | 1.32 | 24 | 9100 |
| DeepSeek R1 7B | 63.1 | 71.5 | 1100 | 1.24 | 16 | 8900 |
| JAIS 13B Chat | **70.4** | **78.6** | 1260 | 1.54 | 24 | 11600 |
## How to interpret the metrics
- `Arabic MMLU` and `ArabicaQA`: higher is better for Arabic reasoning and Q&A.
- `P95 latency`: lower is better for interactive chat UX.
- `Cost / 1K tokens`: lower is better for sustained high-volume workloads.
- `VRAM`: minimum practical GPU memory for stable deployment.
- `Cold start`: startup delay before the first token when the container is not warm.
## Recommended reading order for buyers
1) Filter by minimum Arabic quality targets (`Arabic MMLU` and `ArabicaQA`).
2) Remove models that exceed the latency SLO at `P95`.
3) Compare the remaining models by `Cost / 1K tokens`.
4) Confirm the selected model fits the provider's GPU VRAM and prewarm policy.
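The reading order above can be sketched as a shortlist function. Field names and the sample rows are illustrative (values copied from the comparison matrix), not the feed schema:

```python
# Sample rows copied from the public comparison matrix; keys are illustrative.
MODELS = [
    {"name": "JAIS 13B Chat",     "mmlu": 70.4, "arabicaqa": 78.6, "p95_ms": 1260, "sar_per_1k": 1.54, "vram_gb": 24},
    {"name": "ALLaM 7B Instruct", "mmlu": 67.2, "arabicaqa": 74.8, "p95_ms": 990,  "sar_per_1k": 1.32, "vram_gb": 24},
    {"name": "Qwen2 7B Instruct", "mmlu": 61.4, "arabicaqa": 69.8, "p95_ms": 890,  "sar_per_1k": 1.02, "vram_gb": 16},
    {"name": "Phi-3 Mini",        "mmlu": 42.9, "arabicaqa": 51.2, "p95_ms": 650,  "sar_per_1k": 0.62, "vram_gb": 6},
]

def shortlist(models, min_mmlu, min_arabicaqa, p95_slo_ms, max_vram_gb):
    """Apply quality floors, latency SLO, and VRAM fit, then sort by cost."""
    kept = [
        m for m in models
        if m["mmlu"] >= min_mmlu
        and m["arabicaqa"] >= min_arabicaqa
        and m["p95_ms"] <= p95_slo_ms
        and m["vram_gb"] <= max_vram_gb
    ]
    return sorted(kept, key=lambda m: m["sar_per_1k"])

# Example targets (hypothetical): quality floor 60/68, 1000 ms P95 SLO, 24 GB GPU.
picks = shortlist(MODELS, min_mmlu=60, min_arabicaqa=68, p95_slo_ms=1000, max_vram_gb=24)
```

With these example targets, JAIS 13B Chat is excluded by the P95 SLO and Phi-3 Mini by the quality floor, leaving Qwen2 7B Instruct and ALLaM 7B Instruct ordered by cost.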
## Reproducible publication workflow
1) Capture a fresh benchmark feed:

```shell
curl -s https://dcp.sa/api/dc1/models/benchmarks
```

2) Capture the bilingual summaries:

```shell
curl -s https://dcp.sa/api/dc1/models/cards
```

3) Verify required fields before publishing.
Each published card must include:
- `model_id`, `display_name`, `family`
- `metrics.latency_ms.p50/p95/p99`
- `metrics.arabic_quality.arabic_mmlu_score`
- `metrics.arabic_quality.arabicaqa_score`
- `metrics.cost_per_1k_tokens_halala` and `metrics.cost_per_1k_tokens_sar`
- `metrics.vram_required_gb`
- `metrics.cold_start_ms`
- `summary.en` and `summary.ar`
- `benchmark_suite` and `measured_at`
- `tier`, `launch_priority`, and `prewarm_class` for launch ordering
- `readiness.launch_ready` and readiness targets for Section 7 go/no-go checks
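The required-field check above can be automated with a dotted-path walk. The card is assumed to be a nested JSON object whose keys mirror those paths (a sketch, not the feed's authoritative schema):

```python
# Dotted paths copied from the required-fields list above.
REQUIRED_PATHS = [
    "model_id", "display_name", "family",
    "metrics.latency_ms.p50", "metrics.latency_ms.p95", "metrics.latency_ms.p99",
    "metrics.arabic_quality.arabic_mmlu_score",
    "metrics.arabic_quality.arabicaqa_score",
    "metrics.cost_per_1k_tokens_halala", "metrics.cost_per_1k_tokens_sar",
    "metrics.vram_required_gb", "metrics.cold_start_ms",
    "summary.en", "summary.ar",
    "benchmark_suite", "measured_at",
    "tier", "launch_priority", "prewarm_class",
    "readiness.launch_ready",
]

def missing_fields(card: dict) -> list[str]:
    """Return the dotted paths absent from a card; empty means publishable."""
    missing = []
    for path in REQUIRED_PATHS:
        node = card
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(path)
                break
            node = node[key]
    return missing
```

A card with any entry in `missing_fields(card)` should be held back from the release window until the feed is complete.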
## Card template (EN/AR)
### <Display Name> (`<model_id>`)
- **Arabic quality**: MMLU <x>% · ArabicaQA <y>%
- **Latency**: P50 <x> ms · P95 <y> ms · P99 <z> ms
- **Cost**: <halala> halala / 1K tokens (<sar> SAR)
- **Deployment**: VRAM <x> GB · Cold start <y> ms
- **Best use**: <one sentence>
- **Summary (EN)**: <summary.en>
- **الملخص (AR)**: <summary.ar>

## Publication policy notes
- Publish EN and AR cards in the same release window.
- Keep `benchmark_suite` visible so readers know which benchmark generation produced the numbers.
- If methodology changes, publish a new suite label (for example `saudi-arabic-v2`) instead of mixing metrics across suites.
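The no-mixing rule can be enforced with a one-line check before publishing. The sample cards here are hypothetical; only the `benchmark_suite` key is taken from the required-fields list:

```python
def single_suite(cards: list[dict]) -> bool:
    """True when every card in the release carries the same suite label."""
    return len({c.get("benchmark_suite") for c in cards}) == 1

# Hypothetical release window: both cards on the same suite, so it passes.
release = [
    {"model_id": "model-a", "benchmark_suite": "saudi-arabic-v1"},
    {"model_id": "model-b", "benchmark_suite": "saudi-arabic-v1"},
]
ok = single_suite(release)
```

A release mixing `saudi-arabic-v1` and `saudi-arabic-v2` cards would fail this check and should be split into separate publications.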