# Model Cards

Bilingual model cards for Arabic/English workloads with reproducible benchmark and cost methodology.

## What this page is for

Use this page to publish DCP model cards in a comparison-ready format. Each card should explain quality, latency, cost, and VRAM requirements for Arabic-first and bilingual workloads.

## Source of truth

  • Registry feed: `GET /api/models`
  • Benchmark feed: `GET /api/models/benchmarks`
  • Bilingual cards feed: `GET /api/models/cards`
  • Benchmark suite label: `saudi-arabic-v1`

## Public comparison matrix (Section 3B)

All eight models from the registry are listed. Best-in-class values in each column are shown in **bold**.

| Model | Arabic MMLU (%) | ArabicaQA (%) | P95 latency (ms) | Cost / 1K tokens (SAR) | VRAM (GB) | Cold start (ms) |
|---|---:|---:|---:|---:|---:|---:|
| Phi-3 Mini | 42.9 | 51.2 | **650** | **0.62** | **6** | **4100** |
| Mistral 7B Instruct | 54.2 | 62.4 | 860 | 0.95 | 16 | 6800 |
| Qwen2 7B Instruct | 61.4 | 69.8 | 890 | 1.02 | 16 | 7200 |
| Falcon H1 7B Instruct | 64.8 | 72.1 | 930 | 1.18 | 24 | 8700 |
| Llama 3 8B Instruct | 58.7 | 66.1 | 960 | 1.08 | 16 | 7500 |
| ALLaM 7B Instruct | 67.2 | 74.8 | 990 | 1.32 | 24 | 9100 |
| DeepSeek R1 7B | 63.1 | 71.5 | 1100 | 1.24 | 16 | 8900 |
| JAIS 13B Chat | **70.4** | **78.6** | 1260 | 1.54 | 24 | 11600 |

## How to interpret the metrics

  • `Arabic MMLU` and `ArabicaQA`: higher is better for Arabic reasoning and Q&A.
  • `P95 latency`: lower is better for interactive chat UX.
  • `Cost / 1K tokens`: lower is better for sustained high-volume workloads.
  • `VRAM`: minimum practical GPU memory for stable deployment.
  • `Cold start`: startup delay before the first token when the container is not warm.
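The cost column compounds quickly at sustained volume. A minimal sketch of the arithmetic, assuming a hypothetical monthly volume of 50M tokens (the helper name and the volume are illustrative, not part of the feed):

```python
def monthly_cost_sar(cost_per_1k_sar: float, tokens_per_month: int) -> float:
    """Projected monthly spend in SAR from a per-1K-token rate."""
    return cost_per_1k_sar * tokens_per_month / 1000

# 50M tokens/month at the table's cheapest vs. most expensive rate:
print(monthly_cost_sar(0.62, 50_000_000))  # ≈ 31,000 SAR/month
print(monthly_cost_sar(1.54, 50_000_000))  # ≈ 77,000 SAR/month
```

At that volume, the spread between the cheapest and most expensive model in the matrix is roughly 46,000 SAR per month, which is why the buyer workflow below compares cost only after quality and latency floors are met.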

## Recommended reading order for buyers

  1. Filter by minimum Arabic quality targets (`Arabic MMLU` and `ArabicaQA`).
  2. Remove models whose `P95` latency exceeds your SLO.
  3. Compare the remaining models by `Cost / 1K tokens`.
  4. Confirm the selected model fits the provider's GPU VRAM and prewarm policy.
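Steps 1–3 above can be sketched as a filter over the cards feed. Field names follow the required-fields checklist on this page; the thresholds and the sample `model_id` values are illustrative assumptions, with metric values taken from the comparison matrix:

```python
def shortlist(cards, min_mmlu, min_arabicaqa, max_p95_ms):
    """Apply quality and latency floors, then rank survivors by cost."""
    viable = [
        c for c in cards
        if c["metrics"]["arabic_quality"]["arabic_mmlu_score"] >= min_mmlu
        and c["metrics"]["arabic_quality"]["arabicaqa_score"] >= min_arabicaqa
        and c["metrics"]["latency_ms"]["p95"] <= max_p95_ms
    ]
    return sorted(viable, key=lambda c: c["metrics"]["cost_per_1k_tokens_sar"])

cards = [
    {"model_id": "allam-7b-instruct",
     "metrics": {"arabic_quality": {"arabic_mmlu_score": 67.2, "arabicaqa_score": 74.8},
                 "latency_ms": {"p95": 990}, "cost_per_1k_tokens_sar": 1.32}},
    {"model_id": "jais-13b-chat",
     "metrics": {"arabic_quality": {"arabic_mmlu_score": 70.4, "arabicaqa_score": 78.6},
                 "latency_ms": {"p95": 1260}, "cost_per_1k_tokens_sar": 1.54}},
    {"model_id": "qwen2-7b-instruct",
     "metrics": {"arabic_quality": {"arabic_mmlu_score": 61.4, "arabicaqa_score": 69.8},
                 "latency_ms": {"p95": 890}, "cost_per_1k_tokens_sar": 1.02}},
]

# Quality floor 60/65 with a 1000 ms P95 SLO: JAIS is cut on latency,
# and the survivors come back cheapest-first.
for c in shortlist(cards, min_mmlu=60, min_arabicaqa=65, max_p95_ms=1000):
    print(c["model_id"])  # prints qwen2-7b-instruct, then allam-7b-instruct
```

Step 4 (VRAM and prewarm fit) stays manual here because it depends on the provider's GPU inventory, not on the card alone.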

## Reproducible publication workflow

1) Capture fresh benchmark feed

```bash
curl -s https://dcp.sa/api/dc1/models/benchmarks
```

2) Capture bilingual summaries

```bash
curl -s https://dcp.sa/api/dc1/models/cards
```

3) Verify required fields before publishing

Each published card must include:

  • `model_id`, `display_name`, `family`
  • `metrics.latency_ms.p50/p95/p99`
  • `metrics.arabic_quality.arabic_mmlu_score`
  • `metrics.arabic_quality.arabicaqa_score`
  • `metrics.cost_per_1k_tokens_halala` and `metrics.cost_per_1k_tokens_sar`
  • `metrics.vram_required_gb`
  • `metrics.cold_start_ms`
  • `summary.en` and `summary.ar`
  • `benchmark_suite` and `measured_at`
  • `tier`, `launch_priority`, and `prewarm_class` for launch ordering
  • `readiness.launch_ready` and readiness targets for Section 7 go/no-go checks
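A pre-publish gate can walk that checklist mechanically. This is a hedged sketch, not a DCP API: the `REQUIRED_PATHS` list transcribes the bullets above as dotted paths, checking each latency percentile individually and only the `launch_ready` flag from the readiness block:

```python
REQUIRED_PATHS = [
    "model_id", "display_name", "family",
    "metrics.latency_ms.p50", "metrics.latency_ms.p95", "metrics.latency_ms.p99",
    "metrics.arabic_quality.arabic_mmlu_score",
    "metrics.arabic_quality.arabicaqa_score",
    "metrics.cost_per_1k_tokens_halala", "metrics.cost_per_1k_tokens_sar",
    "metrics.vram_required_gb", "metrics.cold_start_ms",
    "summary.en", "summary.ar",
    "benchmark_suite", "measured_at",
    "tier", "launch_priority", "prewarm_class",
    "readiness.launch_ready",
]

def missing_fields(card: dict) -> list:
    """Return every dotted path from the checklist that is absent in the card."""
    missing = []
    for path in REQUIRED_PATHS:
        node = card
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(path)
                break
            node = node[key]
    return missing

# A card missing its Arabic summary fails the gate:
draft = {"model_id": "example-model", "summary": {"en": "ok"}}
print("summary.ar" in missing_fields(draft))  # prints True
```

Publishing only when `missing_fields(card)` is empty keeps incomplete cards out of the release window.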

## Card template (EN/AR)

### <Display Name> (`<model_id>`)
- **Arabic quality**: MMLU <x>% · ArabicaQA <y>%
- **Latency**: P50 <x> ms · P95 <y> ms · P99 <z> ms
- **Cost**: <halala> halala / 1K tokens (<sar> SAR)
- **Deployment**: VRAM <x> GB · Cold start <y> ms
- **Best use**: <one sentence>
- **Summary (EN)**: <summary.en>
- **الملخص (AR)**: <summary.ar>
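Filling the template from a card payload is mechanical. A sketch under the field names in the required-fields checklist (the sample values are illustrative; the "Best use" line is omitted here because the checklist lists no feed field for it):

```python
def render_card(card: dict) -> str:
    """Render one EN/AR model card in the template above from a card payload."""
    m = card["metrics"]
    q = m["arabic_quality"]
    lat = m["latency_ms"]
    return "\n".join([
        f"### {card['display_name']} (`{card['model_id']}`)",
        f"- **Arabic quality**: MMLU {q['arabic_mmlu_score']}% · ArabicaQA {q['arabicaqa_score']}%",
        f"- **Latency**: P50 {lat['p50']} ms · P95 {lat['p95']} ms · P99 {lat['p99']} ms",
        f"- **Cost**: {m['cost_per_1k_tokens_halala']} halala / 1K tokens ({m['cost_per_1k_tokens_sar']} SAR)",
        f"- **Deployment**: VRAM {m['vram_required_gb']} GB · Cold start {m['cold_start_ms']} ms",
        f"- **Summary (EN)**: {card['summary']['en']}",
        f"- **الملخص (AR)**: {card['summary']['ar']}",
    ])

# Illustrative payload (P50/P99 and summaries are made up; the rest follows the matrix):
sample = {
    "model_id": "allam-7b-instruct", "display_name": "ALLaM 7B Instruct",
    "metrics": {
        "arabic_quality": {"arabic_mmlu_score": 67.2, "arabicaqa_score": 74.8},
        "latency_ms": {"p50": 520, "p95": 990, "p99": 1400},
        "cost_per_1k_tokens_halala": 132, "cost_per_1k_tokens_sar": 1.32,
        "vram_required_gb": 24, "cold_start_ms": 9100,
    },
    "summary": {"en": "Arabic-first instruct model.", "ar": "نموذج تعليمات عربي أولاً."},
}
print(render_card(sample))
```

Rendering from the feed rather than hand-editing cards keeps the EN and AR copies in lockstep with the benchmark data.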

## Publication policy notes

  • Publish EN and AR cards in the same release window.
  • Keep `benchmark_suite` visible so readers know which benchmark generation produced the numbers.
  • If methodology changes, publish a new suite label (for example `saudi-arabic-v2`) instead of mixing metrics across suites.
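The no-mixing rule can be enforced before publishing by refusing a release batch whose cards carry different suite labels. A minimal sketch (the function name is ours, not a DCP API):

```python
def single_suite(cards: list, expected: str = "saudi-arabic-v1") -> bool:
    """True only if every card in the batch was measured under one suite label."""
    return all(c.get("benchmark_suite") == expected for c in cards)

# A batch mixing v1 and v2 numbers should be rejected:
batch = [{"benchmark_suite": "saudi-arabic-v1"},
         {"benchmark_suite": "saudi-arabic-v2"}]
print(single_suite(batch))  # prints False
```

When the suite label rolls over (for example to `saudi-arabic-v2`), the `expected` argument changes once and the whole batch is re-validated against the new label.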