mistralai/Mistral-Medium-3.5-128B

Mistral Medium 3.5 (128B) dense vision-language model with native FP8 weights and 256K context

dense · 128B · 262,144 ctx · vLLM nightly+ · multimodal

Overview

Mistral-Medium-3.5 is a 128B dense vision-language model from Mistral AI. The weights ship pre-quantized to FP8 (E4M3) with the vision tower, multimodal projector, and lm_head retained in BF16. Image input is supported up to 1540x1540 (Pixtral-style encoder, patch size 14). Context length is 256K via YaRN scaling (factor 64x over the 4K base).
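
The figures in the paragraph above can be checked with a little arithmetic. The sketch below is only an illustration: it multiplies the 4K base by the YaRN factor and counts raw 14-pixel patches for the largest supported image. Rounding partial patches up is an assumption, and the number of image tokens the model actually consumes may differ from the raw patch count.

```python
# Figures quoted in the overview above.
BASE_CONTEXT = 4 * 1024   # 4K base context
YARN_FACTOR = 64          # YaRN scaling factor
MAX_IMAGE_SIDE = 1540     # maximum supported image side, in pixels
PATCH_SIZE = 14           # vision encoder patch size, in pixels


def scaled_context(base: int = BASE_CONTEXT, factor: int = YARN_FACTOR) -> int:
    """Context length after applying the YaRN scaling factor."""
    return base * factor


def patch_grid(width: int, height: int, patch: int = PATCH_SIZE) -> tuple:
    """Patches per row and per column; partial patches round up (an assumption)."""
    return (-(-width // patch), -(-height // patch))


cols, rows = patch_grid(MAX_IMAGE_SIDE, MAX_IMAGE_SIDE)
print(scaled_context())           # 262144
print(cols, rows, cols * rows)    # 110 110 12100
```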

Reasoning is opt-in per request via reasoning_effort: "high" — when set, the model emits [THINK]...[/THINK] blocks that the Mistral reasoning parser surfaces as message.reasoning_content. Tool calling uses the [AVAILABLE_TOOLS] / [TOOL_CALLS] chat-template tokens.
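
To make the [THINK] convention concrete, here is a minimal sketch of separating reasoning from the visible answer in raw model text. It is not vLLM's reasoning parser; split_reasoning is a hypothetical helper written for this illustration, and the real parser may well operate on tokens rather than on a decoded string.

```python
import re
from typing import Optional, Tuple

# Matches one [THINK]...[/THINK] span, including newlines inside it.
THINK_RE = re.compile(r"\[THINK\](.*?)\[/THINK\]", re.DOTALL)


def split_reasoning(raw: str) -> Tuple[Optional[str], str]:
    """Return (reasoning, answer) from raw text; reasoning is None if absent."""
    thoughts = [m.strip() for m in THINK_RE.findall(raw)]
    answer = THINK_RE.sub("", raw).strip()
    reasoning = "\n".join(thoughts) if thoughts else None
    return reasoning, answer


reasoning, answer = split_reasoning("[THINK]2 + 2 = 4[/THINK]The answer is 4.")
print(reasoning)  # 2 + 2 = 4
print(answer)     # The answer is 4.
```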

Prerequisites

  • Hardware: 8xH200 (recommended), 4xB200, or 2-8xMI300X. A single B200 or MI300X also fits the weights (~134 GB raw) but leaves little room for the KV cache at 256K context; see the AMD MI300X section below.
  • vLLM nightly (Mistral 3.5 architecture support has not yet shipped in a stable release).
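
A rough way to reason about the hardware options above is to subtract each GPU's share of the weights from the memory the server is allowed to use. The sketch below assumes the weights shard evenly across tensor-parallel ranks, a 0.9 utilization fraction, and nominal capacities of 141 GB (H200) and 192 GB (MI300X); all three are assumptions, so treat the output as an estimate rather than a guarantee.

```python
WEIGHTS_GB = 134  # raw weight size quoted in the prerequisites


def per_gpu_headroom_gb(gpu_mem_gb: float, tp: int, util: float = 0.9,
                        weights_gb: float = WEIGHTS_GB) -> float:
    """Memory left per GPU for KV cache and activations after loading weights.

    Simplifications: weights shard evenly across `tp` GPUs, and the server
    may use the fraction `util` of each GPU's memory.
    """
    return gpu_mem_gb * util - weights_gb / tp


# Nominal capacities are assumptions; check your own hardware.
for name, mem_gb, tp in [("H200", 141, 8), ("MI300X", 192, 1), ("MI300X", 192, 8)]:
    headroom = per_gpu_headroom_gb(mem_gb, tp)
    print(f"{tp}x{name}: ~{headroom:.1f} GB headroom per GPU")
```

The single-GPU case leaves far less headroom than the sharded ones, which is why the full 256K context is hard to serve on one device.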

Install vLLM

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto \
  --extra-index-url https://wheels.vllm.ai/nightly

This pulls in mistral_common >= 1.11.1 and transformers >= 5.4.0 automatically.
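
If you want to confirm that the environment picked up those minimum versions, a small standard-library check is enough. This is a sketch: release_tuple is a hypothetical helper that compares only the leading numeric components, so pre-release suffixes such as ".dev0" are ignored.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions stated in the install notes above.
MINIMUMS = {"mistral_common": "1.11.1", "transformers": "5.4.0"}


def release_tuple(v: str) -> tuple:
    """Leading numeric components of a version string, e.g. '5.4.0.dev0' -> (5, 4, 0)."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check(minimums=MINIMUMS) -> dict:
    """Return {package: (installed_version_or_None, meets_minimum)}."""
    report = {}
    for pkg, minimum in minimums.items():
        try:
            installed = version(pkg)
        except PackageNotFoundError:
            report[pkg] = (None, False)
            continue
        report[pkg] = (installed, release_tuple(installed) >= release_tuple(minimum))
    return report


for pkg, (installed, ok) in check().items():
    status = "ok" if ok else "too old or missing"
    print(f"{pkg}: {installed or 'not installed'} ({status})")
```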

Launch command

8xH200 (or 8xB200):

vllm serve mistralai/Mistral-Medium-3.5-128B \
  --tensor-parallel-size 8 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral \
  --enable-auto-tool-choice --tool-call-parser mistral \
  --reasoning-parser mistral

Useful flags:

  • --max-model-len: default 262144; lower it (e.g. 65536) to free VRAM for larger batch sizes on tighter GPU pools.
  • --language-model-only: skip the vision encoder entirely for text-only workloads.
  • --mm-encoder-tp-mode data: run the small vision encoder data-parallel instead of tensor-parallel — avoids the all-reduce overhead.
  • --limit-mm-per-prompt.image N: cap images per request.
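
When launch commands are generated from scripts, the flags above can be composed programmatically. The helper below is a hypothetical convenience wrapper, not part of vLLM; it only assembles flags that already appear in this guide and leaves execution to the caller.

```python
import shlex

MODEL = "mistralai/Mistral-Medium-3.5-128B"


def build_serve_cmd(tp=8, max_model_len=None, language_model_only=False,
                    images_per_prompt=None, mm_encoder_data_parallel=False):
    """Compose a `vllm serve` argv from the flags described above."""
    cmd = [
        "vllm", "serve", MODEL,
        "--tensor-parallel-size", str(tp),
        "--tokenizer_mode", "mistral",
        "--config_format", "mistral",
        "--load_format", "mistral",
        "--enable-auto-tool-choice", "--tool-call-parser", "mistral",
        "--reasoning-parser", "mistral",
    ]
    if max_model_len is not None:
        cmd += ["--max-model-len", str(max_model_len)]
    if language_model_only:
        # Text-only: the vision flags below would be pointless, so skip them.
        cmd.append("--language-model-only")
    else:
        if mm_encoder_data_parallel:
            cmd += ["--mm-encoder-tp-mode", "data"]
        if images_per_prompt is not None:
            cmd += ["--limit-mm-per-prompt.image", str(images_per_prompt)]
    return cmd


# Print a shell-ready, text-only command with a reduced context window.
print(shlex.join(build_serve_cmd(max_model_len=65536, language_model_only=True)))
```

Returning a list rather than a string keeps the result safe to pass to subprocess.run without shell quoting concerns.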

EAGLE speculative decoding

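Mistral ships a dedicated EAGLE draft head at mistralai/Mistral-Medium-3.5-128B-EAGLE. It is not part of the default launch configuration; enable it explicitly, either through the spec_decoding feature toggle in the command builder or by passing --speculative_config yourself.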

Mistral's recommended serve command (from the EAGLE model card):

vllm serve mistralai/Mistral-Medium-3.5-128B --tensor-parallel-size 8 \
  --tool-call-parser mistral --enable-auto-tool-choice --reasoning-parser mistral \
  --max_num_batched_tokens 16384 --max_num_seqs 128 --gpu_memory_utilization 0.8 \
  --speculative_config '{"model":"mistralai/Mistral-Medium-3.5-128B-EAGLE","num_speculative_tokens":3,"method":"eagle","max_model_len":"65536"}'

The draft model is a 2-layer Mistral-style head trained on the 128B target; it shares the tokenizer and runs at TP=8 alongside the target.
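
To get a feel for what num_speculative_tokens = 3 can buy, the sketch below uses the common geometric model of speculative decoding, in which each draft token is accepted independently with probability a. The acceptance rates shown are hypothetical, not measurements for this draft head, and real speedups also depend on the draft head's own cost.

```python
def expected_tokens_per_step(acceptance: float, k: int) -> float:
    """Expected tokens emitted per target forward pass with k draft tokens.

    Geometric model: 1 + a + a^2 + ... + a^k, assuming each draft token is
    accepted independently with probability `a`.
    """
    return sum(acceptance ** i for i in range(k + 1))


# Hypothetical acceptance rates, for illustration only.
for a in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_step(a, k=3)
    print(f"acceptance={a:.1f}: ~{tokens:.2f} tokens per target step")
```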

Client usage

Reasoning request against the OpenAI-compatible endpoint (tool calls go through the same chat.completions API):

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
resp = client.chat.completions.create(
    model="mistralai/Mistral-Medium-3.5-128B",
    messages=[{"role": "user", "content": "Plan a 3-day Paris trip."}],
    extra_body={"reasoning_effort": "high"},
    temperature=0.7, max_tokens=4096,
)
msg = resp.choices[0].message
print("reasoning:", getattr(msg, "reasoning_content", None))
print("answer:", msg.content)
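
A tool-calling round trip follows the usual OpenAI pattern: send the tool schemas, execute whatever the model requests, then send the results back. The sketch below assumes a server launched with --enable-auto-tool-choice and --tool-call-parser mistral as shown earlier; get_weather, dispatch, and ask_with_tools are hypothetical names invented for this example.

```python
import json

MODEL = "mistralai/Mistral-Medium-3.5-128B"


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would call a weather service."""
    return f"Sunny in {city}"


TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

REGISTRY = {"get_weather": get_weather}


def dispatch(name: str, arguments_json: str) -> str:
    """Run a requested tool; arguments arrive as a JSON-encoded string."""
    return REGISTRY[name](**json.loads(arguments_json))


def ask_with_tools(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """One tool round trip against a running server."""
    from openai import OpenAI  # imported lazily so the helpers above work offline

    client = OpenAI(api_key="EMPTY", base_url=base_url)
    messages = [{"role": "user", "content": prompt}]
    first = client.chat.completions.create(
        model=MODEL, messages=messages, tools=TOOLS, tool_choice="auto")
    msg = first.choices[0].message
    if not msg.tool_calls:
        return msg.content
    messages.append(msg)  # keep the assistant turn that requested the tools
    for call in msg.tool_calls:
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": dispatch(call.function.name, call.function.arguments),
        })
    final = client.chat.completions.create(model=MODEL, messages=messages, tools=TOOLS)
    return final.choices[0].message.content
```

With the server running, ask_with_tools("What's the weather in Paris?") performs both requests and returns the final answer.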

Image input (vision):

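# Reuses the `client` created in the previous example.
resp = client.chat.completions.create(
    model="mistralai/Mistral-Medium-3.5-128B",
    messages=[{
        "role": "user",
        "content": [
            # Replace the placeholder with a reachable image URL.
            {"type": "image_url", "image_url": {"url": "https://..."}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    max_tokens=512,
)
print(resp.choices[0].message.content)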
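
For images that are not reachable over HTTP, a common alternative is to inline the bytes as a base64 data URL. This sketch assumes the server accepts data: URLs in image_url, as OpenAI-compatible endpoints typically do; to_data_url and image_message are hypothetical helpers.

```python
import base64


def to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data URL usable in an image_url part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(image_bytes: bytes, prompt: str, mime: str = "image/png") -> dict:
    """Build a user message with one inline image followed by a text prompt."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": to_data_url(image_bytes, mime)}},
            {"type": "text", "text": prompt},
        ],
    }
```

In practice you would read the bytes from disk, for example open(path, "rb").read(), and pass the resulting message in the messages list of a chat.completions request.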

AMD MI300X (ROCm)

On MI300X, the command builder places the model on a single GPU, which is prone to out-of-memory (OOM) failures. Use one of the following configurations instead.

Single GPU - Use --tensor-parallel-size 1 and limit context, e.g. --max-model-len 131072. Without that limit, KV-cache allocation fails at the default 262144 context even in text-only mode. You can also reduce --max-num-batched-tokens or change --gpu-memory-utilization if you still hit OOM. Here is a text-only example:

docker run --device=/dev/kfd --device=/dev/dri \
  --security-opt seccomp=unconfined --group-add video \
  --privileged --ipc=host -p 8000:8000 \
  -e VLLM_ROCM_USE_AITER=1 \
  -e SAFETENSORS_FAST_GPU=1 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai-rocm:nightly mistralai/Mistral-Medium-3.5-128B \
    --tokenizer_mode mistral \
    --config_format mistral \
    --load_format mistral \
    --tensor-parallel-size 1 \
    --no-enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser mistral \
    --reasoning-parser mistral \
    --max-model-len 131072 \
    --language-model-only

Text-only - Set --tensor-parallel-size to the number of GPUs you wish to use:

docker run --device=/dev/kfd --device=/dev/dri \
  --security-opt seccomp=unconfined --group-add video \
  --privileged --ipc=host -p 8000:8000 \
  -e VLLM_ROCM_USE_AITER=1 \
  -e SAFETENSORS_FAST_GPU=1 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai-rocm:nightly mistralai/Mistral-Medium-3.5-128B \
    --tokenizer_mode mistral \
    --config_format mistral \
    --load_format mistral \
    --tensor-parallel-size 8 \
    --no-enable-prefix-caching \
    --enable-auto-tool-choice \
    --tool-call-parser mistral \
    --reasoning-parser mistral \
    --language-model-only

For lm_eval against a text-only server, pass fix_mistral_regex=True in --model_args (Mistral 3.5 tokenizer quirk).
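
As a sketch of what such an invocation might look like, the helper below assembles an lm_eval command line for the local-completions backend with fix_mistral_regex=True in --model_args. The base_url value and the helper name are assumptions for illustration; adjust them to your server.

```python
import shlex


def lm_eval_cmd(model: str,
                base_url: str = "http://localhost:8000/v1/completions",
                num_fewshot: int = 5) -> list:
    """Build an lm_eval argv for GSM8k against a local completions server."""
    model_args = ",".join([
        f"model={model}",
        f"base_url={base_url}",      # assumed local endpoint
        "fix_mistral_regex=True",    # tokenizer workaround noted above
    ])
    return [
        "lm_eval",
        "--model", "local-completions",
        "--model_args", model_args,
        "--tasks", "gsm8k",
        "--num_fewshot", str(num_fewshot),
    ]


print(shlex.join(lm_eval_cmd("mistralai/Mistral-Medium-3.5-128B")))
```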

Multimodal (text + image) - Enable image inputs and cap the number of images per prompt (1 in this example):

docker run --device=/dev/kfd --device=/dev/dri \
  --security-opt seccomp=unconfined --group-add video \
  --privileged --ipc=host -p 8000:8000 \
  -e SAFETENSORS_FAST_GPU=1 \
  -e VLLM_ROCM_USE_AITER=1 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai-rocm:nightly mistralai/Mistral-Medium-3.5-128B \
    --tokenizer_mode mistral --config_format mistral --load_format mistral \
    --tensor-parallel-size 8 \
    --no-enable-prefix-caching \
    --enable-mm \
    --limit-mm-per-prompt '{"image": 1}' \
    --enable-auto-tool-choice --tool-call-parser mistral \
    --reasoning-parser mistral

MI300X benchmarking (text-only)

Serving benchmark: vllm bench serve with 100 requests, max concurrency 32, 1024 input + 1024 output tokens per request. Accuracy: lm_eval GSM8k (5-shot), local-completions backend.

TP   Output tok/s   Mean TTFT (ms)   Mean TPOT (ms)   GSM8k flexible   GSM8k strict
1    438            5719             55.2             0.928            0.876
2    530            5310             45.0             0.923            0.867
4    729            3317             33.0             0.928            0.873
8    915            2522             26.1             0.929            0.876
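
One way to read these results is to compare each configuration with TP=1. The sketch below takes the output throughput values from the table above (438, 530, 729, and 915 tok/s) and derives speedup and per-GPU throughput; it is simple arithmetic on the reported numbers, not an additional measurement.

```python
# Output tok/s by tensor-parallel size, taken from the table above.
OUTPUT_TOK_S = {1: 438, 2: 530, 4: 729, 8: 915}

baseline = OUTPUT_TOK_S[1]
speedup = {tp: tok_s / baseline for tp, tok_s in OUTPUT_TOK_S.items()}
per_gpu = {tp: tok_s / tp for tp, tok_s in OUTPUT_TOK_S.items()}

for tp in sorted(OUTPUT_TOK_S):
    print(f"TP={tp}: speedup x{speedup[tp]:.2f}, {per_gpu[tp]:.1f} tok/s per GPU")
```

At this fixed concurrency of 32, throughput roughly doubles from TP=1 to TP=8 while per-GPU throughput falls, so higher TP mainly buys latency (lower TTFT and TPOT) rather than efficiency.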

Troubleshooting

  • OOM at full 256K context on H200 or MI300X: drop --max-model-len to 131072 or 65536, or set --language-model-only if you don't need vision.
  • reasoning_effort rejected: only "none" and "high" are accepted by the chat template — anything else raises an exception.
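
To catch the second problem before a request reaches the server, the value can be validated on the client. This is a minimal sketch; reasoning_extra_body is a hypothetical helper, and the allowed set simply mirrors the two values listed above.

```python
# Values the chat template accepts, per the troubleshooting note above.
ALLOWED_REASONING_EFFORT = {"none", "high"}


def reasoning_extra_body(effort=None) -> dict:
    """Return an extra_body dict for the OpenAI client, or raise on bad input."""
    if effort is None:
        return {}  # omit the field and use the server default
    if effort not in ALLOWED_REASONING_EFFORT:
        allowed = ", ".join(sorted(ALLOWED_REASONING_EFFORT))
        raise ValueError(f"reasoning_effort must be one of: {allowed}; got {effort!r}")
    return {"reasoning_effort": effort}


print(reasoning_extra_body("high"))  # {'reasoning_effort': 'high'}
```

Pass the result as extra_body=reasoning_extra_body("high") in client.chat.completions.create.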

References