Measured prompt read rate on a single-user local vLLM server.
Typical prose generation range. Dense/high-entropy output can be slower.
Clean fp16-KV baseline with the corrected prefill setting.
Needle retrieval at about 16.8k prompt tokens across five positions.
What passed locally
These are not broad model claims. They are the local checks run against this checkpoint and serving configuration.
| area | config / prompt | result | notes |
|---|---|---|---|
| Needle retrieval | 8k max context, fp16 KV, --max-num-batched-tokens 8192 | 110/110 | Distances swept from near the question to the front of a 7k-token prompt. |
| Needle retrieval | 18k max context, fp8 KV, --max-num-batched-tokens 18432 | 15/15 | Fresh spot check at about 16.8k prompt tokens; positions 5%, 25%, 50%, 75%, 92%. |
| Sequential data | synthetic time-series/event windows, JSON answer format | format 1.00 | Retrieval was high but not perfect in older 8k sweeps; exact-value tasks should still validate outputs. |
| Structured output | strict JSON extraction prompt | parsed | Fresh check returned valid JSON with the expected fields and values. |
| Tool calls | OpenAI-compatible tools plus tool_choice:"auto" | called tool | Fresh check emitted a valid function call and completed a tool-result round trip. |
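A needle spot check like the ones in the table can be set up with a small prompt builder. The sketch below is illustrative only: `build_needle_prompt`, `needle_found`, the filler text, and the needle are all hypothetical, and the actual harness behind the table is not shown in this document.

```python
def build_needle_prompt(filler_sentences, needle, question, position):
    """Place `needle` at a fractional `position` (0.0-1.0) within the filler text."""
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0.0 and 1.0")
    index = round(position * len(filler_sentences))
    body = filler_sentences[:index] + [needle] + filler_sentences[index:]
    return " ".join(body) + "\n\nQuestion: " + question


def needle_found(answer, expected):
    """Lenient check: the expected value appears verbatim in the answer."""
    return expected.lower() in answer.lower()


# Hypothetical filler and needle; a real run would use far more filler text.
filler = [f"Log line {i}: nothing unusual happened." for i in range(100)]
needle = "The vault code is 7319."
positions = [0.05, 0.25, 0.50, 0.75, 0.92]  # same fractions as the 18k spot check

prompts = [
    build_needle_prompt(filler, needle, "What is the vault code?", p)
    for p in positions
]
```

Each prompt would then be sent to the server and the reply scored with `needle_found(reply, "7319")`. Substring matching is a deliberately loose criterion; stricter scoring is a reasonable choice for exact-value tasks.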
Speed is the main practical upside
The numbers are local, single-user, and content-dependent. Prefill is strong; decode is fast for a local 26B-class model.
Prefill rate
TTFT examples: 8k prompt about 0.93s; 16k prompt about 1.94s.
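As a rough cross-check, the TTFT examples can be turned into an implied prefill rate. This sketch assumes "8k" and "16k" mean 8,000 and 16,000 tokens and that TTFT is dominated by prefill; both are assumptions, so the result is only a lower-bound estimate.

```python
def implied_prefill_rate(prompt_tokens, ttft_seconds):
    """Tokens per second if the whole TTFT were spent on prefill."""
    return prompt_tokens / ttft_seconds


# TTFT figures quoted above; the token counts are assumed round numbers.
for tokens, ttft in [(8_000, 0.93), (16_000, 1.94)]:
    rate = implied_prefill_rate(tokens, ttft)
    print(f"{tokens} tokens / {ttft}s = about {rate:,.0f} tok/s")
```

Under those assumptions both examples work out to a little over 8k tokens per second.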
Decode rate
Longer completions amortize fixed request overhead better than short answers.
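The amortization effect is simple arithmetic. The sketch below uses made-up values for the decode rate and the fixed overhead purely to show the shape of the curve; they are not measurements from this server.

```python
def effective_rate(completion_tokens, decode_rate, overhead_seconds):
    """End-to-end tokens/s once a fixed per-request overhead is included."""
    total_seconds = overhead_seconds + completion_tokens / decode_rate
    return completion_tokens / total_seconds


# Hypothetical values, not measurements from this server.
DECODE_RATE = 100.0  # tokens/s while decoding
OVERHEAD = 0.5       # seconds of fixed per-request cost

for n in (50, 200, 1000):
    print(n, round(effective_rate(n, DECODE_RATE, OVERHEAD), 1))
```

With these illustrative numbers, a 50-token answer sees half the raw decode rate, while a 1000-token answer sees about 95% of it.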
Simple examples
These are small smoke tests, included to show API shape rather than to claim a full benchmark.
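A code-generation smoke test of this kind can be checked mechanically: pull the fenced block out of the response, compile it, and confirm the expected name is defined. The helpers and the sample response below are hypothetical, written only to show one way such a check might look.

```python
import re

FENCE = "`" * 3  # built this way so the example can live inside a fenced block


def extract_python_block(response_text):
    """Return the first fenced code block in a model response, or None."""
    pattern = FENCE + r"(?:python)?\s*\n(.*?)" + FENCE
    match = re.search(pattern, response_text, re.DOTALL)
    return match.group(1) if match else None


def defines_function(source, name):
    """Compile and run `source`, then report whether it defines callable `name`."""
    namespace = {}
    exec(compile(source, "<model-output>", "exec"), namespace)
    return callable(namespace.get(name))


# Hypothetical model response used only to exercise the checker.
sample = (
    "Here is the function:\n"
    + FENCE + "python\n"
    + "def rolling_zscore(values, window):\n"
    + "    return [0.0 for _ in values]\n"
    + FENCE + "\n"
)
code = extract_python_block(sample)
assert code is not None and defines_function(code, "rolling_zscore")
```

Because `exec` runs whatever the model produced, a check like this belongs in a sandbox or throwaway container. It shows that the code defines the function, not that the function is correct.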
rolling_zscore(values, window) function; the extracted code compiled and defined the expected function.
Good fits and limits
Fast local generation
Drafting, summarization, structured transformations, and code snippets where high local token throughput matters.
Compact data windows
Sequential logs or time-series windows where the important semantic unit fits in a short canvas and the answer can be validated.
Exact long-context extraction
The corrected serving config fixes the previous dead-zone artifact, but production extraction should still use schemas, checksums, or verification passes.
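A verification pass for extraction output can be as small as the sketch below. It assumes the model was asked for a flat JSON object and that every extracted value should appear verbatim in the source text; `verify_extraction` and the sample data are hypothetical.

```python
import json


def verify_extraction(raw_output, required_fields, source_text):
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for field in required_fields:
        if field not in data:
            problems.append(f"missing field: {field}")
        elif str(data[field]) not in source_text:
            # Grounding check: the extracted value must appear verbatim.
            problems.append(f"value not found in source: {field}")
    return problems


# Hypothetical source document and model outputs.
source = "Order 4471 shipped on 2024-03-02 to Lisbon."
good = '{"order_id": "4471", "city": "Lisbon"}'
bad = '{"order_id": "4417", "city": "Lisbon"}'

print(verify_extraction(good, ["order_id", "city"], source))
print(verify_extraction(bad, ["order_id", "city"], source))
```

The verbatim check catches transposed digits like the `4417` case, which a schema check alone would accept. It does not suit values the model is expected to normalize or reformat.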
Docker setup
Use the Gemma diffusion vLLM image. Let vLLM auto-detect the compressed-tensors INT4 checkpoint. The important serving flag is --max-num-batched-tokens.
Download the model
pip install -U huggingface_hub
huggingface-cli download \
cyankiwi/diffusiongemma-26B-A4B-it-AWQ-INT4 \
--local-dir ./diffusiongemma-26B-A4B-it-AWQ-INT4
Use the vllm/vllm-openai:gemma image and do not pass --quantization awq_marlin.
Startup string: 10k context, fp16 KV
MODEL_DIR=$PWD/diffusiongemma-26B-A4B-it-AWQ-INT4
docker run -d --name dg-10k --gpus all --ipc=host \
-e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
-v "$MODEL_DIR":/model:ro -p 8001:8000 \
vllm/vllm-openai:gemma \
--model /model --served-model-name dg-awq \
--max-model-len 10000 --max-num-seqs 1 \
--gpu-memory-utilization 0.88 \
--max-num-batched-tokens 10240 \
--kv-cache-dtype float16 \
--host 0.0.0.0 --port 8000 \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--reasoning-parser gemma4
Startup string: 18k context, fp8 KV
MODEL_DIR=$PWD/diffusiongemma-26B-A4B-it-AWQ-INT4
docker run -d --name dg-18k --gpus all --ipc=host \
-e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
-v "$MODEL_DIR":/model:ro -p 8001:8000 \
vllm/vllm-openai:gemma \
--model /model --served-model-name dg-awq \
--max-model-len 18000 --max-num-seqs 1 \
--gpu-memory-utilization 0.90 \
--max-num-batched-tokens 18432 \
--kv-cache-dtype fp8 \
--host 0.0.0.0 --port 8000 \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--reasoning-parser gemma4
After startup, the 18k container's log should include compile_ranges_endpoints: [18432]. For the 8k baseline, use --max-model-len 8000, --max-num-batched-tokens 8192, and fp16 KV.
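The 8k, 10k, and 18k configurations all set --max-num-batched-tokens to the next multiple of 1024 at or above --max-model-len. That is a pattern read off these three settings, not a documented vLLM rule, and the helper below is a hypothetical convenience for reproducing it at other context lengths.

```python
def batched_tokens_for(max_model_len, multiple=1024):
    """Round max_model_len up to the next multiple (ceiling division)."""
    return -(-max_model_len // multiple) * multiple


# The three context lengths used in this document.
for max_len in (8000, 10000, 18000):
    print(
        f"--max-model-len {max_len} "
        f"--max-num-batched-tokens {batched_tokens_for(max_len)}"
    )
```

For any other context length, it is still worth confirming the value in the startup log before trusting retrieval results.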
Verify and call it
until curl -sf localhost:8001/health; do sleep 5; done
docker logs dg-18k 2>&1 | grep compile_ranges
curl -s localhost:8001/v1/models
curl -s localhost:8001/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model":"dg-awq",
"messages":[{"role":"user","content":"State one practical use for block diffusion text generation."}],
"temperature":0,
"max_tokens":80
}'
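The tool-call check uses the same endpoint with a `tools` array added to the request. The sketch below only builds the request body; `build_tool_request` and the `get_weather` tool are hypothetical, and the `max_tokens` value is an arbitrary choice. The body follows the OpenAI-compatible function-calling schema.

```python
import json


def build_tool_request(model, user_message, tools):
    """Assemble a chat-completions body that lets the model decide on tool use."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": tools,
        "tool_choice": "auto",
        "temperature": 0,
        "max_tokens": 200,
    }


# Hypothetical tool definition in the OpenAI function-calling schema.
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

body = build_tool_request("dg-awq", "What is the weather in Oslo?", [get_weather])
print(json.dumps(body, indent=2))
```

The printed JSON can be saved to a file and posted to localhost:8001/v1/chat/completions with `curl -d @file`. A successful tool call should come back with a `tool_calls` entry in the assistant message.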