DiffusionGemma-26B-A4B-it-AWQ-INT4 / vLLM / RTX 4090

Local benchmark notes for serving DiffusionGemma on one 4090.

This page keeps to measured behavior: throughput, long-context retrieval checks, JSON/tool behavior, and the Docker flags needed to avoid the old chunked-prefill artifact.

~8k tokens/s
prefill at 8k-16k context

Measured prompt read rate on a single-user local vLLM server.

~525-575 tokens/s
steady decode

Typical prose generation range. Dense/high-entropy output can be slower.

110/110
8k retrieval probe

Clean fp16-KV baseline with the corrected prefill setting.

15/15
18k fp8 spot check

Needle retrieval at about 16.8k prompt tokens across five positions.

Results

What passed locally

These are not broad model claims. They are the local checks run against this checkpoint and serving configuration.

| area | config / prompt | result | notes |
| --- | --- | --- | --- |
| Needle retrieval | 8k max context, fp16 KV, --max-num-batched-tokens 8192 | 110/110 | Distances swept from near the question to the front of a 7k-token prompt. |
| Needle retrieval | 18k max context, fp8 KV, --max-num-batched-tokens 18432 | 15/15 | Fresh spot check at about 16.8k prompt tokens; positions 5%, 25%, 50%, 75%, 92%. |
| Sequential data | synthetic time-series/event windows, JSON answer format | format 1.00 | Retrieval was high but not perfect in older 8k sweeps; exact-value tasks should still validate outputs. |
| Structured output | strict JSON extraction prompt | parsed | Fresh check returned valid JSON with the expected fields and values. |
| Tool calls | OpenAI-compatible tools plus tool_choice:"auto" | called tool | Fresh check emitted a valid function call and completed a tool-result round trip. |
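A needle probe of the kind listed in the table can be sketched as follows. This is not the harness used for the results above; the needle text, filler sentence, sentence count, and the substring scoring rule are all assumptions chosen for illustration. Only the five depths come from the 18k spot check.

```python
def build_needle_prompt(needle, question, filler_sentence, total_sentences, position):
    """Place `needle` at a fractional depth (0.0 to 1.0) inside repeated filler text."""
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0.0 and 1.0")
    sentences = [filler_sentence] * total_sentences
    index = round(position * total_sentences)
    sentences.insert(index, needle)
    return " ".join(sentences) + "\n\n" + question


def score_answer(answer, expected):
    """Assumed scoring rule: pass only if the expected value appears verbatim."""
    return expected in answer


# Depths listed for the 18k spot check.
POSITIONS = [0.05, 0.25, 0.50, 0.75, 0.92]

prompts = [
    build_needle_prompt(
        needle="The maintenance code for pump B is CYAN-4821.",  # hypothetical needle
        question="What is the maintenance code for pump B?",
        filler_sentence="The plant reported routine operation during this interval.",
        total_sentences=200,  # hypothetical size; scale up to reach the target token count
        position=p,
    )
    for p in POSITIONS
]
```

Each prompt would then be sent to the server and the reply passed to `score_answer`. Token counts depend on the tokenizer, so the sentence count needs tuning to reach a target such as 16.8k prompt tokens.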
Throughput

Speed is the main practical upside

The numbers are local, single-user, and content-dependent. Prefill reaches roughly 8k tokens/s at 8k-16k context, and decode runs at roughly 525-575 tokens/s on prose, which is fast for a local 26B-class model.

Prefill rate

2k ctx: ~4.5k tokens/s
8k ctx: ~8.5k tokens/s
16k ctx: ~8.2k tokens/s

TTFT examples: 8k prompt about 0.93s; 16k prompt about 1.94s.
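The TTFT examples can be cross-checked against the prefill rates with simple division. The sketch below assumes "8k" and "16k" mean 8,000 and 16,000 prompt tokens and that time to first token is dominated by prefill; both are simplifications.

```python
def implied_prefill_rate(prompt_tokens, ttft_seconds):
    """Prompt tokens divided by time to first token, in tokens per second."""
    return prompt_tokens / ttft_seconds


# (prompt tokens, TTFT in seconds) taken from the examples above,
# with "8k" and "16k" assumed to be round thousands.
for prompt_tokens, ttft in [(8_000, 0.93), (16_000, 1.94)]:
    rate = implied_prefill_rate(prompt_tokens, ttft)
    print(f"{prompt_tokens} tokens / {ttft} s = {rate:,.0f} tokens/s")
```

Under those assumptions the results come out near 8.6k and 8.2k tokens/s, in line with the ~8.5k and ~8.2k figures listed for 8k and 16k context.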

Decode rate

prose: 525-575 tokens/s
dense text: ~430 tokens/s
short calls: latency bound

Longer completions amortize fixed request overhead better than short answers.
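The amortization effect can be illustrated with a simple model: total time is a fixed per-request overhead plus the completion length divided by the steady decode rate. The 0.10 s overhead below is a hypothetical placeholder, not a measured value. The 550 tokens/s figure is the midpoint of the prose range above.

```python
def effective_decode_rate(completion_tokens, steady_rate, fixed_overhead_s):
    """Observed tokens/s once a fixed per-request overhead is included."""
    total_seconds = fixed_overhead_s + completion_tokens / steady_rate
    return completion_tokens / total_seconds


STEADY_RATE = 550.0      # midpoint of the measured prose range
FIXED_OVERHEAD_S = 0.10  # hypothetical per-request overhead, for illustration only

for n in (20, 80, 800):
    rate = effective_decode_rate(n, STEADY_RATE, FIXED_OVERHEAD_S)
    print(f"{n:4d} tokens -> {rate:.0f} tokens/s observed")
```

With any positive overhead, the observed rate approaches the steady rate as completions get longer, which is why short calls read as latency bound.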

Capability checks

Simple examples

These are small smoke tests, included to show API shape rather than to claim a full benchmark.

JSON output:
{ "incident_code": "CYAN-4821", "severity": "high", "actions": ["isolate pump B", "notify operations", "inspect coolant line"] }

Tool call:
get_sensor_window({ "asset_id": "PUMP-B", "start_minute": 120, "end_minute": 180 })

Tool result:
Asset PUMP-B reached critical status with max_temp_c 87.3 and three over-limit minutes.

Code sample:
Generated a compact Python rolling_zscore(values, window) function; the extracted code compiled and defined the expected function.
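For reference, the tool-call round trip above maps onto the OpenAI-compatible chat format roughly as sketched here. The parameter schema, the tool description, the user prompt, and the `call_0` id are assumptions made for illustration; the actual test definitions are not shown on this page.

```python
import json

# Hypothetical schema for the get_sensor_window tool used in the example above.
tools = [{
    "type": "function",
    "function": {
        "name": "get_sensor_window",
        "description": "Return sensor readings for an asset between two minute offsets.",
        "parameters": {
            "type": "object",
            "properties": {
                "asset_id": {"type": "string"},
                "start_minute": {"type": "integer"},
                "end_minute": {"type": "integer"},
            },
            "required": ["asset_id", "start_minute", "end_minute"],
        },
    },
}]

messages = [{"role": "user", "content": "Check PUMP-B between minutes 120 and 180."}]

# First request: the model may answer directly or emit a tool call.
request = {"model": "dg-awq", "messages": messages, "tools": tools, "tool_choice": "auto"}

# Illustrative shape of a tool call in the response (the id is made up).
example_tool_call = {
    "id": "call_0",
    "type": "function",
    "function": {
        "name": "get_sensor_window",
        "arguments": json.dumps({"asset_id": "PUMP-B", "start_minute": 120, "end_minute": 180}),
    },
}


def with_tool_result(history, tool_call, result_text):
    """Append the assistant tool call and the tool's reply for the follow-up request."""
    return history + [
        {"role": "assistant", "content": None, "tool_calls": [tool_call]},
        {"role": "tool", "tool_call_id": tool_call["id"], "content": result_text},
    ]


follow_up = with_tool_result(
    messages,
    example_tool_call,
    "Asset PUMP-B reached critical status with max_temp_c 87.3 and three over-limit minutes.",
)
```

Sending `follow_up` as the `messages` of a second request completes the round trip. Note that `arguments` is a JSON string, so it should be parsed and validated before the tool is actually invoked.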
Use it for

Good fits and limits

good fit

Fast local generation

Drafting, summarization, structured transformations, and code snippets where high local token throughput matters.

good fit

Compact data windows

Sequential logs or time-series windows where the important semantic unit fits in a short canvas and the answer can be validated.

keep checking

Exact long-context extraction

The corrected serving config fixes the previous dead-zone artifact, but production extraction should still use schemas, checksums, or verification passes.
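A minimal verification pass might look like the sketch below. It assumes the three-field shape from the JSON example on this page; the field types and the verbatim-match check are illustrative choices rather than a tested production schema.

```python
import json

# Field names follow the JSON example on this page; the types are assumed.
EXPECTED_FIELDS = {"incident_code": str, "severity": str, "actions": list}


def validate_incident(raw_text):
    """Parse model output and check field presence and types, else raise ValueError."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
    if not all(isinstance(action, str) for action in data["actions"]):
        raise ValueError("actions must be a list of strings")
    return data


def appears_in_source(value, source_text):
    """Cheap guard against invented values: the extracted string must occur verbatim."""
    return value in source_text
```

A caller would run `validate_incident` on the raw reply and then `appears_in_source(data["incident_code"], document_text)` before trusting an extracted identifier.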

Run it

Docker setup

Use the Gemma diffusion vLLM image. Let vLLM auto-detect the compressed-tensors INT4 checkpoint. The important serving flag is --max-num-batched-tokens.
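The commands on this page pair each --max-model-len with a slightly larger --max-num-batched-tokens: 8000 with 8192, 10000 with 10240, and 18000 with 18432. Those values are consistent with rounding the context length up to the next multiple of 1024, presumably so a full-length prompt fits in a single prefill batch. That rule is inferred from the three pairs and is not a documented vLLM requirement; the helper below is a hypothetical convenience.

```python
def batched_tokens_for(max_model_len, multiple=1024):
    """Round max_model_len up to the next multiple (pattern inferred from this page)."""
    if max_model_len <= 0 or multiple <= 0:
        raise ValueError("max_model_len and multiple must be positive")
    return -(-max_model_len // multiple) * multiple  # ceiling division, then scale back


for context in (8000, 10000, 18000):
    print(f"--max-model-len {context} --max-num-batched-tokens {batched_tokens_for(context)}")
```

For any other context length, treat the computed value as a starting point and confirm it against the retrieval checks rather than assuming the pattern holds.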

Download the model

pip install -U huggingface_hub

huggingface-cli download \
  cyankiwi/diffusiongemma-26B-A4B-it-AWQ-INT4 \
  --local-dir ./diffusiongemma-26B-A4B-it-AWQ-INT4

Do not pass AWQ flags. This checkpoint is compressed-tensors INT4, not AutoAWQ. Use vllm/vllm-openai:gemma and do not pass --quantization awq_marlin.

Startup string: 10k context, fp16 KV

MODEL_DIR=$PWD/diffusiongemma-26B-A4B-it-AWQ-INT4

docker run -d --name dg-10k --gpus all --ipc=host \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -v "$MODEL_DIR":/model:ro -p 8001:8000 \
  vllm/vllm-openai:gemma \
    --model /model --served-model-name dg-awq \
    --max-model-len 10000 --max-num-seqs 1 \
    --gpu-memory-utilization 0.88 \
    --max-num-batched-tokens 10240 \
    --kv-cache-dtype float16 \
    --host 0.0.0.0 --port 8000 \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4 \
    --reasoning-parser gemma4

Startup string: 18k context, fp8 KV

MODEL_DIR=$PWD/diffusiongemma-26B-A4B-it-AWQ-INT4

docker run -d --name dg-18k --gpus all --ipc=host \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -v "$MODEL_DIR":/model:ro -p 8001:8000 \
  vllm/vllm-openai:gemma \
    --model /model --served-model-name dg-awq \
    --max-model-len 18000 --max-num-seqs 1 \
    --gpu-memory-utilization 0.90 \
    --max-num-batched-tokens 18432 \
    --kv-cache-dtype fp8 \
    --host 0.0.0.0 --port 8000 \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4 \
    --reasoning-parser gemma4

Check the compile range. For the 18k command, logs should include compile_ranges_endpoints: [18432]. For the 8k baseline, use --max-model-len 8000, --max-num-batched-tokens 8192, and fp16 KV.
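The compile-range check can be scripted against captured log text. The sketch below assumes the log line contains the key followed by a bracketed list, optionally with a quoted key, as in the string quoted above. The exact log format may differ between vLLM builds, so the regular expression is an assumption.

```python
import re

# Assumed shape: compile_ranges_endpoints: [18432] (quotes around the key tolerated).
_PATTERN = re.compile(r"compile_ranges_endpoints['\"]?\s*[:=]\s*\[([^\]]*)\]")


def compile_range_endpoints(log_text):
    """Return the first logged endpoint list as integers, or None if absent."""
    match = _PATTERN.search(log_text)
    if match is None:
        return None
    return [int(value) for value in match.group(1).split(",") if value.strip()]


def compile_range_matches(log_text, max_num_batched_tokens):
    """True when the logged endpoints are exactly [max_num_batched_tokens]."""
    return compile_range_endpoints(log_text) == [max_num_batched_tokens]
```

Log text can be captured with `docker logs dg-18k 2>&1` and passed in as a string; a `False` result suggests the server was started with a different --max-num-batched-tokens than intended.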

Verify and call it

until curl -sf localhost:8001/health; do sleep 5; done
docker logs dg-18k 2>&1 | grep compile_ranges
curl -s localhost:8001/v1/models

curl -s localhost:8001/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"dg-awq",
    "messages":[{"role":"user","content":"State one practical use for block diffusion text generation."}],
    "temperature":0,
    "max_tokens":80
  }'
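The same request can be issued from Python with only the standard library, which also makes it easy to time. This is a sketch assuming the port and served model name from the commands above, and a response that includes the usual OpenAI-style `usage` block; the helper names are hypothetical.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8001"  # host port mapped in the docker commands above


def build_chat_request(prompt, max_tokens=80):
    """Build the POST request that mirrors the curl example."""
    body = {
        "model": "dg-awq",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def timed_completion(prompt, max_tokens=80):
    """Return (reply text, completion tokens per wall-clock second)."""
    request = build_chat_request(prompt, max_tokens)
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=120) as response:
        payload = json.load(response)
    elapsed = time.perf_counter() - start
    tokens = payload["usage"]["completion_tokens"]
    return payload["choices"][0]["message"]["content"], tokens / elapsed
```

Calling `timed_completion("State one practical use for block diffusion text generation.")` requires the server to be running. The rate it returns includes prefill and request overhead, so it will understate steady decode speed on short answers.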