vLLM is a high-throughput inference engine for NVIDIA GPUs. It uses PagedAttention to manage KV-cache memory more efficiently than naive implementations, enabling higher concurrency and better GPU utilization under load.
Use vLLM when you have steady request volume and want to maximize tokens per second per dollar of GPU cost.
Install with pip:

```shell
pip install vllm
```

For a specific CUDA version:

```shell
# CUDA 12.1
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121

# CUDA 11.8
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu118
```

Using Docker (recommended for production):

```shell
docker pull vllm/vllm-openai:latest
```

Verify the install:

```shell
python -c "import vllm; print(vllm.__version__)"
```

vLLM exposes an OpenAI-compatible REST API. Start the server pointing at your model:
```shell
# HuggingFace model
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-14B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name qwen2.5:14b
```

With Docker:

```shell
docker run --gpus all --rm \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-14B-Instruct \
  --served-model-name qwen2.5:14b
```

Test that it's working:
```shell
curl http://localhost:8000/health
# returns HTTP 200 with an empty body once the server is ready
```
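In scripts it is handy to wait for the server to come up before sending traffic. A minimal sketch using only the Python standard library (the URL, timeout, and polling interval here are arbitrary choices, not vLLM defaults):

```python
import time
import urllib.error
import urllib.request

def wait_for_ready(base_url: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll the server's /health endpoint until it returns HTTP 200 or we time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url + "/health", timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server still starting (or unreachable); retry
        time.sleep(interval)
    return False

# e.g. wait_for_ready("http://localhost:8000") before sending requests
```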
```shell
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5:14b", "prompt": "Hello", "max_tokens": 10}'
```

PagedAttention is vLLM's key performance feature. A naive KV-cache implementation allocates a fixed block of memory per sequence up front, sized for the maximum context length. For short sequences, most of that allocation is wasted.
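To put rough numbers on that waste, a back-of-envelope sketch. The model dimensions below are my assumptions for Qwen2.5-14B (48 layers, 8 KV heads via GQA, head_dim 128, fp16 cache); check the model's `config.json` for the real values:

```python
# Assumed Qwen2.5-14B dimensions (verify against the model's config.json)
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 48, 8, 128, 2

# Per token per layer we cache one K and one V vector per KV head
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES

max_context = 32_768    # what a naive implementation preallocates for
actual_tokens = 500     # a typical short request

prealloc_gib = bytes_per_token * max_context / 2**30
used_gib = bytes_per_token * actual_tokens / 2**30
print(f"per token: {bytes_per_token / 1024:.0f} KiB")
print(f"naive preallocation: {prealloc_gib:.1f} GiB; actually used: {used_gib:.2f} GiB")
```

Under these assumptions a naive allocator reserves about 6 GiB per sequence while a short request touches under 100 MiB of it.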
PagedAttention instead manages the cache like OS virtual memory: the KV cache is allocated in small fixed-size pages, and pages are added only as a sequence actually grows. This keeps memory waste to a small fraction of the cache and lets sequences share pages, for example a common prompt prefix across parallel samples.
The result is 2–4x higher throughput than naive implementations at the same VRAM budget.
You don’t need to configure PagedAttention — it’s on by default. The only relevant setting is the block size:
```shell
# Default block size is 16 tokens per page
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-14B-Instruct \
  --block-size 16
```

Larger block sizes (32) improve throughput for long sequences. Smaller block sizes (8) improve memory efficiency for short sequences.
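For intuition about what a page costs, here is the per-page footprint at each block size, assuming Qwen2.5-14B-like dimensions (48 layers, 8 KV heads via GQA, head_dim 128, fp16 cache; these are my assumptions, not values from vLLM):

```python
# Assumed model dimensions (verify against the model's config.json)
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 48, 8, 128, 2
bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES  # K and V

for block_size in (8, 16, 32):
    page_mib = block_size * bytes_per_token / 2**20
    print(f"--block-size {block_size:2d}: {page_mib:.1f} MiB per page")
```

A sequence wastes at most one partially filled page, so smaller pages waste less memory per sequence at the cost of more page-management overhead.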
For models larger than a single GPU's VRAM, vLLM shards the model across GPUs with tensor parallelism, using Ray as its distributed runtime:
```shell
# Install Ray
pip install ray

# Start with 2-GPU tensor parallelism
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 2 \
  --host 0.0.0.0 \
  --port 8000
```

For 4 GPUs:
```shell
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 4 \
  --host 0.0.0.0 \
  --port 8000
```

See Chapter 6: Multi-GPU for more on multi-GPU configurations.
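A quick way to sanity-check a tensor-parallel size is to estimate the weight footprint per GPU. This rough sketch ignores KV-cache and activation overhead, so real requirements are noticeably higher:

```python
def weights_per_gpu_gib(params_billion: float, bytes_per_param: int, tp: int) -> float:
    """Approximate model-weight footprint per GPU: weights are sharded ~evenly."""
    return params_billion * 1e9 * bytes_per_param / tp / 2**30

# A 72B model in fp16 (2 bytes/param) at various tensor-parallel sizes
for tp in (1, 2, 4):
    print(f"72B fp16, tensor-parallel-size={tp}: "
          f"~{weights_per_gpu_gib(72, 2, tp):.0f} GiB of weights per GPU")
```

Under these assumptions a 72B fp16 model needs roughly 134 GiB of weights alone, which is why it cannot fit on one GPU and wants a tensor-parallel size of 2 or 4 depending on per-GPU VRAM.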
Relevant environment variables:

```shell
# Model server address
VLLM_HOST=http://localhost:8000

# Default model name (used if serving a single model)
VLLM_MODEL=Qwen/Qwen2.5-14B-Instruct

# HuggingFace token (for gated models like Llama)
HF_TOKEN=hf_your_token_here

# CUDA device selection
CUDA_VISIBLE_DEVICES=0,1

# Disable usage stats reporting
VLLM_NO_USAGE_STATS=1
```

The corresponding backend configuration:

```json
{
  "backend": "vllm",
  "vllm_host": "http://localhost:8000",
  "vllm_model": "Qwen/Qwen2.5-14B-Instruct"
}
```

Useful tuning flags:

```shell
# Increase max concurrent requests (default: 256)
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-14B-Instruct \
  --max-num-seqs 512

# Limit GPU memory fraction (default: 0.90)
# Useful if you want headroom for other processes
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-14B-Instruct \
  --gpu-memory-utilization 0.85

# Quantization (reduces VRAM at some quality cost);
# requires an AWQ-quantized checkpoint
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-14B-Instruct-AWQ \
  --quantization awq

# Use FP8 KV cache (reduces VRAM for the KV cache)
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-14B-Instruct \
  --kv-cache-dtype fp8
```

Unlike Ollama, a single vLLM server instance typically serves one model. To serve multiple models, run multiple instances on different ports:
```shell
# Model 1 on port 8000
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-14B-Instruct \
  --port 8000 &

# Model 2 on port 8001
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --port 8001 &
```

The Infernet daemon handles routing to the correct port based on the requested model.
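The routing idea can be sketched in a few lines. The mapping and function names here are hypothetical illustrations, not Infernet's actual API:

```python
# Hypothetical model-name -> port table matching the two instances above
MODEL_PORTS = {
    "Qwen/Qwen2.5-14B-Instruct": 8000,
    "Qwen/Qwen2.5-7B-Instruct": 8001,
}

def upstream_url(model: str, path: str = "/v1/completions") -> str:
    """Resolve a requested model name to the vLLM instance serving it."""
    try:
        port = MODEL_PORTS[model]
    except KeyError:
        raise ValueError(f"no vLLM instance serves {model!r}") from None
    return f"http://localhost:{port}{path}"

print(upstream_url("Qwen/Qwen2.5-7B-Instruct"))
# A real daemon would forward the original request body to this URL.
```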
A systemd unit to keep the server running across reboots:

```ini
[Unit]
Description=vLLM Inference Server
After=network.target

[Service]
Type=simple
User=infernet
ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-14B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name qwen2.5:14b
Restart=always
RestartSec=10
Environment=CUDA_VISIBLE_DEVICES=0

[Install]
WantedBy=multi-user.target
```