Start with these questions:
- Do you have an NVIDIA GPU and want maximum throughput? → Use vLLM or SGLang. They're neck-and-neck on throughput: SGLang wins when requests share a common system prompt or context, vLLM when requests are highly varied (a minimal SGLang launch is sketched after this list).
- Are you on Apple Silicon or need CPU inference? → Use llama.cpp (via llama-swap). It has the best Metal support, and GGUF quantization makes large models practical.
- Do you have an AMD GPU? → Use Ollama. It has the most mature ROCm support.
- Do you want the easiest setup with broad hardware support? → Use Ollama. It works everywhere, installs in one command, and handles model management well.
- Do you want to benchmark before committing? → Use Ollama to get started, then swap to vLLM or MAX and measure. The swap is a one-line config change.
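For reference, a minimal SGLang server launch looks like the sketch below (assumes SGLang is installed, e.g. via `pip install "sglang[all]"`; the model and port are just examples):

```bash
# Minimal SGLang server on a single NVIDIA GPU (example model and port)
python -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-14B-Instruct \
    --port 30000
```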
| Feature | Ollama | vLLM | SGLang | MAX | llama.cpp |
|---|---|---|---|---|---|
| NVIDIA CUDA | Yes | Yes | Yes | Yes | Yes |
| AMD ROCm | Yes | Partial | No | Preview | Yes |
| Apple Silicon | Yes | No | No | No | Yes |
| CPU | Yes | No | No | No | Yes |
| Install complexity | Low | Medium | Medium | Medium | Medium |
| Python required | No | Yes | Yes | Yes | No |
| Model format | GGUF/GGML | HuggingFace | HuggingFace | HuggingFace | GGUF |
| Multi-GPU | Limited | Yes (Ray) | Yes | Planned | No |
| KV-cache reuse | No | No | Yes (Radix) | Partial | No |
| Continuous batching | Yes | Yes | Yes | Yes | Limited |
| Structured output | Limited | Yes | Yes (fast) | Partial | Yes |
| Speculative decoding | No | Yes | Yes | Planned | Yes |
| Memory efficiency | Good | Excellent | Excellent | Excellent | Good |
| Relative throughput (7B) | 1.0x | 1.4x | 1.5x | 1.5x | 0.8x |
Throughput numbers are normalized to Ollama = 1.0x for a single NVIDIA GPU. Actual results vary by model and workload.
Best: vLLM or SGLang for production throughput. Ollama for easy setup or mixed workloads.
```bash
# Primary: vLLM
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-14B-Instruct \
--port 8000
# Config
export INFERNET_BACKEND=vllm
```
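For a quick sanity check that the vLLM server is answering before pointing the daemon at it, query its OpenAI-compatible API directly (port matches the command above):

```bash
# List the models the vLLM server is currently serving
curl http://localhost:8000/v1/models
```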
Best: vLLM with tensor parallelism, or MAX for newer architectures.

```bash
# vLLM with tensor parallelism for 70B-class models (set --tensor-parallel-size to your GPU count)
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-72B-Instruct \
--tensor-parallel-size 4 \
--port 8000
```

Best: Ollama (most stable ROCm support).
```bash
infernet setup   # Ollama will be auto-selected
```
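Once setup completes, a quick pull-and-run confirms the ROCm build is actually generating tokens (the model tag is just an example):

```bash
# Pull a model and run a one-off prompt as a smoke test
ollama pull qwen2.5:14b
ollama run qwen2.5:14b "Reply with a single short sentence."
```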
Best: Ollama for convenience, llama.cpp for maximum performance.

```bash
# Ollama (easiest)
ollama pull qwen2.5:14b
# llama.cpp for performance-critical nodes
llama-server \
--model qwen2.5-14b-q4km.gguf \
--n-gpu-layers 99 \
--port 8080
```
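llama-server exposes a small HTTP API alongside its OpenAI-compatible endpoints; a quick way to confirm it is up (port matches the command above):

```bash
# Returns a JSON status once the model has finished loading
curl http://localhost:8080/health
```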
Best: llama.cpp with Q4_K_M quantization.

```bash
llama-server \
--model qwen2.5-7b-q4km.gguf \
--n-gpu-layers 0 \
--threads $(nproc) \
--port 8080
```
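Both llama.cpp examples assume a Q4_K_M GGUF file is already on disk; one common way to get one is a pre-quantized build from Hugging Face (the repository and file names below are illustrative, not specific to this guide):

```bash
# Illustrative: download a pre-quantized Q4_K_M GGUF (repo and file names are examples)
huggingface-cli download Qwen/Qwen2.5-7B-Instruct-GGUF \
    qwen2.5-7b-instruct-q4_k_m.gguf \
    --local-dir ./models
```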
Switching is a one-line config change:

```json
{
  "backend": "vllm",
  "vllm_host": "http://localhost:8000",
  "vllm_model": "Qwen/Qwen2.5-14B-Instruct"
}
```
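The field that actually changes is `backend`. For instance, pointing the daemon back at a local Ollama instance could be as small as the sketch below (this assumes the daemon falls back to sensible Ollama defaults for host and model when they are omitted, which this guide does not spell out):

```json
{
  "backend": "ollama"
}
```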
Then restart the daemon:

```bash
infernet service restart
# or
infernet stop && infernet start
```

The control plane is updated on the next heartbeat. No re-registration required.
Before committing to a backend, benchmark it with realistic traffic:
```bash
# Install benchmark tool
pip install "sglang[benchmark]"
# Benchmark at a rate of 16 requests/second, 512 output tokens each
python -m sglang.bench_serving \
--backend openai \
--base-url http://localhost:8000 \
--model Qwen/Qwen2.5-14B-Instruct \
--num-prompts 100 \
--request-rate 16 \
--output-len 512
```

Key metrics to compare:

- Throughput (tokens/second): total output tokens divided by total time
- First token latency (TTFT): time from request to first token → affects perceived responsiveness
- Inter-token latency: time between tokens in a stream → should be consistent
- P99 latency: worst-case latency under load
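To compare candidates head-to-head, rerun the same command and change only the target. The sketch below points at the llama-server instance from the Apple Silicon example (port 8080) and otherwise reuses the flags above, so it inherits whatever assumptions that command makes about your sglang version; `--model` may also need to change if the second server registers its model under a different name:

```bash
# Same load profile, different target: only --base-url (and possibly --model) changes
python -m sglang.bench_serving \
    --backend openai \
    --base-url http://localhost:8080 \
    --model Qwen/Qwen2.5-14B-Instruct \
    --num-prompts 100 \
    --request-rate 16 \
    --output-len 512
```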