Choosing a Backend

Decision Guide

Start with these questions:

Do you have an NVIDIA GPU and want maximum throughput? → Use vLLM or SGLang. They're neck-and-neck on throughput. SGLang wins when requests share a common system prompt or context. vLLM wins when requests are highly varied.

Are you on Apple Silicon or need CPU inference? → Use llama.cpp (via llama-swap). It has the best Metal support, and GGUF quantization makes large models practical.

Do you have an AMD GPU? → Use Ollama. It has the most mature ROCm support.

Do you want the easiest setup with broad hardware support? → Use Ollama. It works everywhere, installs in one command, and handles model management well.

Do you want to benchmark before committing? → Use Ollama to get started, then swap to vLLM or MAX and measure. The swap is a single config line change.


Comparison Table

| Feature | Ollama | vLLM | SGLang | MAX | llama.cpp |
| --- | --- | --- | --- | --- | --- |
| NVIDIA CUDA | Yes | Yes | Yes | Yes | Yes |
| AMD ROCm | Yes | Partial | No | Preview | Yes |
| Apple Silicon | Yes | No | No | No | Yes |
| CPU | Yes | No | No | No | Yes |
| Install complexity | Low | Medium | Medium | Medium | Medium |
| Python required | No | Yes | Yes | Yes | No |
| Model format | GGUF/GGML | HuggingFace | HuggingFace | HuggingFace | GGUF |
| Multi-GPU | Limited | Yes (Ray) | Yes | Planned | No |
| KV-cache reuse | No | No | Yes (Radix) | Partial | No |
| Continuous batching | Yes | Yes | Yes | Yes | Limited |
| Structured output | Limited | Yes | Yes (fast) | Partial | Yes |
| Speculative decoding | No | Yes | Yes | Planned | Yes |
| Memory efficiency | Good | Excellent | Excellent | Excellent | Good |
| Relative throughput (7B) | 1.0x | 1.4x | 1.5x | 1.5x | 0.8x |

Throughput numbers are normalized to Ollama = 1.0x for a single NVIDIA GPU. Actual results vary by model and workload.


Hardware-Specific Recommendations

NVIDIA RTX 4090 (24GB)

Best: vLLM or SGLang for production throughput. Ollama for easy setup or mixed workloads.

# Primary: vLLM (AWQ build of the 14B model; the bf16 weights alone exceed 24 GB)
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-14B-Instruct-AWQ \
  --quantization awq \
  --port 8000

# Config
export INFERNET_BACKEND=vllm
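
If you prefer SGLang on the same card, its OpenAI-compatible server is started with sglang.launch_server. A minimal sketch, assuming the same quantized checkpoint and port as the vLLM example above:

# Alternative: SGLang
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-14B-Instruct-AWQ \
  --port 8000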

NVIDIA H100 (80GB)

Best: vLLM with tensor parallelism, or MAX for newer architectures.

# vLLM single-GPU: a 72B model fits in 80 GB only with quantized weights (AWQ here)
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-72B-Instruct-AWQ \
  --quantization awq \
  --port 8000
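
With more than one GPU, vLLM's --tensor-parallel-size flag splits the unquantized weights across cards instead of relying on quantization. A sketch assuming a node with four H100s:

# vLLM tensor parallelism: bf16 72B weights spread across four GPUs
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 4 \
  --port 8000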

AMD RX 7900 XTX (24GB)

Best: Ollama (most stable ROCm support).

infernet setup  # Ollama will be auto-selected
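
Model management is then the same as on any other platform. For example, pulling a quantized 14B model (roughly 9 GB with Ollama's default quantization) leaves headroom in 24 GB of VRAM:

ollama pull qwen2.5:14b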

Apple Silicon M2/M3

Best: Ollama for convenience, llama.cpp for maximum performance.

# Ollama (easiest)
ollama pull qwen2.5:14b

# llama.cpp for performance-critical nodes
llama-server \
  --model qwen2.5-14b-q4km.gguf \
  --n-gpu-layers 99 \
  --port 8080
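
Both servers expose an OpenAI-compatible HTTP API, so a quick smoke test against the llama-server instance above looks like this (the model field is a placeholder; llama-server answers with whichever model it was started with):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "messages": [{"role": "user", "content": "Say hello"}]}'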

CPU Only

Best: llama.cpp with Q4_K_M quantization.

llama-server \
  --model qwen2.5-7b-q4km.gguf \
  --n-gpu-layers 0 \
  --threads $(nproc) \
  --port 8080
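
llama.cpp also ships llama-bench, which gives a quick read on prompt-processing and generation speed for a GGUF file before you wire the node into Infernet. A sketch using the same model and thread count:

# measure prompt-processing (pp) and token-generation (tg) rates on all cores
llama-bench -m qwen2.5-7b-q4km.gguf -t $(nproc)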

Switching Backends

Switching is a one-line config change:

{
  "backend": "vllm",
  "vllm_host": "http://localhost:8000",
  "vllm_model": "Qwen/Qwen2.5-14B-Instruct"
}
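
Before restarting, it is worth confirming the new backend actually answers on the configured host. vLLM, like the other backends discussed here, exposes an OpenAI-compatible endpoint, so listing models is a quick check (the URL matches vllm_host above):

curl http://localhost:8000/v1/models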

Then restart the daemon:

infernet service restart
# or
infernet stop && infernet start

The control plane is updated on the next heartbeat. No re-registration required.


Benchmarking Your Setup

Before committing to a backend, benchmark it with realistic traffic:

# Install benchmark tool
pip install "sglang[benchmark]"

# Benchmark at a rate of 16 requests/second, 512 output tokens each
python -m sglang.bench_serving \
  --backend openai \
  --base-url http://localhost:8000 \
  --model Qwen/Qwen2.5-14B-Instruct \
  --num-prompts 100 \
  --request-rate 16 \
  --output-len 512

Key metrics to compare:

- Throughput (tokens/second): total output tokens divided by total time
- First token latency (TTFT): time from request to first token; this drives perceived responsiveness
- Inter-token latency: time between tokens in a stream; should be consistent
- P99 latency: worst-case latency under load
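
As a hypothetical worked example: if the run above finishes 100 prompts with about 512 output tokens each in 80 seconds of wall-clock time, throughput is 100 × 512 / 80 ≈ 640 output tokens/second. Compare that number, along with TTFT and P99, across backends on the same hardware and model before settling on one.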