Ollama is the default backend. It handles NVIDIA, AMD (ROCm), Apple Silicon (Metal), and CPU inference with no additional configuration. If you’re not sure which backend to use, start with Ollama.
infernet setup installs Ollama automatically if no
backend is detected. To install it manually:
curl -fsSL https://ollama.com/install.sh | sh

This installs the ollama binary and configures it as a systemd service.
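If the installer did not start the service for you, you can enable and start it yourself (this assumes a systemd-based distro, which is what the install script targets):

sudo systemctl enable --now ollama
systemctl status ollama   # should report "active (running)"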
Verify the install:
ollama --version
# ollama version 0.5.4
ollama list
# NAME    ID    SIZE    MODIFIED

Ollama manages its own model registry. Pull models before starting the Infernet daemon:
ollama pull qwen2.5:14b
ollama pull qwen2.5:7b
ollama pull llama3.2:3b
ollama pull nomic-embed-text

The default tags resolve to a Q4_K_M quantization for most models. To pull a specific quantization explicitly:
# Q4_K_M — good balance of quality and size
ollama pull qwen2.5:14b-instruct-q4_K_M
# Q8 — higher quality, needs more VRAM
ollama pull qwen2.5:14b-instruct-q8_0
# Full precision FP16
ollama pull qwen2.5:14b-instruct-fp16

List your pulled models:
ollama list

NAME                       ID              SIZE      MODIFIED
qwen2.5:14b                d8c5e0b67b1e    8.1 GB    2 hours ago
llama3.2:3b                a80c4f17acd5    2.0 GB    1 day ago
nomic-embed-text:latest    0a109f422b47    274 MB    3 days ago
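To see which quantization, parameter count, and context length a pulled tag actually resolved to, ollama show prints the model's details:

ollama show qwen2.5:14b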
Ollama uses CUDA for NVIDIA GPUs. CUDA is bundled with Ollama — you don’t need to install it separately (though having it installed doesn’t hurt). You do need the NVIDIA driver.
# Verify NVIDIA GPU is detected
nvidia-smi
ollama run qwen2.5:7b "hello" 2>&1 | head -5

Ollama will use all available NVIDIA GPUs. To restrict to specific GPUs:

CUDA_VISIBLE_DEVICES=0 ollama serve
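To see which index belongs to which card before restricting, nvidia-smi can list them; CUDA_VISIBLE_DEVICES takes these indices (comma-separated for more than one GPU):

nvidia-smi -L   # prints one line per GPU with its index, name, and UUID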
ROCm support is included in the official Ollama Linux build. Works with RX 6000 series and newer:

# Verify ROCm is detected
rocm-smi
ollama run qwen2.5:7b "hello"

If Ollama doesn't detect your AMD GPU, check:
# Should list your GPU
/opt/rocm/bin/rocminfo | grep "Agent 2" -A 10

Cards that ROCm doesn't officially support may need HSA_OVERRIDE_GFX_VERSION to report a supported architecture:
HSA_OVERRIDE_GFX_VERSION=10.3.0 ollama serve
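To avoid passing the override on every launch, the same systemd drop-in mechanism used below for the other environment variables also works here (a sketch assuming the stock ollama.service unit installed by the script):

# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=10.3.0"

sudo systemctl daemon-reload
sudo systemctl restart ollama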
Ollama uses Metal on Apple Silicon. No configuration needed:

ollama run qwen2.5:7b "hello"
# Uses GPU automatically

Ollama treats the unified memory as both system RAM and VRAM, so a MacBook Pro with 36GB unified memory can run larger models than an NVIDIA GPU with 24GB VRAM.
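For example, a roughly 20 GB model such as qwen2.5:32b (an illustrative pick, not an Infernet requirement; the exact size depends on the quantization tag) can stay fully GPU-resident on that machine:

ollama pull qwen2.5:32b
ollama run qwen2.5:32b "hello"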
If no GPU is detected, Ollama falls back to CPU inference using highly optimized GGML kernels. It’s slow but works:
OLLAMA_NUM_GPU=0 ollama serve  # Force CPU mode

Ollama's runtime behavior is configured through environment variables:

# Ollama server address (default: http://127.0.0.1:11434)
OLLAMA_HOST=0.0.0.0:11434
# Keep models in VRAM between requests (0 = unload after use, -1 = never unload)
OLLAMA_KEEP_ALIVE=5m
# Number of parallel requests per model (default: auto based on VRAM)
OLLAMA_NUM_PARALLEL=4
# Max number of models loaded simultaneously
OLLAMA_MAX_LOADED_MODELS=3
# Context window size override
OLLAMA_NUM_CTX=8192
# Force CPU-only mode
OLLAMA_NUM_GPU=0
# Custom models directory
OLLAMA_MODELS=/mnt/fast-nvme/ollama/models

Set these in /etc/systemd/system/ollama.service.d/override.conf for persistent configuration:
[Service]
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MODELS=/mnt/fast-nvme/ollama/models"sudo systemctl daemon-reload
sudo systemctl restart ollamaIn your Infernet config (~/.infernet/config.json):
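To confirm the drop-in was picked up, systemd can print the unit's effective environment (a generic systemd check, nothing Ollama-specific):

systemctl show ollama -p Environment
# Environment=OLLAMA_KEEP_ALIVE=-1 OLLAMA_NUM_PARALLEL=4 ...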
In your Infernet config (~/.infernet/config.json):

{
  "backend": "ollama",
  "ollama_host": "http://localhost:11434"
}

If Ollama is running on a different machine or port:
{
  "backend": "ollama",
  "ollama_host": "http://192.168.1.50:11434"
}
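For a remote host, Ollama must be listening on a non-loopback address (the OLLAMA_HOST=0.0.0.0:11434 setting above). You can verify reachability from the Infernet machine with a plain HTTP request to Ollama's version endpoint:

curl http://192.168.1.50:11434/api/version
# {"version":"0.5.4"}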
Ollama not starting:

sudo systemctl status ollama
journalctl -u ollama -n 50
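If the logs don't make the failure obvious, running the server in the foreground prints startup errors straight to the terminal (stop the systemd unit first so the port isn't already in use):

sudo systemctl stop ollama
ollama serve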
GPU not detected:

# NVIDIA
nvidia-smi        # Should show GPU
ls /dev/nvidia*   # Should show devices

# AMD
rocm-smi --showid
ls /dev/kfd       # Should exist for ROCm

Slow inference (possible CPU fallback):
# Run with debug output to see which device is used
OLLAMA_DEBUG=1 ollama run qwen2.5:7b "hello" 2>&1 | grep -i gpu
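Another quick check is ollama ps; when a loaded model doesn't fully fit in VRAM, the PROCESSOR column reports a CPU/GPU split instead of 100% GPU:

ollama ps
# PROCESSOR "100% GPU" is healthy; a "CPU/GPU" split means part of the model is in system RAM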
Out of memory:

# Check current VRAM usage
nvidia-smi --query-gpu=memory.used,memory.free --format=csv

# See what's loaded, then unload it
ollama ps
ollama stop qwen2.5:14b
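If out-of-memory errors keep recurring, the environment variables described earlier can reduce pressure; for example, in the systemd override:

Environment="OLLAMA_KEEP_ALIVE=0"          # unload each model right after use
Environment="OLLAMA_MAX_LOADED_MODELS=1"   # keep at most one model resident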