Best VPS for Ollama (2025) — Run Local LLMs
Ollama has revolutionized the way developers and businesses run large language models. For broader AI workflows, see VPS for AI Agents by making it trivially simple to download, manage, and serve models like Llama 3, Mistral, and CodeLlama on any Linux machine. Instead of dealing with complex Python environments, CUDA dependencies, and model conversion scripts, Ollama handles everything with a single command. But the ease of setup means the bottleneck shifts entirely to your hardware — and choosing the right VPS is the difference between a responsive chatbot and a server that crawls at two tokens per second.
This guide covers everything you need to know about running Ollama on a VPS in 2025: hardware requirements for every popular model, step-by-step installation, performance benchmarks, optimization techniques, and a comparison of the best VPS providers. Whether you want to build a private ChatGPT alternative or power an AI-powered application, this guide will help you select and configure the right server.
Why Run Ollama on a VPS
Running Ollama locally on your laptop works for experimentation, but production deployments demand a server. A VPS provides 24/7 availability, consistent bandwidth, and dedicated resources that a laptop cannot match. When your AI-powered application handles requests from users across multiple time zones, downtime is not an option. For European hosting with low latency, consider a Germany VPS or Netherlands VPS.
Cost is another compelling reason. API-based LLM services like OpenAI charge per token — a moderately trafficked application can quickly rack up bills of $200-500 per month. Running Ollama on a VPS gives you flat-rate pricing. A $20 VPS can serve thousands of requests per day with zero per-token costs. For teams running multiple agents or processing large document batches, the savings compound rapidly.
Data privacy is increasingly important for organizations in regulated industries. When you send prompts to OpenAI or Anthropic, that data is processed on infrastructure you do not control and may be used for model training. Running Ollama on your own VPS means your data never leaves your server. This is essential for healthcare, legal, financial, and any domain handling sensitive information subject to GDPR or HIPAA requirements.
Hardware Requirements for Ollama Models
Ollama runs entirely on CPU, which means RAM is your most critical resource. The model weights must be loaded into memory before inference can begin. Quantized models (GGUF format) reduce memory requirements by compressing weights from 16-bit floating point to 4-bit integers, but the memory footprint is still substantial for larger models. The following table lists the RAM requirements for the most popular models in 2025.
| Model | Parameters | Quantization | RAM Required | Min VPS Plan | Best Use Case |
|---|---|---|---|---|---|
| Phi-3 Mini | 3.8B | Q4_K_M | ~2.5 GB | 1 vCPU / 4 GB | Simple chatbots, classification tasks |
| Gemma 2 2B | 2B | Q4_K_M | ~1.5 GB | 1 vCPU / 2 GB | Lightweight tasks, embedded systems |
| Llama 3 8B | 8B | Q4_K_M | ~4.8 GB | 2 vCPU / 8 GB | General-purpose chat, RAG pipelines |
| Mistral 7B | 7B | Q4_K_M | ~4.2 GB | 2 vCPU / 8 GB | Code generation, reasoning tasks |
| Qwen 2.5 7B | 7B | Q4_K_M | ~4.5 GB | 2 vCPU / 8 GB | Multilingual, instruction following |
| CodeLlama 13B | 13B | Q4_K_M | ~7.8 GB | 4 vCPU / 16 GB | Advanced code generation, debugging |
| Llama 3.1 70B | 70B | Q2_K | ~28 GB | 8 vCPU / 32 GB | Enterprise-grade reasoning, analysis |
| DeepSeek Coder 33B | 33B | Q4_K_M | ~19 GB | 8 vCPU / 32 GB | Large-scale code understanding |
CPU Considerations
Without a GPU, Ollama relies entirely on CPU for inference. CPU single-thread performance directly determines token generation speed. AMD Ryzen 9 7950X processors deliver the best single-thread performance available on VPS platforms, generating 6-10 tokens per second for 8B models. AMD EPYC and Intel Xeon server processors provide 4-8 tokens per second for the same models due to lower clock speeds. For interactive applications where latency matters, prioritize single-thread performance over core count.
Multi-core utilization improves when serving multiple concurrent requests. With 4 vCPUs, Ollama can process 2-3 simultaneous inference requests without significant degradation. For a single-user setup, 2 vCPUs is sufficient. For an API endpoint serving a small team, 4 vCPUs is recommended. For public-facing applications, consider 8 vCPUs.
Storage Requirements
Model files range from 2 GB (small quantized models) to 40 GB (large models with higher quantization). NVMe storage significantly reduces model loading time — a 4.8 GB Llama 3 8B file loads in approximately 3 seconds on NVMe versus 15-20 seconds on SATA SSD. For frequently switching between models, this difference matters. Always choose NVMe storage for Ollama workloads.
Step-by-Step Ollama Installation
The following commands install and configure Ollama on a fresh Ubuntu 22.04 or 24.04 VPS. The entire process takes approximately 5 minutes.
Step 1: Update System and Install Dependencies
# Update system packages
sudo apt update && sudo apt upgrade -y
# Install essential tools
sudo apt install -y curl wget gnupg ca-certificates
# Verify available memory
free -h
df -h
Step 2: Install Ollama
# Install Ollama using the official install script
curl -fsSL https://ollama.com/install.sh | sh
# Start and enable the Ollama service
sudo systemctl start ollama
sudo systemctl enable ollama
# Verify the service is running
sudo systemctl status ollama
Step 3: Download and Run Your First Model
# Pull Llama 3 8B (most popular general-purpose model)
ollama pull llama3:8b
# Test inference with a simple prompt
ollama run llama3:8b "Explain quantum computing in 3 sentences"
# Check which models are installed
ollama list
# Verify Ollama API is responding
curl http://localhost:11434/api/tags
Step 4: Configure Ollama for Remote Access
By default, Ollama only listens on localhost. To access it from other machines or expose it behind a reverse proxy, configure the listen address.
# Edit the Ollama service configuration
sudo systemctl edit ollama
# Add the following content (press Ctrl+X, then Y, then Enter to save):
# [Service]
# Environment="OLLAMA_HOST=0.0.0.0:11434"
# Restart the service
sudo systemctl daemon-reload
sudo systemctl restart ollama
# Verify it's listening on all interfaces
ss -tlnp | grep 11434
Step 5: Set Up Nginx Reverse Proxy with Authentication
# Install Nginx
sudo apt install -y nginx
# Create API key file for authentication
echo -n 'admin:' | sudo tee /etc/nginx/.ollama_api_key
openssl rand -base64 32 | sudo tee -a /etc/nginx/.ollama_api_key
# Install htpasswd utility and create password file
sudo apt install -y apache2-utils
sudo htpasswd -c /etc/nginx/.ollama_pass admin
# Create Nginx configuration
sudo tee /etc/nginx/sites-available/ollama << 'EOF'
server {
listen 80;
server_name ollama.yourdomain.com;
location / {
auth_basic "Ollama API";
auth_basic_user_file /etc/nginx/.ollama_pass;
proxy_pass http://127.0.0.1:11434;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# Allow large model uploads
client_max_body_size 0;
# Streaming support for long responses
proxy_buffering off;
proxy_cache off;
proxy_read_timeout 600s;
}
}
EOF
# Enable the site and restart Nginx
sudo ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx
# Add SSL with Certbot
sudo apt install -y certbot python3-certbot-nginx
sudo certbot --nginx -d ollama.yourdomain.com
Running Ollama with Docker
Docker provides a clean, isolated environment for Ollama that simplifies updates and management. For more on Docker, see Docker on Ubuntu VPS. The Docker approach is recommended for production deployments where you want reproducible configurations.
# Create a directory for Ollama data
mkdir -p ~/ollama-data
# Run Ollama in a Docker container
docker run -d \
--name ollama \
--restart unless-stopped \
-v ~/ollama-data:/root/.ollama \
-p 11434:11434 \
ollama/ollama:latest
# Pull a model inside the container
docker exec ollama ollama pull llama3:8b
# Test inference
docker exec ollama ollama run llama3:8b "Hello, how are you?"
Docker Compose Configuration
# Create docker-compose.yml
cat > ~/ollama-docker/docker-compose.yml << 'EOF'
version: '3.8'
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
restart: unless-stopped
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
deploy:
resources:
reservations:
memory: 6G
environment:
- OLLAMA_HOST=0.0.0.0:11434
- OLLAMA_ORIGINS=*
open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: open-webui
restart: unless-stopped
ports:
- "3000:8080"
volumes:
- open_webui_data:/app/backend/data
environment:
- OLLAMA_BASE_URL=http://ollama:11434
depends_on:
- ollama
volumes:
ollama_data:
open_webui_data:
EOF
# Start the stack
cd ~/ollama-docker && docker compose up -d
This configuration includes Open WebUI, a ChatGPT-like interface that connects to your local Ollama instance. It provides a web-based chat interface, model management, conversation history, and multi-user support — perfect for sharing your Ollama deployment with a team.
Reducing Memory Usage with Quantization
Quantization is the process of compressing model weights to reduce memory requirements and improve inference speed. Ollama automatically uses quantized (GGUF) versions of models. Understanding the quantization levels helps you choose the right balance between model quality and resource consumption.
| Quantization | Bits per Weight | Size vs FP16 | Quality Loss | Recommended For |
|---|---|---|---|---|
| Q2_K | 2-bit | ~17% | Noticeable degradation | Testing, very constrained hardware |
| Q3_K_M | 3-bit | ~22% | Moderate degradation | When Q4 does not fit in RAM |
| Q4_K_M | 4-bit | ~29% | Minimal degradation | Best balance of quality and size |
| Q5_K_M | 5-bit | ~36% | Nearly imperceptible | High quality, moderate RAM |
| Q8_0 | 8-bit | ~57% | Almost none | Maximum quality on limited hardware |
| FP16 | 16-bit | 100% | None (baseline) | GPU inference, not practical on CPU |
Performance Benchmarks
We tested Ollama inference performance across three VPS providers using identical configurations: Ubuntu 24.04, Llama 3 8B Q4_K_M, single-user sequential inference. Tokens per second were measured using the Ollama API with a standard prompt.
| Provider | Plan | Price/mo | CPU | RAM | Tokens/sec (Llama 3 8B) | Time to First Token |
|---|---|---|---|---|---|---|
| Inferno VPS | Pro | $19.99 | 4 vCPU (Ryzen 9 7950X) | 8 GB | 9.2 | 0.8s |
| Hetzner | CX32 | $8.86 | 4 vCPU (EPYC) | 8 GB | 6.4 | 1.2s |
| DigitalOcean | 4 vCPU / 8 GB | $48.00 | 4 vCPU (Xeon) | 8 GB | 5.8 | 1.4s |
| Vultr | 4 vCPU / 8 GB | $48.00 | 4 vCPU (EPYC) | 8 GB | 6.1 | 1.3s |
| Contabo | VPS S | $7.99 | 4 vCPU (EPYC) | 8 GB | 4.2 | 2.1s |
Inferno VPS leads with 9.2 tokens per second on the Ryzen 9 7950X — 44% faster than Hetzner and 119% faster than Contabo. The high single-thread clock speed of the Ryzen processor provides a clear advantage for CPU-bound LLM inference. At $19.99/month, Inferno delivers better performance than DigitalOcean and Vultr at less than half the price.
Recommended Inferno VPS Plans for Ollama
| Models Supported | Inferno Plan | vCPU | RAM | Storage | Price/mo |
|---|---|---|---|---|---|
| Phi-3 Mini, Gemma 2B | Starter | 1 | 1 GB | 20 GB NVMe | $3.49 |
| Llama 3 8B, Mistral 7B, Qwen 2.5 7B | Pro | 4 | 8 GB | 80 GB NVMe | $19.99 |
| CodeLlama 13B, Qwen 2.5 14B | Enterprise | 4 | 16 GB | 160 GB NVMe | $29.99 |
| DeepSeek Coder 33B, Llama 3.1 70B (Q2) | Elite | 8 | 32 GB | 320 GB NVMe | $49.99 |
Optimization Tips for Better Performance
Configure Swap Space
Swap space prevents out-of-memory crashes during model loading. Configure 4-8 GB of swap as a safety net.
# Create 4 GB swap file
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
Set OLLAMA_NUM_PARALLEL
This environment variable controls how many concurrent requests Ollama processes. Setting it too high on limited hardware causes memory thrashing.
# For 2 vCPU: process 1 request at a time
sudo systemctl edit ollama
# Add: Environment="OLLAMA_NUM_PARALLEL=1"
# For 4 vCPU: process 2 requests concurrently
# Add: Environment="OLLAMA_NUM_PARALLEL=2"
# For 8 vCPU: process 4 requests concurrently
# Add: Environment="OLLAMA_NUM_PARALLEL=4"
Limit Model Keep-Alive
By default, Ollama keeps models in memory for 5 minutes after the last request. Reduce this to free RAM when switching between models.
# Unload models after 2 minutes of inactivity
export OLLAMA_KEEP_ALIVE=2m
Pros and Cons: VPS for Ollama
Advantages
- Flat monthly pricing with zero per-token costs — predictable budgeting
- Full data privacy — prompts never leave your server
- 24/7 availability for API endpoints and chatbot deployments
- NVMe storage enables fast model loading and switching
- Docker support for reproducible, isolated deployments
- No vendor lock-in — migrate to any Linux server anytime
- Can serve multiple models simultaneously with proper configuration
Disadvantages
- CPU inference is significantly slower than GPU-based cloud services
- Large models (70B+) require 32+ GB RAM, which is expensive
- No auto-scaling — must manually upgrade VPS for increased traffic
- Requires Linux administration skills for setup and maintenance
- Token generation speed (4-10 tps) may be too slow for real-time applications
- You are responsible for security, backups, and monitoring
Frequently Asked Questions
What is the minimum VPS spec for running Ollama?
The absolute minimum is 1 vCPU and 2 GB RAM, which can run small models like Gemma 2B (1.5 GB). However, for practical use with models like Llama 3 8B or Mistral 7B, you need at least 2 vCPU and 8 GB RAM. The model weights alone consume 4-5 GB, leaving 3-4 GB for the OS and runtime.
Can I run multiple models simultaneously on one VPS?
Yes, but you need sufficient RAM. Running Llama 3 8B and Mistral 7B simultaneously requires approximately 10 GB of RAM just for the models. With a 16 GB VPS, this is feasible. Ollama automatically unloads models from memory after a period of inactivity (configurable via OLLAMA_KEEP_ALIVE), so memory is freed when a model is not being used.
How does Ollama compare to using OpenAI API?
OpenAI GPT-4 produces higher quality outputs and generates tokens faster (cloud GPU infrastructure). However, OpenAI charges per token and your data leaves your server. Ollama with Llama 3 8B provides 80-90% of GPT-3.5 quality at zero marginal cost after the VPS is provisioned. For cost-sensitive applications or privacy requirements, Ollama on a VPS is the better choice.
Is Docker required for Ollama?
No. Ollama installs as a systemd service via the official install script and works perfectly without Docker. Docker is optional and useful if you want container isolation, easier management with Docker Compose, or if you are running Ollama alongside other services in a containerized stack.
How do I access Ollama from my local machine?
Use SSH port forwarding for secure access without opening the Ollama port to the internet: ssh -L 11434:localhost:11434 user@your-vps-ip. Then connect your local Ollama client or Open WebUI to localhost:11434. For team access, set up an Nginx reverse proxy with basic authentication and SSL.
What is the best model for a VPS with 8 GB RAM?
Llama 3 8B (Q4_K_M) is the best all-around model for 8 GB VPS instances. It handles general chat, coding assistance, summarization, and RAG pipelines well. Mistral 7B is a strong alternative for code-focused tasks. Both models generate 6-10 tokens per second on a 4 vCPU VPS.
Can I use Ollama for a production API?
Yes. Ollama provides an OpenAI-compatible API endpoint on port 11434. You can point any application that supports OpenAI API format to your Ollama instance by changing the base URL. Many frameworks including LangChain, LlamaIndex, and Open WebUI support Ollama natively. Add authentication via Nginx and consider rate limiting for public endpoints.
How much storage do I need for Ollama?
Each model consumes storage equal to its quantized file size: 2-3 GB for small models (Gemma 2B), 4-5 GB for 7-8B models, 8-10 GB for 13-14B models, and 20-40 GB for 70B models. Plan for 2x your total model size to accommodate future downloads and Docker image layers. NVMe storage is strongly recommended for fast model loading.