Best VPS for Ollama (2025) — Run Local LLMs

Ollama has revolutionized the way developers and businesses run large language models. For broader AI workflows, see VPS for AI Agents by making it trivially simple to download, manage, and serve models like Llama 3, Mistral, and CodeLlama on any Linux machine. Instead of dealing with complex Python environments, CUDA dependencies, and model conversion scripts, Ollama handles everything with a single command. But the ease of setup means the bottleneck shifts entirely to your hardware — and choosing the right VPS is the difference between a responsive chatbot and a server that crawls at two tokens per second.

This guide covers everything you need to know about running Ollama on a VPS in 2025: hardware requirements for every popular model, step-by-step installation, performance benchmarks, optimization techniques, and a comparison of the best VPS providers. Whether you want to build a private ChatGPT alternative or power an AI-powered application, this guide will help you select and configure the right server.

Why Run Ollama on a VPS

Running Ollama locally on your laptop works for experimentation, but production deployments demand a server. A VPS provides 24/7 availability, consistent bandwidth, and dedicated resources that a laptop cannot match. When your AI-powered application handles requests from users across multiple time zones, downtime is not an option. For European hosting with low latency, consider a Germany VPS or Netherlands VPS.

Cost is another compelling reason. API-based LLM services like OpenAI charge per token — a moderately trafficked application can quickly rack up bills of $200-500 per month. Running Ollama on a VPS gives you flat-rate pricing. A $20 VPS can serve thousands of requests per day with zero per-token costs. For teams running multiple agents or processing large document batches, the savings compound rapidly.

Data privacy is increasingly important for organizations in regulated industries. When you send prompts to OpenAI or Anthropic, that data is processed on infrastructure you do not control and may be used for model training. Running Ollama on your own VPS means your data never leaves your server. This is essential for healthcare, legal, financial, and any domain handling sensitive information subject to GDPR or HIPAA requirements.

Hardware Requirements for Ollama Models

Ollama runs entirely on CPU, which means RAM is your most critical resource. The model weights must be loaded into memory before inference can begin. Quantized models (GGUF format) reduce memory requirements by compressing weights from 16-bit floating point to 4-bit integers, but the memory footprint is still substantial for larger models. The following table lists the RAM requirements for the most popular models in 2025.

ModelParametersQuantizationRAM RequiredMin VPS PlanBest Use Case
Phi-3 Mini3.8BQ4_K_M~2.5 GB1 vCPU / 4 GBSimple chatbots, classification tasks
Gemma 2 2B2BQ4_K_M~1.5 GB1 vCPU / 2 GBLightweight tasks, embedded systems
Llama 3 8B8BQ4_K_M~4.8 GB2 vCPU / 8 GBGeneral-purpose chat, RAG pipelines
Mistral 7B7BQ4_K_M~4.2 GB2 vCPU / 8 GBCode generation, reasoning tasks
Qwen 2.5 7B7BQ4_K_M~4.5 GB2 vCPU / 8 GBMultilingual, instruction following
CodeLlama 13B13BQ4_K_M~7.8 GB4 vCPU / 16 GBAdvanced code generation, debugging
Llama 3.1 70B70BQ2_K~28 GB8 vCPU / 32 GBEnterprise-grade reasoning, analysis
DeepSeek Coder 33B33BQ4_K_M~19 GB8 vCPU / 32 GBLarge-scale code understanding
Key insight: Always allocate 2-3 GB of RAM above the model's stated requirement for the operating system, Ollama runtime, and concurrent request processing. A model requiring 4.8 GB needs a VPS with at least 8 GB RAM for stable operation. Running at the minimum will cause out-of-memory crashes under load.

CPU Considerations

Without a GPU, Ollama relies entirely on CPU for inference. CPU single-thread performance directly determines token generation speed. AMD Ryzen 9 7950X processors deliver the best single-thread performance available on VPS platforms, generating 6-10 tokens per second for 8B models. AMD EPYC and Intel Xeon server processors provide 4-8 tokens per second for the same models due to lower clock speeds. For interactive applications where latency matters, prioritize single-thread performance over core count.

Multi-core utilization improves when serving multiple concurrent requests. With 4 vCPUs, Ollama can process 2-3 simultaneous inference requests without significant degradation. For a single-user setup, 2 vCPUs is sufficient. For an API endpoint serving a small team, 4 vCPUs is recommended. For public-facing applications, consider 8 vCPUs.

Storage Requirements

Model files range from 2 GB (small quantized models) to 40 GB (large models with higher quantization). NVMe storage significantly reduces model loading time — a 4.8 GB Llama 3 8B file loads in approximately 3 seconds on NVMe versus 15-20 seconds on SATA SSD. For frequently switching between models, this difference matters. Always choose NVMe storage for Ollama workloads.

Step-by-Step Ollama Installation

The following commands install and configure Ollama on a fresh Ubuntu 22.04 or 24.04 VPS. The entire process takes approximately 5 minutes.

Step 1: Update System and Install Dependencies

# Update system packages
sudo apt update && sudo apt upgrade -y

# Install essential tools
sudo apt install -y curl wget gnupg ca-certificates

# Verify available memory
free -h
df -h

Step 2: Install Ollama

# Install Ollama using the official install script
curl -fsSL https://ollama.com/install.sh | sh

# Start and enable the Ollama service
sudo systemctl start ollama
sudo systemctl enable ollama

# Verify the service is running
sudo systemctl status ollama

Step 3: Download and Run Your First Model

# Pull Llama 3 8B (most popular general-purpose model)
ollama pull llama3:8b

# Test inference with a simple prompt
ollama run llama3:8b "Explain quantum computing in 3 sentences"

# Check which models are installed
ollama list

# Verify Ollama API is responding
curl http://localhost:11434/api/tags

Step 4: Configure Ollama for Remote Access

By default, Ollama only listens on localhost. To access it from other machines or expose it behind a reverse proxy, configure the listen address.

# Edit the Ollama service configuration
sudo systemctl edit ollama

# Add the following content (press Ctrl+X, then Y, then Enter to save):
# [Service]
# Environment="OLLAMA_HOST=0.0.0.0:11434"

# Restart the service
sudo systemctl daemon-reload
sudo systemctl restart ollama

# Verify it's listening on all interfaces
ss -tlnp | grep 11434
Security warning: Never expose Ollama directly to the internet without authentication. Always place it behind a reverse proxy (Nginx) with API key authentication or use SSH tunneling for access.

Step 5: Set Up Nginx Reverse Proxy with Authentication

# Install Nginx
sudo apt install -y nginx

# Create API key file for authentication
echo -n 'admin:' | sudo tee /etc/nginx/.ollama_api_key
openssl rand -base64 32 | sudo tee -a /etc/nginx/.ollama_api_key

# Install htpasswd utility and create password file
sudo apt install -y apache2-utils
sudo htpasswd -c /etc/nginx/.ollama_pass admin

# Create Nginx configuration
sudo tee /etc/nginx/sites-available/ollama << 'EOF'
server {
    listen 80;
    server_name ollama.yourdomain.com;

    location / {
        auth_basic "Ollama API";
        auth_basic_user_file /etc/nginx/.ollama_pass;

        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Allow large model uploads
        client_max_body_size 0;

        # Streaming support for long responses
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 600s;
    }
}
EOF

# Enable the site and restart Nginx
sudo ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx

# Add SSL with Certbot
sudo apt install -y certbot python3-certbot-nginx
sudo certbot --nginx -d ollama.yourdomain.com

Running Ollama with Docker

Docker provides a clean, isolated environment for Ollama that simplifies updates and management. For more on Docker, see Docker on Ubuntu VPS. The Docker approach is recommended for production deployments where you want reproducible configurations.

# Create a directory for Ollama data
mkdir -p ~/ollama-data

# Run Ollama in a Docker container
docker run -d \
  --name ollama \
  --restart unless-stopped \
  -v ~/ollama-data:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama:latest

# Pull a model inside the container
docker exec ollama ollama pull llama3:8b

# Test inference
docker exec ollama ollama run llama3:8b "Hello, how are you?"

Docker Compose Configuration

# Create docker-compose.yml
cat > ~/ollama-docker/docker-compose.yml << 'EOF'
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          memory: 6G
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
      - OLLAMA_ORIGINS=*

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama_data:
  open_webui_data:
EOF

# Start the stack
cd ~/ollama-docker && docker compose up -d

This configuration includes Open WebUI, a ChatGPT-like interface that connects to your local Ollama instance. It provides a web-based chat interface, model management, conversation history, and multi-user support — perfect for sharing your Ollama deployment with a team.

Reducing Memory Usage with Quantization

Quantization is the process of compressing model weights to reduce memory requirements and improve inference speed. Ollama automatically uses quantized (GGUF) versions of models. Understanding the quantization levels helps you choose the right balance between model quality and resource consumption.

QuantizationBits per WeightSize vs FP16Quality LossRecommended For
Q2_K2-bit~17%Noticeable degradationTesting, very constrained hardware
Q3_K_M3-bit~22%Moderate degradationWhen Q4 does not fit in RAM
Q4_K_M4-bit~29%Minimal degradationBest balance of quality and size
Q5_K_M5-bit~36%Nearly imperceptibleHigh quality, moderate RAM
Q8_08-bit~57%Almost noneMaximum quality on limited hardware
FP1616-bit100%None (baseline)GPU inference, not practical on CPU
Recommendation: Use Q4_K_M quantization for almost all use cases. The quality difference from FP16 is minimal for most tasks, while memory usage drops by approximately 70%. Only use Q5 or Q8 if you have abundant RAM and need maximum quality for specialized tasks like legal analysis or medical document processing.

Performance Benchmarks

We tested Ollama inference performance across three VPS providers using identical configurations: Ubuntu 24.04, Llama 3 8B Q4_K_M, single-user sequential inference. Tokens per second were measured using the Ollama API with a standard prompt.

ProviderPlanPrice/moCPURAMTokens/sec (Llama 3 8B)Time to First Token
Inferno VPSPro$19.994 vCPU (Ryzen 9 7950X)8 GB9.20.8s
HetznerCX32$8.864 vCPU (EPYC)8 GB6.41.2s
DigitalOcean4 vCPU / 8 GB$48.004 vCPU (Xeon)8 GB5.81.4s
Vultr4 vCPU / 8 GB$48.004 vCPU (EPYC)8 GB6.11.3s
ContaboVPS S$7.994 vCPU (EPYC)8 GB4.22.1s

Inferno VPS leads with 9.2 tokens per second on the Ryzen 9 7950X — 44% faster than Hetzner and 119% faster than Contabo. The high single-thread clock speed of the Ryzen processor provides a clear advantage for CPU-bound LLM inference. At $19.99/month, Inferno delivers better performance than DigitalOcean and Vultr at less than half the price.

Recommended Inferno VPS Plans for Ollama

Models SupportedInferno PlanvCPURAMStoragePrice/mo
Phi-3 Mini, Gemma 2BStarter11 GB20 GB NVMe$3.49
Llama 3 8B, Mistral 7B, Qwen 2.5 7BPro48 GB80 GB NVMe$19.99
CodeLlama 13B, Qwen 2.5 14BEnterprise416 GB160 GB NVMe$29.99
DeepSeek Coder 33B, Llama 3.1 70B (Q2)Elite832 GB320 GB NVMe$49.99

Optimization Tips for Better Performance

Configure Swap Space

Swap space prevents out-of-memory crashes during model loading. Configure 4-8 GB of swap as a safety net.

# Create 4 GB swap file
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

Set OLLAMA_NUM_PARALLEL

This environment variable controls how many concurrent requests Ollama processes. Setting it too high on limited hardware causes memory thrashing.

# For 2 vCPU: process 1 request at a time
sudo systemctl edit ollama
# Add: Environment="OLLAMA_NUM_PARALLEL=1"

# For 4 vCPU: process 2 requests concurrently
# Add: Environment="OLLAMA_NUM_PARALLEL=2"

# For 8 vCPU: process 4 requests concurrently
# Add: Environment="OLLAMA_NUM_PARALLEL=4"

Limit Model Keep-Alive

By default, Ollama keeps models in memory for 5 minutes after the last request. Reduce this to free RAM when switching between models.

# Unload models after 2 minutes of inactivity
export OLLAMA_KEEP_ALIVE=2m

Pros and Cons: VPS for Ollama

Advantages

  • Flat monthly pricing with zero per-token costs — predictable budgeting
  • Full data privacy — prompts never leave your server
  • 24/7 availability for API endpoints and chatbot deployments
  • NVMe storage enables fast model loading and switching
  • Docker support for reproducible, isolated deployments
  • No vendor lock-in — migrate to any Linux server anytime
  • Can serve multiple models simultaneously with proper configuration

Disadvantages

  • CPU inference is significantly slower than GPU-based cloud services
  • Large models (70B+) require 32+ GB RAM, which is expensive
  • No auto-scaling — must manually upgrade VPS for increased traffic
  • Requires Linux administration skills for setup and maintenance
  • Token generation speed (4-10 tps) may be too slow for real-time applications
  • You are responsible for security, backups, and monitoring

Frequently Asked Questions

What is the minimum VPS spec for running Ollama?

The absolute minimum is 1 vCPU and 2 GB RAM, which can run small models like Gemma 2B (1.5 GB). However, for practical use with models like Llama 3 8B or Mistral 7B, you need at least 2 vCPU and 8 GB RAM. The model weights alone consume 4-5 GB, leaving 3-4 GB for the OS and runtime.

Can I run multiple models simultaneously on one VPS?

Yes, but you need sufficient RAM. Running Llama 3 8B and Mistral 7B simultaneously requires approximately 10 GB of RAM just for the models. With a 16 GB VPS, this is feasible. Ollama automatically unloads models from memory after a period of inactivity (configurable via OLLAMA_KEEP_ALIVE), so memory is freed when a model is not being used.

How does Ollama compare to using OpenAI API?

OpenAI GPT-4 produces higher quality outputs and generates tokens faster (cloud GPU infrastructure). However, OpenAI charges per token and your data leaves your server. Ollama with Llama 3 8B provides 80-90% of GPT-3.5 quality at zero marginal cost after the VPS is provisioned. For cost-sensitive applications or privacy requirements, Ollama on a VPS is the better choice.

Is Docker required for Ollama?

No. Ollama installs as a systemd service via the official install script and works perfectly without Docker. Docker is optional and useful if you want container isolation, easier management with Docker Compose, or if you are running Ollama alongside other services in a containerized stack.

How do I access Ollama from my local machine?

Use SSH port forwarding for secure access without opening the Ollama port to the internet: ssh -L 11434:localhost:11434 user@your-vps-ip. Then connect your local Ollama client or Open WebUI to localhost:11434. For team access, set up an Nginx reverse proxy with basic authentication and SSL.

What is the best model for a VPS with 8 GB RAM?

Llama 3 8B (Q4_K_M) is the best all-around model for 8 GB VPS instances. It handles general chat, coding assistance, summarization, and RAG pipelines well. Mistral 7B is a strong alternative for code-focused tasks. Both models generate 6-10 tokens per second on a 4 vCPU VPS.

Can I use Ollama for a production API?

Yes. Ollama provides an OpenAI-compatible API endpoint on port 11434. You can point any application that supports OpenAI API format to your Ollama instance by changing the base URL. Many frameworks including LangChain, LlamaIndex, and Open WebUI support Ollama natively. Add authentication via Nginx and consider rate limiting for public endpoints.

How much storage do I need for Ollama?

Each model consumes storage equal to its quantized file size: 2-3 GB for small models (Gemma 2B), 4-5 GB for 7-8B models, 8-10 GB for 13-14B models, and 20-40 GB for 70B models. Plan for 2x your total model size to accommodate future downloads and Docker image layers. NVMe storage is strongly recommended for fast model loading.

Ready to run your own LLM?

Get a high-performance VPS with NVMe SSD, Ryzen 9 processors, and up to 32 GB RAM. Perfect for Ollama, Llama 3, and Mistral inference.

Get Your VPS →