Running AI workloads on CPU-only VPS instances is fine for prototypes—but when you need to train models, run inference at scale, or serve real-time predictions, GPU compute is non-negotiable. Vultr Cloud GPU instances deliver NVIDIA A100 and H100 GPUs at competitive prices, with per-hour billing that makes GPU experimentation economically viable.

This guide walks through setting up a production-ready AI development environment on Vultr: Ubuntu 22.04 configuration, CUDA drivers, Python environment, PyTorch, and deploying your first model.

Why Choose Vultr for AI Development?

Three factors make Vultr a strong choice for AI/ML workloads in 2026:

  • Competitive GPU pricing: A100 instances start at ~$2.20/hr with hourly billing, so you only pay for the hours you use.
  • Global GPU availability: GPU instances available across 12+ regions including US, EU, and Asia-Pacific.
  • Flexible bare metal and cloud GPU: Choose dedicated bare metal for maximum performance or cloud GPU for elastic scaling.

Compared to AWS SageMaker or GCP Vertex AI, Vultr GPU instances give you raw compute at a fraction of the managed service premium. The tradeoff: you're responsible for driver installation, environment setup, and infrastructure management.

Vultr GPU Instance Plans 2026

Instance Type | GPU                     | VRAM | vCPUs | RAM  | Storage    | Price/hr
GPU-4 Plus    | NVIDIA A100             | 40GB | 8     | 32GB | 200GB NVMe | $2.20
GPU-8 Plus    | NVIDIA H100             | 80GB | 16    | 64GB | 400GB NVMe | $4.50
GPU Metal     | NVIDIA A100 (dedicated) | 40GB | 16    | 64GB | 1TB NVMe   | $3.20
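To compare plans for a specific job, the hourly prices in the table can be turned into a rough cost estimate. The sketch below is illustrative only: the PRICES mapping simply copies the table, the function name is made up for this example, and rounding partial hours up to a full hour is an assumption about hourly billing that you should check against an actual invoice.

```python
import math

# Hourly prices copied from the plan table (USD per hour).
PRICES = {"GPU-4 Plus": 2.20, "GPU-8 Plus": 4.50, "GPU Metal": 3.20}

def estimate_cost(plan: str, hours: float) -> float:
    """Estimate the cost of running `plan` for `hours`.

    Assumption: a partial hour is billed as a full hour.
    """
    billed_hours = math.ceil(hours)
    return round(PRICES[plan] * billed_hours, 2)

# A 6.5-hour fine-tuning run is billed as 7 hours under this assumption.
print(estimate_cost("GPU-4 Plus", 6.5))  # 15.4
```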

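For fine-tuning and inference on small to mid-sized models, a single A100 (40GB VRAM) is usually sufficient. Step up to the H100 (80GB) for larger models. Note that a 70B-parameter model needs about 140GB for its weights alone in FP16, so on a single 80GB card it only fits when quantized; otherwise it needs multiple GPUs.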
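A quick way to sanity-check whether a model fits on a given card is to multiply the parameter count by the bytes used per parameter. The sketch below covers the weights only; activations, KV cache, and optimizer state come on top, so treat the result as a lower bound. The function names are hypothetical helpers for this example.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the model weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

def weights_fit(num_params: float, bytes_per_param: float, vram_gb: float) -> bool:
    """True if the weights alone fit in the given VRAM (ignores activations and KV cache)."""
    return weight_memory_gb(num_params, bytes_per_param) <= vram_gb

print(weight_memory_gb(7e9, 2))     # 14.0  -> a 7B model in FP16
print(weight_memory_gb(70e9, 2))    # 140.0 -> a 70B model in FP16
print(weights_fit(70e9, 2, 80))     # False -> does not fit on one 80GB card
print(weights_fit(70e9, 0.5, 80))   # True  -> 4-bit weights take 35 GB
```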

Step 1: Deploy Ubuntu GPU Instance

Deploy via Vultr dashboard or CLI:

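# Using the Vultr CLI (vultr-cli)
# Plan and OS IDs differ by account and region; look them up first with
# `vultr-cli plans list` and `vultr-cli os list`, then substitute them below.
vultr-cli instance create \
  --region ewr \
  --plan YOUR_GPU_PLAN_ID \
  --os YOUR_UBUNTU_2204_OS_ID \
  --label ai-dev-gpu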

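# Or via API
# Replace YOUR_GPU_PLAN_ID with a plan from GET /v2/plans. The os_id must be the
# numeric ID for Ubuntu 22.04 x64; 1743 is assumed here, confirm it with GET /v2/os.
curl -X POST "https://api.vultr.com/v2/instances" \
  -H "Authorization: Bearer ${VULTR_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "region": "ewr",
    "plan": "YOUR_GPU_PLAN_ID",
    "os_id": 1743,
    "label": "ai-dev-gpu"
  }'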
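If you prefer to script deployments rather than call curl by hand, the same API request can be built with Python's standard library. This is a minimal sketch, not an official client: the function names are hypothetical, and the region, plan, and OS ID arguments are whatever values you would have passed to the curl call.

```python
import json
import urllib.request

API_URL = "https://api.vultr.com/v2/instances"

def build_create_request(api_key: str, region: str, plan: str,
                         os_id: int, label: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates an instance."""
    payload = {"region": region, "plan": plan, "os_id": os_id, "label": label}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def create_instance(req: urllib.request.Request) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

Keeping request construction separate from sending makes it easy to inspect the payload before anything billable is created. Read the API key from the VULTR_API_KEY environment variable rather than hard-coding it.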

After deployment, note the instance's IP address from the Vultr dashboard, then SSH in:

ssh root@YOUR_INSTANCE_IP

Step 2: Configure Ubuntu 22.04 for GPU Compute

Start with a full system update and essential packages:

# Update system
apt update && apt upgrade -y

# Install essential tools
apt install -y build-essential curl wget git unzip vim \
  software-properties-common gnupg apt-transport-https ca-certificates

Configure Network Repositories

For faster package downloads, configure Ubuntu's mirror selection:

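# Optional: switch to a regional mirror (example for the US).
# Check /etc/apt/sources.list first and adjust the left-hand URL to match what it contains.
sed -i.bak 's|http://archive.ubuntu.com/ubuntu|http://us.archive.ubuntu.com/ubuntu|g' \
  /etc/apt/sources.list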

apt update

Step 3: Install NVIDIA Drivers

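Vultr GPU instances come with NVIDIA GPU hardware attached, but the driver may not be installed on your image. Run nvidia-smi first; if it already reports the GPU, skip this step. Otherwise, the official NVIDIA CUDA repository is the most reliable installation method: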

# Add NVIDIA CUDA repository (the URL must be a single unbroken argument)
curl -fsSL -o cuda-keyring.deb \
  https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb

dpkg -i cuda-keyring.deb
apt update

# Install NVIDIA driver + CUDA toolkit
apt install -y nvidia-driver-545 cuda-toolkit-12-3

# Reboot so the new kernel module is loaded, then reconnect over SSH
reboot

# Verify driver installation (after reconnecting)
nvidia-smi

Expected output from nvidia-smi (exact versions, GPU name, and power figures will vary):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08   Driver Version: 545.23.08   CUDA Version: 12.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA A100-40GB     On   | 00000000:00:1E.0 Off |                    0 |
|  0%   37C    P0    37W / 250W |     0MiB / 40536MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
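The table layout is meant for people; for scripts, nvidia-smi also offers a CSV query mode. The sketch below parses that CSV into dictionaries so a health check or startup script can assert that the expected GPU is present. The helper names and the sample line are illustrative assumptions; only the parser is exercised here, since running the query itself needs a working driver.

```python
import subprocess

QUERY = "name,memory.total,memory.used,utilization.gpu"

def parse_gpu_csv(text: str) -> list[dict]:
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output."""
    gpus = []
    for line in text.strip().splitlines():
        name, mem_total, mem_used, util = [field.strip() for field in line.split(",")]
        gpus.append({
            "name": name,
            "memory_total_mib": int(mem_total),
            "memory_used_mib": int(mem_used),
            "utilization_pct": int(util),
        })
    return gpus

def query_gpus() -> list[dict]:
    """Run nvidia-smi and return one dict per GPU; raises if the command fails."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)

# Parsing a sample line shaped like the query output (values are illustrative).
sample = "NVIDIA A100-40GB, 40536, 0, 0\n"
print(parse_gpu_csv(sample)[0]["memory_total_mib"])  # 40536
```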

Step 4: Set Up Python Environment with CUDA Support

Use Miniconda for isolated, reproducible Python environments:

# Download and install Miniconda
cd /tmp
curl -LO https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b

# Initialize conda
source /root/miniconda3/etc/profile.d/conda.sh

# Create GPU-accelerated Python environment
conda create -n ai-env python=3.11 -y
conda activate ai-env

# Install PyTorch with CUDA 12.1 support
# (the cu121 wheels bundle their own CUDA runtime, so they work with the newer 12.3 driver)
pip install torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/cu121

# Verify CUDA availability in Python
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else None}')"

Expected output:

CUDA available: True
GPU: NVIDIA A100-40GB

Step 5: Deploy a Production ML Model

Let's deploy a text classification model using FastAPI for inference serving. This pattern works for any trained model—LLMs, image classifiers, recommendation engines.

Install FastAPI and dependencies

# Install web serving stack
pip install fastapi uvicorn transformers \
  huggingface-hub accelerate sentencepiece

# Create project structure
mkdir -p /opt/ml-api && cd /opt/ml-api
touch main.py requirements.txt

Create the inference API

# /opt/ml-api/main.py
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForSequenceClassification

app = FastAPI(title="Text Classifier API", version="1.0.0")

# Use the GPU when available; fall back to CPU so the app still starts without one
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pre-trained model once at startup and move it to the target device
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).to(DEVICE)
model.eval()

class TextInput(BaseModel):
    text: str

class PredictionOutput(BaseModel):
    label: str
    confidence: float
    model: str

# Plain `def` rather than `async def`: FastAPI runs it in a thread pool,
# so the blocking inference call does not stall the event loop.
@app.post("/predict", response_model=PredictionOutput)
def predict(payload: TextInput):
    inputs = tokenizer(payload.text, return_tensors="pt", truncation=True, max_length=512)
    # Input tensors must live on the same device as the model
    inputs = {name: tensor.to(DEVICE) for name, tensor in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    pred_id = int(torch.argmax(probs))

    return PredictionOutput(
        label=model.config.id2label[pred_id],  # "POSITIVE" or "NEGATIVE" for this checkpoint
        confidence=round(probs[pred_id].item(), 4),
        model=MODEL_NAME,
    )

@app.get("/health")
def health():
    return {"status": "healthy", "gpu": torch.cuda.is_available()}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Test the API

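# Start the server in the background
cd /opt/ml-api
python main.py &

# The first start downloads the model; wait until uvicorn logs that startup is complete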

# Test prediction
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "Vultr GPU instances are fantastic for AI development!"}'

Expected response:

{"label":"POSITIVE","confidence":0.9986,"model":"distilbert-base-uncased-finetuned-sst-2-english"}
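For calling the endpoint from another service, the curl test translates directly into a few lines of standard-library Python. This is a sketch under the assumption that the API runs at http://localhost:8000 and returns the JSON shape shown here; the function names are hypothetical.

```python
import json
import urllib.request

def build_predict_request(text: str,
                          base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST /predict request for a single piece of text."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_prediction(raw: bytes) -> tuple[str, float]:
    """Extract (label, confidence) from the API's JSON response body."""
    data = json.loads(raw)
    return data["label"], float(data["confidence"])

def classify(text: str, base_url: str = "http://localhost:8000") -> tuple[str, float]:
    """Send the text to the running API and return (label, confidence)."""
    with urllib.request.urlopen(build_predict_request(text, base_url), timeout=30) as resp:
        return parse_prediction(resp.read())
```

With the server running, classify("Great service") returns a tuple such as ("POSITIVE", 0.99...); the exact confidence depends on the input.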

Step 6: Set Up Nginx Reverse Proxy and SSL

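For production deployment, put Nginx in front of FastAPI as a reverse proxy that handles HTTPS termination and connection management. Once Nginx is in place, change the uvicorn host to 127.0.0.1 so port 8000 is no longer reachable directly from the internet: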

# Install Nginx
apt install -y nginx

# Create reverse proxy config
cat > /etc/nginx/sites-available/ml-api << 'EOF'
server {
    listen 80;
    server_name YOUR_DOMAIN_OR_IP;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        
        # For streaming responses (LLM inference)
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}
EOF

# Enable site
ln -s /etc/nginx/sites-available/ml-api /etc/nginx/sites-enabled/
nginx -t && systemctl reload nginx

For HTTPS, use Let's Encrypt (YOUR_DOMAIN must already resolve to the instance's IP address):

apt install -y certbot python3-certbot-nginx
certbot --nginx -d YOUR_DOMAIN

Step 7: Production Considerations and Cost Optimization

Auto-shutdown Script

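GPU instances are expensive, so it helps to stop idle machines automatically. Be aware that Vultr continues to bill for powered-off instances; charges stop only when the instance is destroyed. Treat the script below as a safety net, and snapshot and destroy instances you no longer need: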

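#!/bin/bash
# /opt/shutdown-idle-gpu.sh (make it executable: chmod +x /opt/shutdown-idle-gpu.sh)
# Shuts down once the GPU has been continuously idle for IDLE_MINUTES.
IDLE_MINUTES=30
THRESHOLD=5  # GPU utilization %
STATE_FILE=/var/run/gpu-idle-since  # holds the epoch time when idleness began

# Highest utilization across all GPUs (nvidia-smi prints one line per GPU)
UTIL=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | sort -n | tail -1)
case "$UTIL" in ''|*[!0-9]*) exit 0 ;; esac  # query failed: do nothing

NOW=$(date +%s)
if [ "$UTIL" -ge "$THRESHOLD" ]; then
    rm -f "$STATE_FILE"  # GPU is busy: reset the idle timer
    exit 0
fi

# GPU is idle: remember when idleness started, then measure how long it has lasted
[ -f "$STATE_FILE" ] || echo "$NOW" > "$STATE_FILE"
IDLE_TIME=$(( (NOW - $(cat "$STATE_FILE")) / 60 ))
if [ "$IDLE_TIME" -ge "$IDLE_MINUTES" ]; then
    echo "GPU idle for $IDLE_TIME minutes. Shutting down."
    rm -f "$STATE_FILE"
    /usr/sbin/shutdown -h now
fi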

# Add to crontab: */5 * * * * /opt/shutdown-idle-gpu.sh
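An idle check run from cron has to measure how long the GPU has been continuously idle, which is not the same as how long the machine has been up. The sketch below isolates that bookkeeping as a pure function so it can be tested without a GPU. The function name and its defaults (5% threshold, 30 minutes) are assumptions mirroring the values used in this section.

```python
from typing import Optional, Tuple

def update_idle_state(util_pct: int, now: float, idle_since: Optional[float],
                      threshold_pct: int = 5,
                      idle_minutes: int = 30) -> Tuple[Optional[float], bool]:
    """Return (new_idle_since, should_shutdown) for one utilization sample.

    `idle_since` is the timestamp (seconds) when the current idle streak began,
    or None if the GPU was busy at the previous sample.
    """
    if util_pct >= threshold_pct:
        return None, False          # busy: reset the idle timer
    if idle_since is None:
        idle_since = now            # idle streak starts with this sample
    should_shutdown = (now - idle_since) >= idle_minutes * 60
    return idle_since, should_shutdown

# Simulate one sample every few minutes: busy at first, idle from minute 5 onward.
idle_since = None
for minute, util in [(0, 80), (5, 2), (20, 1), (35, 0)]:
    idle_since, stop = update_idle_state(util, minute * 60, idle_since)
    print(minute, stop)
# The final line printed is "35 True": 30 minutes have passed since minute 5.
```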

Monitoring with Prometheus + Grafana

# Install node_exporter for system metrics
apt install -y prometheus-node-exporter

# Install NVIDIA GPU exporter
pip install nvidia-ml-py
curl -LO https://github.com/utkuonur/nvidia-gpu-exporter/releases/download/v1.0.0/gpu_exporter_linux_amd64
chmod +x gpu_exporter_linux_amd64
mv gpu_exporter_linux_amd64 /usr/local/bin/

# Add to systemd service for auto-start
cat > /etc/systemd/system/gpu-exporter.service << 'EOF'
[Unit]
Description=NVIDIA GPU Metrics Exporter
After=network.target

[Service]
ExecStart=/usr/local/bin/gpu_exporter_linux_amd64
Restart=always

[Install]
WantedBy=multi-user.target
EOF

# Reload unit files so systemd picks up the new service, then enable and start it
systemctl daemon-reload
systemctl enable --now gpu-exporter

Backup and Persistence

Store trained models and datasets on Vultr Block Storage for durability:

# Create and attach block storage (requires Vultr dashboard or API)
# The example assumes a 100GB volume attached as /dev/vdb; confirm the device name with lsblk

# WARNING: mkfs erases the volume. Run it only once, on a new, empty volume.
mkfs.ext4 -F /dev/vdb
mkdir -p /mnt/models
mount /dev/vdb /mnt/models

# Add to /etc/fstab for auto-mount on reboot
echo '/dev/vdb /mnt/models ext4 defaults,nofail 0 2' >> /etc/fstab
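Because the fstab entry uses nofail, the instance still boots when the volume is missing, and /mnt/models is then just an empty directory on the root disk. A training script can guard against silently writing checkpoints there. The sketch below is one way to do that with the standard library; the function names are hypothetical and /mnt/models is the mount point used in this section.

```python
import os
import shutil

def volume_ready(mount_point: str) -> bool:
    """True only if `mount_point` is an actual mount, not a plain directory."""
    return os.path.ismount(mount_point)

def save_model_file(src: str, mount_point: str = "/mnt/models") -> str:
    """Copy a model file onto the block storage volume and return the new path.

    Refuses to copy if the volume is not mounted, so checkpoints never land
    on the instance's root disk by accident.
    """
    if not volume_ready(mount_point):
        raise RuntimeError(f"{mount_point} is not mounted; refusing to write")
    dest = os.path.join(mount_point, os.path.basename(src))
    shutil.copy2(src, dest)
    return dest
```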

Troubleshooting Common GPU Setup Issues

Problem: nvidia-smi reports "No devices were found"

Cause: The NVIDIA driver is not loaded, or the GPU is not passed through to the VM.
Solution: Check that the instance type supports GPU, then reboot the instance after driver installation. If the issue persists, destroy the instance and deploy a new GPU instance.

Problem: PyTorch reports "CUDA out of memory" on small batches

Cause: Insufficient VRAM for the combination of model size and batch size.
Solution: Reduce the batch size (for example, batch_size=4), enable gradient checkpointing, or upgrade to a GPU instance with more VRAM.
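Reducing the batch size also changes the effective batch the optimizer sees. A common companion technique, not covered in this guide, is gradient accumulation: run several small micro-batches and step the optimizer once. The arithmetic is sketched below; the function name is hypothetical and the numbers are examples only.

```python
import math

def accumulation_steps(target_batch: int, micro_batch: int) -> int:
    """Number of micro-batches to accumulate to reach at least `target_batch` samples."""
    if target_batch <= 0 or micro_batch <= 0:
        raise ValueError("batch sizes must be positive")
    return math.ceil(target_batch / micro_batch)

# Keeping an effective batch of 32 while only 4 samples fit in VRAM at once:
print(accumulation_steps(32, 4))  # 8
```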

Problem: Model downloads timing out from HuggingFace

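Cause: Slow or restricted network connectivity between the GPU instance and huggingface.co.
Solution: If huggingface.co is slow or blocked from your network (for example, in mainland China), point the client at a mirror. hf-mirror.com is a third-party mirror: export HF_ENDPOINT=https://hf-mirror.com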
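The same setting can be applied from inside a Python program. As far as the client library's behavior goes, the endpoint is read when the library is first imported, so the variable should be set before importing huggingface_hub or transformers. The helper name below is hypothetical, and it deliberately leaves an already exported value untouched.

```python
import os

def configure_hf_endpoint(mirror_url: str) -> str:
    """Set HF_ENDPOINT unless one is already exported; return the effective value."""
    return os.environ.setdefault("HF_ENDPOINT", mirror_url)

# Call this at the very top of the script, before importing transformers.
effective = configure_hf_endpoint("https://hf-mirror.com")
print(effective)
```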

Conclusion

Vultr GPU instances provide enterprise-grade NVIDIA compute at accessible prices. With Ubuntu 22.04, proper driver installation, and a well-configured Python environment, you can run anything from lightweight inference APIs to full model training pipelines.

The key to cost-effective GPU computing on Vultr: take advantage of hourly billing for experimentation, implement auto-shutdown scripts for idle instances, and use block storage for model persistence rather than burning GPU instance hours on data I/O.
