GPU cloud instances have become essential infrastructure for AI and machine learning workloads. Whether you're fine-tuning a large language model, running computer vision inference, or training a recommendation engine — the right GPU can cut your compute costs by 50% or more compared to general-purpose cloud options.

Vultr's GPU lineup puts enterprise-grade accelerators within reach of indie developers and startups. No long-term commitments, per-second billing, and GPU configurations starting under $1/hour. This guide breaks down every GPU option, benchmarks them against real workloads, and shows you exactly how to deploy your first ML model.

Vultr GPU Options: Which Card Do You Need?

Vultr offers three GPU families, each targeting a different use case. Here's the honest breakdown:

| GPU | VRAM | TDP | Best For | Starting Price |
|---|---|---|---|---|
| NVIDIA L40S | 48 GB GDDR6 | 350W | Stable Diffusion, fine-tuning, inference | $0.89/hr |
| NVIDIA A100 | 40 GB HBM2 | 250W | Training, transformers, large models | $1.89/hr |
| NVIDIA H100 | 80 GB HBM3 | 700W | LLM training, frontier AI research | $4.40/hr |

L40S — The Value Champion

The L40S is Vultr's most cost-effective GPU. With 48GB of VRAM, it handles most fine-tuning tasks and image generation workloads without the premium pricing of A100 or H100. It's based on the Ada Lovelace architecture, which means excellent efficiency for inference-heavy tasks. For a solo developer running Stable Diffusion or fine-tuning Mistral 7B, the L40S is the obvious choice — you get more VRAM than the A100 at a lower price point.
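If you want to sanity-check whether a model fits before renting anything, a rough back-of-envelope estimate is enough: fp16 inference needs about 2 bytes per parameter, while full fine-tuning with Adam in mixed precision lands around 16 bytes per parameter before activations. A minimal sketch using those rule-of-thumb numbers (they are approximations, not guarantees):

```python
# Back-of-envelope VRAM estimates; activations and KV cache are excluded.
def inference_vram_gb(params_billion: float) -> float:
    # fp16 inference: ~2 bytes per parameter for the weights
    return params_billion * 2

def full_finetune_vram_gb(params_billion: float) -> float:
    # Mixed-precision full fine-tuning with Adam: fp16 weights + fp16 grads
    # + fp32 master weights + two fp32 optimizer states ≈ 16 bytes/parameter
    return params_billion * 16

print(f"Mistral 7B, fp16 inference:        ~{inference_vram_gb(7):.0f} GB")       # ~14 GB
print(f"Mistral 7B, full fine-tune (Adam): ~{full_finetune_vram_gb(7):.0f} GB")   # ~112 GB
```

In other words, 7B-class inference and LoRA/QLoRA fine-tuning sit comfortably inside 48GB, while full fine-tuning of the same model does not, which is exactly the niche the L40S fills.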

A100 — The Workhorse

The A100 40GB remains the industry standard for a reason. Its 1.6TB/s of memory bandwidth and third-gen Tensor Cores make it exceptional for training medium-sized models. If you're running PyTorch training jobs that span days, the A100's reliability and mature software ecosystem (CUDA, cuDNN, Triton) are hard to beat. Vultr's per-second billing means you can spin up an A100 for a 4-hour training run and pay only for those 4 hours.

H100 — The Frontier Beast

The H100 is for serious compute. With 80GB of HBM3 memory and fourth-gen Tensor Cores with FP8 support, it's the GPU of choice for training GPT-class models and running inference on the largest open-source LLMs like Llama 3 70B. At $4.40/hr, it's not cheap — but compared to buying a single H100 card outright (roughly $30,000–$40,000 before you've built a server around it), cloud access is a no-brainer for anyone who isn't running GPU workloads 24/7.
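To make the rent-vs-buy math concrete, here's a quick sketch using the hourly prices above; the ~$35,000 H100 purchase price is an assumption for illustration only:

```python
# Quick cost math using the hourly prices from the table above.
RATES = {"L40S": 0.89, "A100 40GB": 1.89, "H100 80GB": 4.40}  # USD per hour

# A typical short job: a 4-hour A100 training run
print(f"4-hour A100 run: ${4 * RATES['A100 40GB']:.2f}")

# How many rented hours before buying an H100 would have been cheaper?
H100_PURCHASE = 35_000  # assumed street price for a single card
breakeven_hours = H100_PURCHASE / RATES["H100 80GB"]
print(f"H100 break-even: {breakeven_hours:,.0f} GPU-hours "
      f"(~{breakeven_hours / 8760:.1f} years of 24/7 use, ignoring power and hosting)")
```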

Step 1: Deploy a Vultr GPU Instance

GPU instances are available in Vultr's Cloud Compute and High Performance Compute lines. Here's how to get one running in under 5 minutes.

```bash
# Using Vultr CLI (recommended for automation)
vultr-cli instance create \
  --region=sjc \
  --plan=gpu-vcgi-40g-a100-nvidia \
  --os=Ubuntu-22.04 \
  --script-file=setup.sh \
  --label=ml-gpu-prod

# Or via the dashboard:
# Compute → Deploy New Instance → GPU → Choose L40S, A100, or H100
# Select Ubuntu 22.04 LTS
# Enable Private Networking for multi-instance clusters
```
Regions with GPU availability: New Jersey (ewr), San Jose (sjc), Seattle (sea), Chicago (ord), Los Angeles (lax), Miami (mia), Atlanta (atl), Tokyo (nrt). Not all GPU types are available in every region — check the Vultr dashboard for real-time availability.
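If you'd rather script that discovery than click around the dashboard, the CLI can list plans and regions. A rough sketch; the grep pattern assumes GPU plan IDs contain "gpu", as in the example above:

```bash
# List plans and filter for GPU plan IDs
vultr-cli plans list | grep -i gpu

# List region codes to pass to --region
vultr-cli regions list
```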

Step 2: Install CUDA and Docker

Modern ML frameworks (PyTorch, TensorFlow, JAX) all rely on CUDA for NVIDIA GPU acceleration. Vultr's GPU Ubuntu images come with NVIDIA drivers pre-installed, but you'll need to set up the CUDA Toolkit and Docker for containerized ML workloads.

```bash
# Check if NVIDIA drivers are installed
nvidia-smi  # Should output GPU info if drivers are loaded

# Install CUDA Toolkit 12.x (required for PyTorch 2.x)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-4

# Add CUDA to PATH
echo 'export PATH=/usr/local/cuda-12.4/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

# Verify CUDA installation
nvcc --version
```
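Once nvcc is in place, it's worth confirming that PyTorch itself can see the card. A minimal check, assuming you've installed torch on the host (the PyTorch Docker images used later already ship with it):

```python
# gpu_check.py - confirm PyTorch can see the GPU (assumes `pip install torch`)
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("Device:", props.name)
    print("VRAM:", round(props.total_memory / 1e9, 1), "GB")
    # Run a small matmul on the GPU to confirm kernels actually execute
    x = torch.randn(1024, 1024, device="cuda")
    print("Matmul OK:", (x @ x).shape)
```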

Set Up Docker with NVIDIA Container Toolkit

For reproducible ML environments, run your models inside Docker containers with GPU passthrough:

```bash
# Install Docker
curl -fsSL https://get.docker.com | sh
sudo systemctl enable docker

# Install NVIDIA Container Toolkit (add NVIDIA's container toolkit repo)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -sL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit

# Configure Docker to use the NVIDIA runtime by default
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Test GPU access from inside a container
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```
Important: Always use container images built for a CUDA version your driver supports. Running CUDA 12.x containers on a host whose driver only supports CUDA 11.x can cause runtime errors. Stick to nvidia/cuda:12.x.x-base-ubuntu22.04 as the base for PyTorch and TensorFlow containers.
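A quick way to see what your host actually supports: the nvidia-smi header reports the highest CUDA runtime the installed driver can handle, and nvcc reports the toolkit you installed in the previous step:

```bash
# Highest CUDA runtime the installed driver supports
nvidia-smi | grep "CUDA Version"

# CUDA toolkit version installed on the host in Step 2
nvcc --version | grep release
```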

Step 3: Deploy a PyTorch Model on Vultr GPU

Let's put the GPU to work with a real example: running inference with a fine-tuned Mistral 7B model for a chatbot backend.

Pull a GPU-Optimized PyTorch Container

```bash
# Create project directory
mkdir ml-server && cd ml-server

# Create Dockerfile
cat > Dockerfile << 'EOF'
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

WORKDIR /app

# Install inference dependencies
RUN pip install transformers accelerate bitsandbytes scipy \
    fastapi uvicorn sentencepiece

COPY server.py .

EXPOSE 8000
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
EOF
```

Build the FastAPI Server

```python
# server.py - FastAPI ML inference server
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()

# Load model and tokenizer (downloads on first run)
MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"

print("Loading model...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map="auto",
    load_in_4bit=True,  # Quantize to fit in 40GB
)
print("Model loaded successfully!")


class InferenceRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 256
    temperature: float = 0.7


@app.post("/infer")
def infer(request: InferenceRequest):
    inputs = tokenizer(request.prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=request.max_new_tokens,
        temperature=request.temperature,
        do_sample=True,
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return {"response": response, "model": MODEL_NAME}


@app.get("/health")
def health():
    return {"status": "ok", "gpu": torch.cuda.get_device_name(0)}


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
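One note on the quantization flag: newer transformers releases prefer an explicit BitsAndBytesConfig over passing load_in_4bit directly to from_pretrained (the shortcut still works but is deprecated in recent versions). The equivalent load looks like this, with the rest of server.py unchanged:

```python
# Equivalent 4-bit load with an explicit quantization config (newer transformers API)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # store weights in 4-bit, compute in fp16
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map="auto",
)
```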

Build, Run, and Test

```bash
# Build the container
docker build -t ml-inference-server .

# Run with GPU access
docker run -d \
  --name ml-server \
  --restart unless-stopped \
  --gpus all \
  -p 8000:8000 \
  ml-inference-server

# Test the API
curl -X POST http://localhost:8000/infer \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain the difference between a GPU and a CPU in simple terms.", "max_new_tokens": 200}'

# Check GPU utilization
nvidia-smi
```

With 4-bit quantization, Mistral 7B fits comfortably on a single A100 40GB. For larger models like Llama 3 70B, you'd need an H100 80GB or multi-GPU setup with tensor parallelism.

Real-World Benchmark: Vultr GPU vs AWS

How does Vultr's GPU pricing stack up against AWS EC2? Here's a direct comparison for a training job that takes 8 hours:

| Provider | GPU | Price/hr | 8hr Cost | VRAM |
|---|---|---|---|---|
| Vultr | A100 40GB | $1.89 | $15.12 | 40GB |
| AWS p4d.24xlarge | A100 40GB x8 | $32.77 | $262.16 | 320GB total |
| Vultr | H100 80GB | $4.40 | $35.20 | 80GB |
| AWS p5.48xlarge | H100 80GB x8 | $98.32 | $786.56 | 640GB total |

Vultr's single-GPU instances crush AWS on price-per-GPU. For distributed training requiring multiple GPUs, AWS's 8-GPU nodes have an advantage in NVLink bandwidth — but for the vast majority of models, a single Vultr H100 or A100 handles the job at a fraction of the cost.


Optimize GPU Utilization for Inference

Running a GPU at 10% utilization is money down the drain. Here's how to maximize throughput:

```bash
# Example: Run vLLM for high-throughput inference
docker run -d \
  --name vllm-server \
  --gpus all \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9
```
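vLLM exposes an OpenAI-compatible API, so once that container is up you can talk to it with the standard openai client pointed at your instance. A quick sketch, assuming pip install openai on the client side:

```python
# Query the vLLM container through its OpenAI-compatible endpoint
# (vLLM doesn't check the API key by default)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Summarize what continuous batching does."}],
    max_tokens=200,
)
print(resp.choices[0].message.content)
```

vLLM's continuous batching is what keeps utilization high when many requests arrive at once, which is the whole point of this section.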

Cost Optimization Strategies

GPU compute is expensive if you waste it. These strategies will keep your bill under control:

Use Spot/Preemptible Instances

Vultr instances can be stopped and restarted on demand. For fault-tolerant training jobs, implement checkpointing so you can resume from the last saved state after any interruption:

```python
# PyTorch checkpointing example
import torch

checkpoint_interval = 500  # Save every 500 steps

def save_checkpoint(model, optimizer, step, loss):
    checkpoint = {
        'step': step,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }
    torch.save(checkpoint, f'checkpoint-step-{step}.pt')
    print(f"Checkpoint saved at step {step}")

# During training:
for step, batch in enumerate(dataloader):
    # ... forward/backward pass ...
    if step % checkpoint_interval == 0:
        save_checkpoint(model, optimizer, step, loss)
```
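To actually benefit from those checkpoints, load the newest one at startup and skip the steps already completed. A sketch that continues the example above (model, optimizer, and dataloader are the same objects as in your training script):

```python
# Resume after an interruption: load the newest checkpoint before the training loop
import glob
import os
import torch

start_step = 0
checkpoints = sorted(glob.glob("checkpoint-step-*.pt"), key=os.path.getmtime)
if checkpoints:
    ckpt = torch.load(checkpoints[-1], map_location="cuda")
    model.load_state_dict(ckpt["model_state_dict"])
    optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    start_step = ckpt["step"] + 1
    print(f"Resumed from step {ckpt['step']} (loss {ckpt['loss']:.4f})")

for step, batch in enumerate(dataloader):
    if step < start_step:
        continue  # skip work that was already checkpointed
    # ... forward/backward pass ...
```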

Auto-Shutdown with Watchdog

```bash
#!/bin/bash
# watchdog.sh - shut down instance if no GPU activity for 30 minutes

IDLE_THRESHOLD=1800   # 30 minutes in seconds
LOG_FILE=/var/log/gpu_idle.log
IDLE_TIME=0

while true; do
  GPU_UTIL=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits)
  if [ "$GPU_UTIL" -lt 1 ]; then
    IDLE_TIME=$((IDLE_TIME + 60))
    if [ $IDLE_TIME -ge $IDLE_THRESHOLD ]; then
      echo "$(date): GPU idle for ${IDLE_THRESHOLD}s. Shutting down." >> $LOG_FILE
      sudo shutdown -h now
    fi
  else
    IDLE_TIME=0
  fi
  sleep 60
done
```
SSH keepalive: If you're using the GPU interactively over SSH, set ServerAliveInterval 60 in your SSH config so long-lived sessions don't drop. Note that the watchdog only looks at GPU utilization, not SSH activity, so raise IDLE_THRESHOLD or stop the script while you're exploring with the GPU mostly idle.
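To keep the watchdog running across reboots, wrap it in a simple systemd service. The paths below are illustrative; adjust them to wherever you saved watchdog.sh:

```bash
# Install the script and register it as a systemd service
sudo install -m 755 watchdog.sh /usr/local/bin/gpu-watchdog.sh

sudo tee /etc/systemd/system/gpu-watchdog.service << 'EOF'
[Unit]
Description=Shut down the instance when the GPU has been idle too long
After=multi-user.target

[Service]
ExecStart=/usr/local/bin/gpu-watchdog.sh
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now gpu-watchdog.service
```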

Start Your First GPU Instance Today

Deploy an L40S, A100, or H100 in minutes. Get $100 free credit when you sign up — no credit card required.

Claim Vultr Free Credit →

Conclusion

Vultr's GPU cloud gives you access to enterprise-grade accelerators without the enterprise price tag. For most ML workloads, an L40S or A100 is the sweet spot between cost and capability. The H100 is reserved for frontier AI research and LLM training at scale.

Per-second billing means you pay only for what you use — a massive advantage over AWS and GCP for development and experimentation where GPU time is intermittent. Spin up an instance, deploy your model, benchmark it, and shut it down when you're done.

Get started with Vultr GPU instances and $100 in free credit.
