GPU cloud instances have become essential infrastructure for AI and machine learning workloads. Whether you're fine-tuning a large language model, running computer vision inference, or training a recommendation engine — the right GPU can cut your compute costs by 50% or more compared to general-purpose cloud options.

Vultr's GPU lineup puts enterprise-grade accelerators within reach of indie developers and startups. No long-term commitments, per-second billing, and GPU configurations starting under $1/hour. This guide breaks down every GPU option, benchmarks them against real workloads, and shows you exactly how to deploy your first ML model.

Vultr GPU Options: Which Card Do You Need?

Vultr offers three GPU families, each targeting a different use case. Here's the honest breakdown:

| GPU | VRAM | TDP | Best For | Starting Price |
|---|---|---|---|---|
| NVIDIA L40S | 48 GB GDDR6 | 350W | Stable Diffusion, fine-tuning, inference | $0.89/hr |
| NVIDIA A100 | 40 GB HBM2 | 250W | Training, transformers, large models | $1.89/hr |
| NVIDIA H100 | 80 GB HBM3 | 700W | LLM training, frontier AI research | $4.40/hr |

L40S — The Value Champion

The L40S is Vultr's most cost-effective GPU. With 48GB of VRAM, it handles most fine-tuning tasks and image generation workloads without the premium pricing of A100 or H100. It's based on the Ada Lovelace architecture, which means excellent efficiency for inference-heavy tasks. For a solo developer running Stable Diffusion or fine-tuning Mistral 7B, the L40S is the obvious choice — you get more VRAM than the A100 at a lower price point.
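If you want to sanity-check whether a model fits before renting anything, a rough back-of-envelope estimate is enough: fp16 inference needs about 2 bytes per parameter, while full fine-tuning with Adam in mixed precision lands around 16 bytes per parameter before activations. A minimal sketch using those rule-of-thumb numbers (they are approximations, not guarantees):

```python
# Back-of-envelope VRAM estimates; activations and KV cache are excluded.
def inference_vram_gb(params_billion: float) -> float:
    # fp16 inference: ~2 bytes per parameter for the weights
    return params_billion * 2

def full_finetune_vram_gb(params_billion: float) -> float:
    # Mixed-precision full fine-tuning with Adam: fp16 weights + fp16 grads
    # + fp32 master weights + two fp32 optimizer states ≈ 16 bytes/parameter
    return params_billion * 16

print(f"Mistral 7B, fp16 inference:        ~{inference_vram_gb(7):.0f} GB")       # ~14 GB
print(f"Mistral 7B, full fine-tune (Adam): ~{full_finetune_vram_gb(7):.0f} GB")   # ~112 GB
```

In other words, 7B-class inference and LoRA/QLoRA fine-tuning sit comfortably inside 48GB, while full fine-tuning of the same model does not, which is exactly the niche the L40S fills.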

A100 — The Workhorse

The A100 40GB remains the industry standard for a reason. Its 1.6TB/s of memory bandwidth and third-gen Tensor Cores make it exceptional for training medium-sized models. If you're running PyTorch training jobs that span days, the A100's reliability and mature software ecosystem (CUDA, cuDNN, Triton) are hard to beat. Vultr's per-second billing means you can spin up an A100 for a 4-hour training run and pay only for those 4 hours.

H100 — The Frontier Beast

The H100 is for serious compute. With 80GB of HBM3 memory and fourth-gen Tensor Cores with FP8 support, it's the GPU of choice for training GPT-class models and running inference on the largest open-source LLMs like Llama 3 70B. At $4.40/hr, it's not cheap — but compared to buying a single H100 card outright (roughly $30,000–$40,000 before you've built a server around it), cloud access is a no-brainer for anyone who isn't running GPU workloads 24/7.
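To make the rent-vs-buy math concrete, here's a quick sketch using the hourly prices above; the ~$35,000 H100 purchase price is an assumption for illustration only:

```python
# Quick cost math using the hourly prices from the table above.
RATES = {"L40S": 0.89, "A100 40GB": 1.89, "H100 80GB": 4.40}  # USD per hour

# A typical short job: a 4-hour A100 training run
print(f"4-hour A100 run: ${4 * RATES['A100 40GB']:.2f}")

# How many rented hours before buying an H100 would have been cheaper?
H100_PURCHASE = 35_000  # assumed street price for a single card
breakeven_hours = H100_PURCHASE / RATES["H100 80GB"]
print(f"H100 break-even: {breakeven_hours:,.0f} GPU-hours "
      f"(~{breakeven_hours / 8760:.1f} years of 24/7 use, ignoring power and hosting)")
```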

Step 1: Deploy a Vultr GPU Instance

GPU instances are available in Vultr's Cloud Compute and High Performance Compute lines. Here's how to get one running in under 5 minutes.

```bash
# Using Vultr CLI (recommended for automation)
vultr-cli instance create \
  --region=sjc \
  --plan=gpu-vcgi-40g-a100-nvidia \
  --os=Ubuntu-22.04 \
  --script-file=setup.sh \
  --label=ml-gpu-prod

# Or via the dashboard:
# Compute → Deploy New Instance → GPU → Choose L40S, A100, or H100
# Select Ubuntu 22.04 LTS
# Enable Private Networking for multi-instance clusters
```
Regions with GPU availability: New Jersey (ewr), San Jose (sjc), Seattle (sea), Chicago (ord), Los Angeles (lax), Miami (mia), Atlanta (atl), Tokyo (nrt). Not all GPU types are available in every region — check the Vultr dashboard for real-time availability.
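If you'd rather script that discovery than click around the dashboard, the CLI can list plans and regions. A rough sketch; the grep pattern assumes GPU plan IDs contain "gpu", as in the example above:

```bash
# List plans and filter for GPU plan IDs
vultr-cli plans list | grep -i gpu

# List region codes to pass to --region
vultr-cli regions list
```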

Step 2: Install CUDA and Docker

Modern ML frameworks (PyTorch, TensorFlow, JAX) all rely on CUDA for NVIDIA GPU acceleration. Vultr's GPU Ubuntu images come with NVIDIA drivers pre-installed, but you'll need to set up the CUDA Toolkit and Docker for containerized ML workloads.

```bash
# Check if NVIDIA drivers are installed
nvidia-smi  # Should output GPU info if drivers are loaded

# Install CUDA Toolkit 12.x (required for PyTorch 2.x)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install -y cuda-toolkit-12-4

# Add CUDA to PATH
echo 'export PATH=/usr/local/cuda-12.4/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

# Verify CUDA installation
nvcc --version
```
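Once nvcc is in place, it's worth confirming that PyTorch itself can see the card. A minimal check, assuming you've installed torch on the host (the PyTorch Docker images used later already ship with it):

```python
# gpu_check.py - confirm PyTorch can see the GPU (assumes `pip install torch`)
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("Device:", props.name)
    print("VRAM:", round(props.total_memory / 1e9, 1), "GB")
    # Run a small matmul on the GPU to confirm kernels actually execute
    x = torch.randn(1024, 1024, device="cuda")
    print("Matmul OK:", (x @ x).shape)
```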

Set Up Docker with NVIDIA Container Toolkit

For reproducible ML environments, run your models inside Docker containers with GPU passthrough:

```bash
# Install Docker
curl -fsSL https://get.docker.com | sh
sudo systemctl enable docker

# Install NVIDIA Container Toolkit (add NVIDIA's container toolkit repo)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -sL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit

# Configure Docker to use the NVIDIA runtime by default
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Test GPU access from inside a container
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```
Important: Always use container images built for a CUDA version your driver supports. Running CUDA 12.x containers on a host whose driver only supports CUDA 11.x can cause runtime errors. Stick to nvidia/cuda:12.x.x-base-ubuntu22.04 as the base for PyTorch and TensorFlow containers.
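A quick way to see what your host actually supports: the nvidia-smi header reports the highest CUDA runtime the installed driver can handle, and nvcc reports the toolkit you installed in the previous step:

```bash
# Highest CUDA runtime the installed driver supports
nvidia-smi | grep "CUDA Version"

# CUDA toolkit version installed on the host in Step 2
nvcc --version | grep release
```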

Step 3: Deploy a PyTorch Model on Vultr GPU

Let's put the GPU to work with a real example: running inference with a fine-tuned Mistral 7B model for a chatbot backend.

Pull a GPU-Optimized PyTorch Container

```bash
# Create project directory
mkdir ml-server && cd ml-server

# Create Dockerfile
cat > Dockerfile << 'EOF'
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime

WORKDIR /app

# Install inference dependencies
RUN pip install transformers accelerate bitsandbytes scipy \
    fastapi uvicorn sentencepiece

COPY server.py .

EXPOSE 8000
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
EOF
```

Build the FastAPI Server

```python
# server.py - FastAPI ML inference server
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()

# Load model and tokenizer (downloads on first run)
MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"

print("Loading model...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map="auto",
    load_in_4bit=True,  # Quantize to fit in 40GB
)
print("Model loaded successfully!")


class InferenceRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 256
    temperature: float = 0.7


@app.post("/infer")
def infer(request: InferenceRequest):
    inputs = tokenizer(request.prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=request.max_new_tokens,
        temperature=request.temperature,
        do_sample=True,
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return {"response": response, "model": MODEL_NAME}


@app.get("/health")
def health():
    return {"status": "ok", "gpu": torch.cuda.get_device_name(0)}


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
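One note on the quantization flag: newer transformers releases prefer an explicit BitsAndBytesConfig over passing load_in_4bit directly to from_pretrained (the shortcut still works but is deprecated in recent versions). The equivalent load looks like this, with the rest of server.py unchanged:

```python
# Equivalent 4-bit load with an explicit quantization config (newer transformers API)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # store weights in 4-bit, compute in fp16
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map="auto",
)
```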

Build, Run, and Test

```bash
# Build the container
docker build -t ml-inference-server .

# Run with GPU access
docker run -d \
  --name ml-server \
  --restart unless-stopped \
  --gpus all \
  -p 8000:8000 \
  ml-inference-server

# Test the API
curl -X POST http://localhost:8000/infer \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain the difference between a GPU and a CPU in simple terms.", "max_new_tokens": 200}'

# Check GPU utilization
nvidia-smi
```

With 4-bit quantization, Mistral 7B fits comfortably on a single A100 40GB. For larger models like Llama 3 70B, you'd need an H100 80GB or multi-GPU setup with tensor parallelism.

Real-World Benchmark: Vultr GPU vs AWS

How does Vultr's GPU pricing stack up against AWS EC2? Here's a direct comparison for a training job that takes 8 hours:

| Provider | GPU | Price/hr | 8hr Cost | VRAM |
|---|---|---|---|---|
| Vultr | A100 40GB | $1.89 | $15.12 | 40GB |
| AWS p4d.24xlarge | A100 40GB x8 | $32.77 | $262.16 | 320GB total |
| Vultr | H100 80GB | $4.40 | $35.20 | 80GB |
| AWS p5.48xlarge | H100 80GB x8 | $98.32 | $786.56 | 640GB total |

Vultr's single-GPU instances crush AWS on price-per-GPU. For distributed training requiring multiple GPUs, AWS's 8-GPU nodes have an advantage in NVLink bandwidth — but for the vast majority of models, a single Vultr H100 or A100 handles the job at a fraction of the cost.


Optimize GPU Utilization for Inference

Running a GPU at 10% utilization is money down the drain. Here's how to maximize throughput:

```bash
# Example: Run vLLM for high-throughput inference
docker run -d \
  --name vllm-server \
  --gpus all \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9
```
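vLLM exposes an OpenAI-compatible API, so once that container is up you can talk to it with the standard openai client pointed at your instance. A quick sketch, assuming pip install openai on the client side:

```python
# Query the vLLM container through its OpenAI-compatible endpoint
# (vLLM doesn't check the API key by default)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Summarize what continuous batching does."}],
    max_tokens=200,
)
print(resp.choices[0].message.content)
```

vLLM's continuous batching is what keeps utilization high when many requests arrive at once, which is the whole point of this section.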

Cost Optimization Strategies

GPU compute is expensive if you waste it. These strategies will keep your bill under control:

Use Spot/Preemptible Instances

Vultr instances can be stopped and restarted on demand. For fault-tolerant training jobs, implement checkpointing so you can resume from the last saved state after any interruption:

```python
# PyTorch checkpointing example
import torch

checkpoint_interval = 500  # Save every 500 steps

def save_checkpoint(model, optimizer, step, loss):
    checkpoint = {
        'step': step,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }
    torch.save(checkpoint, f'checkpoint-step-{step}.pt')
    print(f"Checkpoint saved at step {step}")

# During training:
for step, batch in enumerate(dataloader):
    # ... forward/backward pass ...
    if step % checkpoint_interval == 0:
        save_checkpoint(model, optimizer, step, loss)
```
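To actually benefit from those checkpoints, load the newest one at startup and skip the steps already completed. A sketch that continues the example above (model, optimizer, and dataloader are the same objects as in your training script):

```python
# Resume after an interruption: load the newest checkpoint before the training loop
import glob
import os
import torch

start_step = 0
checkpoints = sorted(glob.glob("checkpoint-step-*.pt"), key=os.path.getmtime)
if checkpoints:
    ckpt = torch.load(checkpoints[-1], map_location="cuda")
    model.load_state_dict(ckpt["model_state_dict"])
    optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    start_step = ckpt["step"] + 1
    print(f"Resumed from step {ckpt['step']} (loss {ckpt['loss']:.4f})")

for step, batch in enumerate(dataloader):
    if step < start_step:
        continue  # skip work that was already checkpointed
    # ... forward/backward pass ...
```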

Auto-Shutdown with Watchdog

```bash
#!/bin/bash
# watchdog.sh - shut down instance if no GPU activity for 30 minutes

IDLE_THRESHOLD=1800   # 30 minutes in seconds
LOG_FILE=/var/log/gpu_idle.log
IDLE_TIME=0

while true; do
  GPU_UTIL=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits)
  if [ "$GPU_UTIL" -lt 1 ]; then
    IDLE_TIME=$((IDLE_TIME + 60))
    if [ $IDLE_TIME -ge $IDLE_THRESHOLD ]; then
      echo "$(date): GPU idle for ${IDLE_THRESHOLD}s. Shutting down." >> $LOG_FILE
      sudo shutdown -h now
    fi
  else
    IDLE_TIME=0
  fi
  sleep 60
done
```
SSH keepalive: If you're using the GPU interactively over SSH, set ServerAliveInterval 60 in your SSH config so long-lived sessions don't drop. Note that the watchdog only looks at GPU utilization, not SSH activity, so raise IDLE_THRESHOLD or stop the script while you're exploring with the GPU mostly idle.
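To keep the watchdog running across reboots, wrap it in a simple systemd service. The paths below are illustrative; adjust them to wherever you saved watchdog.sh:

```bash
# Install the script and register it as a systemd service
sudo install -m 755 watchdog.sh /usr/local/bin/gpu-watchdog.sh

sudo tee /etc/systemd/system/gpu-watchdog.service << 'EOF'
[Unit]
Description=Shut down the instance when the GPU has been idle too long
After=multi-user.target

[Service]
ExecStart=/usr/local/bin/gpu-watchdog.sh
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now gpu-watchdog.service
```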

Start Your First GPU Instance Today

Deploy an L40S, A100, or H100 in minutes. Get $100 free credit when you sign up — no credit card required.

Claim Vultr Free Credit →

Conclusion

Vultr's GPU cloud gives you access to enterprise-grade accelerators without the enterprise price tag. For most ML workloads, an L40S or A100 is the sweet spot between cost and capability. The H100 is reserved for frontier AI research and LLM training at scale.

Per-second billing means you pay only for what you use — a massive advantage over AWS and GCP for development and experimentation where GPU time is intermittent. Spin up an instance, deploy your model, benchmark it, and shut it down when you're done.

Get started with Vultr GPU instances and $100 in free credit.
