From zero to running Stable Diffusion or Llama inference in under 20 minutes
Training or running inference on large language models, diffusion models, and neural networks requires serious compute. Traditional CPU instances choke on anything beyond basic ML; that's where GPU cloud instances change everything.
Vultr has emerged as the go-to choice for developers who need GPU power without the AWS/Azure price shock. Their GPU instances start at $0.04/hr for an NVIDIA T4, with A100 and H100 options for serious workloads. No long-term commitment, no surprise bills.
Whether you're running recommendation models, building a local Llama chatbot, or generating images with Stable Diffusion, Vultr's bare-metal GPU instances deliver the throughput you need at a fraction of the hyperscaler cost.
Here's the full picture of Vultr's GPU lineup as of 2026:
| GPU | vCPU | RAM | Storage | Price/hr | Best For |
|---|---|---|---|---|---|
| NVIDIA T4 | 4 | 16 GB | 200 GB SSD | $0.04 | Light inference, whisper, small models |
| NVIDIA L4 | 8 | 32 GB | 400 GB NVMe | $0.11 | Stable Diffusion, mid-size LLMs |
| NVIDIA A100 1GPU | 16 | 64 GB | 512 GB NVMe | $0.22 | LLM inference, fine-tuning, training |
| NVIDIA A100 4GPU | 32 | 256 GB | 1 TB NVMe | $0.89 | Multi-GPU training, large model serving |
| NVIDIA H100 | 32 | 128 GB | 1 TB NVMe | $0.77 | Cutting-edge AI research, frontier models |
Compared with AWS (see the Vultr vs AWS comparison 2026), Vultr GPU pricing is 40-60% cheaper for equivalent raw compute. For example, an AWS p3.2xlarge (V100) runs $3.06/hr, while a Vultr A100 costs $0.22/hr.
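To put the hourly prices in the table above into perspective, here is a rough sketch that converts them into an always-on monthly cost. The 730 hours per month figure is an assumption (an average month), and the helper name `monthly_cost` is purely illustrative.

```shell
# Hourly prices are copied from the table above.
# 730 is an assumed average number of hours in a month.
HOURS_PER_MONTH=730

monthly_cost() {
  # $1 = price per hour in USD
  awk -v p="$1" -v h="$HOURS_PER_MONTH" 'BEGIN { printf "%.2f\n", p * h }'
}

for entry in "T4:0.04" "L4:0.11" "A100-1GPU:0.22" "A100-4GPU:0.89" "H100:0.77"; do
  name="${entry%%:*}"    # text before the colon
  price="${entry##*:}"   # text after the colon
  printf '%-10s $%s/hr  ~$%s/month\n' "$name" "$price" "$(monthly_cost "$price")"
done
```

At these list prices a T4 left running all month comes to about $29, while an H100 comes to about $562.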
Sign up at vultr.com and navigate to the Compute section. If you're new, the $100 free credit on the startup program is a great way to test GPU instances risk-free.
Click "Add Instance", then select the Cloud GPU tab. Choose:
Add additional block storage if you're working with large model files (Stable Diffusion checkpoints can be 5-10 GB each). Use the default 20Gbps network, which is more than enough for inference serving.
After deployment (typically 2-3 minutes), you'll receive root SSH credentials. Connect:
ssh root@your-vultr-ip
# You'll see Ubuntu 22.04 loading on first boot
Fresh Ubuntu doesn't come with NVIDIA drivers. Here's a short setup:
# Add the Ubuntu graphics-drivers PPA (community-maintained NVIDIA driver packages)
apt update && apt install -y software-properties-common
add-apt-repository -y ppa:graphics-drivers/ppa
apt update
# Install drivers + CUDA toolkit
apt install -y nvidia-driver-545 nvidia-cuda-toolkit
# Verify installation
nvidia-smi
You should see a table showing your GPU, VRAM, temperature, and power draw. This confirms the card is recognized and drivers are working.
If nvidia-smi fails, reboot first: reboot. The new kernel module is loaded at boot, so the driver may not be active until the instance restarts.
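For scripted checks, the full nvidia-smi table is awkward to parse. A minimal sketch, assuming you only need the GPU name, total VRAM, and driver version: the `--query-gpu` and `--format` options are standard nvidia-smi flags, while the helper names and the sample CSV line are illustrative only.

```shell
# Print name, total VRAM and driver version for each GPU as CSV.
gpu_summary() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: driver not installed or reboot pending" >&2
    return 1
  fi
  nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader
}

# Extract the VRAM figure (in MiB) from one CSV line.
# The line used below is a made-up sample, not captured output.
vram_mib() {
  printf '%s\n' "$1" | awk -F', ' '{ split($2, parts, " "); print parts[1] }'
}

vram_mib "Tesla T4, 15360 MiB, 545.23.08"
```

On the instance itself, `gpu_summary` would supply the real line to feed into `vram_mib`.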
Perfect for image generation, fine-tuning, and running custom checkpoints.
# Install Python and dependencies
apt install -y python3.10-venv git
git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git
cd stable-diffusion-webui
# Download a model (SDXL base ~6.5GB)
# Use huggingface_hub or directly from civitai
# Launch (runs on port 7860); -f lets webui.sh run as root, which it refuses by default
./webui.sh -f --listen --xformers
Once running, access the WebUI at http://your-vultr-ip:7860. With an L4 GPU, you can generate 1024×1024 images at ~8 seconds per step.
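The model download step is left as a comment in the commands above. One possible way to fill it in, to be run from the stable-diffusion-webui directory before launching, is sketched here. It assumes the SDXL base checkpoint published in the stabilityai/stable-diffusion-xl-base-1.0 repository on Hugging Face; the helper names `hf_file_url` and `download_sdxl` are illustrative.

```shell
# Build the direct-download URL for a file in a Hugging Face model repo.
hf_file_url() {
  # $1 = repo id, $2 = file name
  printf 'https://huggingface.co/%s/resolve/main/%s\n' "$1" "$2"
}

# Download SDXL base into the WebUI checkpoint folder.
# -c resumes a partial download, -P sets the target directory.
download_sdxl() {
  wget -c -P models/Stable-diffusion \
    "$(hf_file_url stabilityai/stable-diffusion-xl-base-1.0 sd_xl_base_1.0.safetensors)"
}

# Show the URL that download_sdxl would fetch.
hf_file_url stabilityai/stable-diffusion-xl-base-1.0 sd_xl_base_1.0.safetensors
```

Calling `download_sdxl` on the instance fetches the file; the checkpoint then appears in the WebUI model dropdown after a refresh.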
Ollama is the fastest way to get LLM inference running. No Docker, no complicated setup.
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run Llama 3 8B (uses ~6GB VRAM on T4)
ollama run llama3
# Or Mistral 7B
ollama run mistral
# Run a 4-bit quantized 70B model (far less VRAM than FP16, but still roughly 40 GB)
ollama run llama3:70b-instruct-q4_0
After ollama run llama3, you get an interactive chat prompt in your terminal. For API access, Ollama runs a REST server on port 11434 by default.
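As a rough illustration of the API access mentioned above, the sketch below posts a prompt to Ollama's `/api/generate` endpoint with curl. The helper names are illustrative, and the payload builder is deliberately naive: it does not escape quotes inside the prompt.

```shell
# Build the JSON body for Ollama's /api/generate endpoint.
# "stream": false asks for one complete JSON reply instead of a token stream.
ollama_payload() {
  # $1 = model name, $2 = prompt (must not contain double quotes)
  printf '{"model": "%s", "prompt": "%s", "stream": false}\n' "$1" "$2"
}

# Send a prompt to the local Ollama server and print the raw JSON reply.
ollama_generate() {
  curl -s http://localhost:11434/api/generate -d "$(ollama_payload "$1" "$2")"
}

# Show the request body; run ollama_generate on the GPU instance itself.
ollama_payload llama3 "Why is the sky blue?"
```

On the instance, `ollama_generate llama3 "Why is the sky blue?"` returns a JSON object whose `response` field holds the generated text.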
| Metric | T4 (16GB VRAM) | L4 (24GB VRAM) |
|---|---|---|
| Tokens/sec (FP16) | 18 t/s | 42 t/s |
| Time to first token | 1.8s | 0.9s |
| Max context (tokens) | 512 | 2048 |
| Price/hr | $0.04 | $0.11 |
The L4 delivers 2.3x the throughput for 2.75x the cost: slightly less throughput per dollar than the T4, but half the time to first token, which is a reasonable trade for serious workloads.
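The throughput-versus-cost trade-off can be checked with a little arithmetic on the table's figures. This sketch computes tokens generated per dollar as tokens/sec × 3600 / price per hour; the helper name `tokens_per_dollar` is illustrative.

```shell
# Tokens generated per dollar = tokens/sec * 3600 / price per hour.
tokens_per_dollar() {
  # $1 = tokens per second, $2 = price per hour in USD
  awk -v t="$1" -v p="$2" 'BEGIN { printf "%.0f\n", t * 3600 / p }'
}

echo "T4: $(tokens_per_dollar 18 0.04) tokens per dollar"
echo "L4: $(tokens_per_dollar 42 0.11) tokens per dollar"
```

With these numbers the T4 yields about 1.62 million tokens per dollar and the L4 about 1.37 million, so the L4's advantage is latency and headroom rather than raw cost per token.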
Running GPU workloads 24/7 gets expensive. Here's how smart developers cut costs:
Vultr's flexible instances can be 70% cheaper than on-demand. The trade-off is that your job can be interrupted, so design for checkpoint/resume or reserve these instances for batch processing.
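A minimal sketch of the checkpoint/resume idea for batch work, assuming each job has a unique name and can safely be rerun if it was interrupted midway. The function name, the checkpoint file format, and the `echo "run ..."` placeholder are all illustrative.

```shell
# Process each job once. Finished jobs are recorded in a checkpoint file,
# so a restarted instance skips work that already completed.
run_batch() {
  # $1 = checkpoint file, remaining args = job names
  ckpt="$1"; shift
  touch "$ckpt"
  for job in "$@"; do
    if grep -qxF -- "$job" "$ckpt"; then
      echo "skip $job"
      continue
    fi
    echo "run $job"          # placeholder for the real GPU work
    echo "$job" >> "$ckpt"   # mark as done only after the work finished
  done
}

ckpt_file="$(mktemp)"
run_batch "$ckpt_file" img-001 img-002            # first run does both jobs
run_batch "$ckpt_file" img-001 img-002 img-003    # rerun only does img-003
rm -f "$ckpt_file"
```

For this to survive an interruption, the checkpoint file needs to live on storage that outlasts the instance, such as attached block storage.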
# In Vultr cloud-init userdata field:
#cloud-config
runcmd:
  # Power off 6 hours (360 minutes) after first boot
  - shutdown -P +360
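This powers the instance off six hours after boot, so a forgotten $0.04/hr experiment doesn't keep running for a month. Vultr still bills powered-off instances, so destroy the instance once you've saved your results.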
Use 4-bit or 8-bit quantized versions of LLMs. They fit in smaller GPU memory while sacrificing minimal accuracy. Q4_K_M quantization of Llama 3 70B runs on a single A100 80GB instead of needing 4 GPUs.
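The memory saving follows from simple arithmetic: weight memory is roughly parameter count × bits per weight / 8. The sketch below applies that to a 70B model. It is a lower bound only, since it ignores the KV cache and runtime overhead, and Q4_K_M in practice averages somewhat more than 4 bits per weight. The helper name `weights_gb` is illustrative.

```shell
# Approximate weight memory in GB: parameters (billions) * bits per weight / 8.
weights_gb() {
  # $1 = parameters in billions, $2 = bits per weight
  awk -v n="$1" -v b="$2" 'BEGIN { printf "%.1f\n", n * b / 8 }'
}

echo "70B at FP16:  $(weights_gb 70 16) GB"
echo "70B at 8-bit: $(weights_gb 70 8) GB"
echo "70B at 4-bit: $(weights_gb 70 4) GB"
```

At FP16 the weights alone need about 140 GB, which is why multiple GPUs are required, while the 4-bit version at about 35 GB leaves room for context on a single 80 GB card.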
Don't keep a GPU instance running waiting for requests. Use a queue (Redis + FastAPI) and spin up instances on-demand via Vultr's API. Scale to zero when idle.
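As a rough sketch of the on-demand part, the functions below create and destroy an instance through the Vultr API v2 with curl. The region, plan, and os_id values shown are placeholders to look up for your account, `VULTR_API_KEY` is assumed to be set in the environment, and the function names are illustrative. The queue worker logic that decides when to call them is not shown.

```shell
API="https://api.vultr.com/v2"

# JSON body for instance creation.
create_payload() {
  # $1 = region, $2 = plan id, $3 = numeric os_id, $4 = label
  printf '{"region": "%s", "plan": "%s", "os_id": %s, "label": "%s"}\n' "$1" "$2" "$3" "$4"
}

# Create an instance; the JSON reply includes the new instance id.
create_instance() {
  curl -s -X POST "$API/instances" \
    -H "Authorization: Bearer $VULTR_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$(create_payload "$@")"
}

# Destroy an instance so it stops accruing charges.
destroy_instance() {
  # $1 = instance id returned by create_instance
  curl -s -X DELETE "$API/instances/$1" \
    -H "Authorization: Bearer $VULTR_API_KEY"
}

# Show the request body with placeholder values (no API call is made here).
create_payload "REGION_ID" "GPU_PLAN_ID" 0 "gpu-worker"
```

A worker process could call `create_instance` when the queue has pending jobs and `destroy_instance` once it drains, which is the scale-to-zero behavior described above.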
| Provider | Cheapest GPU | A100/hr | H100/hr | API Support |
|---|---|---|---|---|
| Vultr | T4 @ $0.04 | $0.22 | $0.77 | Full REST API |
| AWS | T4 @ $0.35 | $1.54 | $2.00 | EC2 + SageMaker |
| Paperspace | T4 @ $0.20 | $0.60 | $1.20 | Core API |
| Lambda Labs | T4 @ $0.04 | $0.30 | N/A | Full API |
Vultr wins on price-to-performance for both budget (T4) and premium (H100) tiers. AWS charges a premium for ecosystem integration you may not need.
Get started with Vultr's GPU cloud, from $0.04/hr with global availability and no commitments.