How to Deploy an ML Model on Vultr: Complete Production Guide 2026
You've trained your machine learning model. Now comes the real challenge: deploying it to production where it can serve predictions at scale. Whether you're building an API for inference or serving a real-time prediction endpoint, Vultr's GPU instances provide the compute power you need at a fraction of cloud giant prices.
This guide walks through deploying a trained ML model on Vultr—from preparing your model artifact to building a production-ready API with Flask or FastAPI, containerizing with Docker, and configuring for high availability.
Why Deploy ML Models on Vultr?
Before diving into the technical steps, let's cover why Vultr makes sense for ML deployment:
- GPU-Powered Inference: NVIDIA A100 and H100 GPUs accelerate inference by 10-50x compared to CPU-only solutions
- Per-Hour Billing: Pay only for what you use—ideal for variable traffic patterns
- Global Locations: Deploy close to your users with 25+ data centers worldwide
- Competitive Pricing: Starting at $0.018/hour for GPU instances—significantly cheaper than AWS or GCP
For context, an AWS g4dn.xlarge GPU instance costs approximately $0.526/hour, nearly 30x the entry price quoted above, although the two offerings may not be equivalent in GPU capacity.
Prerequisites
Before starting, ensure you have:
- A trained model (PyTorch .pt, TensorFlow .h5, or ONNX format)
- Vultr account with GPU instance deployed
- SSH access to your Vultr instance
- Basic Python knowledge
Step 1: Prepare Your Model for Deployment
First, prepare your trained model for serving. We recommend the ONNX format for maximum compatibility:
Convert to ONNX (PyTorch Example)
import torch

# Load your trained model (YourModelClass is a placeholder for your own architecture)
model = YourModelClass()
model.load_state_dict(torch.load("model.pt", map_location="cpu"))
model.eval()

# Define dummy input matching your model's input shape
dummy_input = torch.randn(1, 3, 224, 224)

# Export to ONNX; dynamic_axes lets the batch dimension vary at inference time
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}},
)
Serving through ONNX Runtime is often faster than running the model in eager PyTorch, and the exported file is portable across platforms and languages. You can also keep your model in native PyTorch format if you prefer.
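Before serving the exported file, it can be worth confirming that it reproduces the PyTorch outputs on the same input. The sketch below shows one way to make that comparison, assuming both outputs have already been flattened into plain lists of floats. The helper name `outputs_match` and its tolerances are illustrative choices, not part of PyTorch or ONNX Runtime.

```python
import math


def outputs_match(reference, candidate, rel_tol=1e-3, abs_tol=1e-5):
    """Return True if two flat sequences of floats agree elementwise.

    Hypothetical helper: the tolerances are illustrative defaults, kept
    loose because exported models rarely match bit-for-bit.
    """
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )


# Stand-in values; in practice these would come from something like
# model(dummy_input).detach().flatten().tolist() and the ONNX Runtime output.
pytorch_out = [0.1200, -1.5000, 3.2000]
onnx_out = [0.1201, -1.5001, 3.2001]
print(outputs_match(pytorch_out, onnx_out))
```

If the check fails, the usual suspects are a model left in training mode during export or a mismatch in input preprocessing.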
Step 2: Set Up Python Environment
Connect to your Vultr GPU instance via SSH and set up the Python environment:
# Update system
sudo apt update && sudo apt upgrade -y
# Install Python and virtual environment
sudo apt install python3-venv python3-pip -y
# Create virtual environment
python3 -m venv ml-env
source ml-env/bin/activate
# Install dependencies
pip install torch torchvision onnxruntime-gpu flask gunicorn pillow numpy
For CUDA acceleration, install onnxruntime-gpu instead of the CPU-only version. This leverages your Vultr GPU for inference.
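Installing the GPU package does not by itself guarantee that inference runs on the GPU: if the CUDA provider cannot be used, ONNX Runtime falls back to the next provider in the list. The sketch below shows one way to make that choice explicit. `choose_providers` and `gpu_enabled` are hypothetical helpers, and the `available` list is a stand-in for what `onnxruntime.get_available_providers()` would return on your instance.

```python
PREFERRED = ["CUDAExecutionProvider", "CPUExecutionProvider"]


def choose_providers(available, preferred=PREFERRED):
    """Keep the preferred providers that are actually available, in order."""
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError(f"No usable execution provider in {available}")
    return chosen


def gpu_enabled(available):
    """True when the CUDA provider would be selected."""
    return "CUDAExecutionProvider" in choose_providers(available)


# In the real environment: available = onnxruntime.get_available_providers()
available = ["CPUExecutionProvider"]  # stand-in for a machine without CUDA
print(choose_providers(available), gpu_enabled(available))
```

After creating a session, `session.get_providers()` reports which providers that session is actually using, which is a useful line to log at startup.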
Step 3: Create Flask API for Inference
Create a Flask application to serve predictions:
# app.py
import io

import numpy as np
import onnxruntime as ort
from flask import Flask, request, jsonify
from PIL import Image

app = Flask(__name__)

# Load the model once at startup; falls back to CPU if CUDA is unavailable
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)


def preprocess_image(image_bytes):
    """Preprocess input image for model inference."""
    img = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    img = img.resize((224, 224))
    img_array = np.array(img).astype(np.float32) / 255.0
    img_array = np.transpose(img_array, (2, 0, 1))  # HWC -> CHW
    img_array = np.expand_dims(img_array, axis=0)  # add batch dimension
    return img_array


@app.route("/predict", methods=["POST"])
def predict():
    """Prediction endpoint."""
    if "image" not in request.files:
        return jsonify({"error": "No image provided"}), 400
    image_bytes = request.files["image"].read()

    # Preprocess and predict
    input_data = preprocess_image(image_bytes)
    output = session.run(None, {"input": input_data})

    # Assumes the model outputs raw logits of shape (1, num_classes);
    # softmax turns them into probabilities so "confidence" lies in [0, 1]
    logits = output[0][0]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return jsonify({
        "prediction": int(probs.argmax()),
        "confidence": float(probs.max()),
    })


@app.route("/health", methods=["GET"])
def health():
    """Health check endpoint."""
    return jsonify({"status": "healthy"})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
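The endpoint above hands whatever it receives straight to Pillow. A cheap check on size and file signature before decoding can reject obviously bad uploads early. This is a minimal sketch: `validate_upload`, `MAX_UPLOAD_BYTES`, and the choice to accept only JPEG and PNG are assumptions for illustration, not requirements of Flask or ONNX Runtime.

```python
MAX_UPLOAD_BYTES = 5 * 1024 * 1024  # assumed limit; tune for your images

# Leading bytes ("magic numbers") of the formats this sketch accepts
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
}


def validate_upload(image_bytes):
    """Return (ok, detail): detail is the detected format or an error message."""
    if not image_bytes:
        return False, "empty upload"
    if len(image_bytes) > MAX_UPLOAD_BYTES:
        return False, "file too large"
    for fmt, magic in SIGNATURES.items():
        if image_bytes.startswith(magic):
            return True, fmt
    return False, "unsupported image format"


print(validate_upload(b"\xff\xd8\xff\xe0" + b"\x00" * 16))
print(validate_upload(b"not an image"))
```

Inside `predict()`, you could call this before `preprocess_image` and return a 400 response carrying the error message when the check fails. A matching signature does not prove the file decodes cleanly, so wrapping the Pillow call in error handling is still worthwhile.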
Step 4: Test Your Deployment
Run the Flask application locally to test:
python app.py
Test with a sample image using curl:
curl -X POST -F "image=@test.jpg" http://localhost:5000/predict
You should receive a JSON response with prediction and confidence scores.
Step 5: Containerize with Docker
For production deployment, containerize your application:
# Dockerfile
FROM python:3.10-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
    libgomp1 \
    && rm -rf /var/lib/apt/lists/*
# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy model and application
COPY model.onnx .
COPY app.py .
# Expose port
EXPOSE 5000
# Run with Gunicorn for production
CMD ["gunicorn", "--bind", "0.0.0.0:5000", "--workers", "4", "app:app"]

# requirements.txt (a separate file next to the Dockerfile)
torch
onnxruntime-gpu
flask
gunicorn
Pillow
numpy

# Build and run the Docker container
# (--gpus all requires the NVIDIA Container Toolkit on the host)
docker build -t ml-model-serving .
docker run -d --gpus all -p 5000:5000 ml-model-serving
Step 6: Configure Production Server
For handling production traffic, set up Gunicorn with multiple workers:
# Install Gunicorn if not already installed
pip install gunicorn
# Run with 4 worker processes, each with 2 threads
gunicorn --bind 0.0.0.0:5000 --workers 4 --threads 2 app:app
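The same settings can live in a config file instead of on the command line. Gunicorn config files are plain Python, and recent versions pick up a `gunicorn.conf.py` in the working directory (or you can pass it with `-c`). The sketch below mirrors the flags above; the `timeout` value and the logging choices are assumptions to adjust for your workload.

```python
# gunicorn.conf.py: a sketch mirroring the command-line flags above
bind = "0.0.0.0:5000"
workers = 4      # each worker process loads its own copy of the model
threads = 2      # threads per worker
timeout = 60     # seconds before an unresponsive worker is restarted (assumed value)
accesslog = "-"  # write access logs to stdout
errorlog = "-"   # write error logs to stderr
```

With 4 workers and 2 threads each, up to 8 requests can be in flight at once. Because every worker creates its own inference session, GPU memory use grows with the worker count, so a large model may call for fewer workers than CPU-bound web apps typically use.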
For high availability, consider:
- Load Balancing: Use nginx as a reverse proxy
- Auto-scaling: Deploy behind Kubernetes for automatic scaling
- Monitoring: Set up Prometheus metrics for inference latency
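As a starting point for the monitoring item above, inference latency can be captured with a small decorator before any metrics stack is in place. This sketch keeps samples in memory and uses only the standard library; `timed`, `p95`, and `fake_inference` are hypothetical names, and a real deployment would export the numbers to Prometheus rather than hold them in a list.

```python
import statistics
import time
from functools import wraps

latencies_ms = []  # in-memory store; a real setup would export these


def timed(func):
    """Record the wall-clock duration of each call in milliseconds."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            latencies_ms.append((time.perf_counter() - start) * 1000)
    return wrapper


def p95(samples):
    """95th percentile of the samples; needs at least two of them."""
    return statistics.quantiles(samples, n=100)[94]


@timed
def fake_inference(x):
    # Stand-in for session.run(...)
    return x * 2


for i in range(20):
    fake_inference(i)
print(f"calls={len(latencies_ms)} p95={p95(latencies_ms):.4f} ms")
```

Tracking a high percentile rather than the mean is a deliberate choice here: a few slow requests are what users notice, and an average hides them.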
Step 7: Set Up Nginx Reverse Proxy
Configure nginx for better performance and security:
sudo apt install nginx -y
# Create nginx configuration
# (the quoted 'EOF' stops the shell from expanding nginx variables such as $host)
sudo tee /etc/nginx/sites-available/ml-api << 'EOF'
server {
    listen 80;
    server_name your-domain.com;
    location / {
        proxy_pass http://localhost:5000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
EOF
sudo ln -s /etc/nginx/sites-available/ml-api /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl restart nginx
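After restarting nginx, it helps to confirm that requests actually reach the app through the proxy. The sketch below polls the `/health` route with retries using only the standard library. The URL, attempt count, and delay are assumptions, and the demo at the bottom uses a fake check so it runs without a server.

```python
import time
from urllib.request import urlopen


def wait_until_healthy(check, attempts=5, delay=1.0):
    """Call check() until it returns True; return the attempt number or None."""
    for attempt in range(1, attempts + 1):
        try:
            if check():
                return attempt
        except OSError:
            pass  # connection refused, timeout, HTTP errors, etc.
        if attempt < attempts:
            time.sleep(delay)
    return None


def http_health_check(url="http://localhost/health"):
    """Probe the /health route through nginx (the URL is an assumption)."""
    with urlopen(url, timeout=3) as response:
        return response.status == 200


# Demo with a fake check that succeeds on the third try
responses = iter([False, False, True])
print(wait_until_healthy(lambda: next(responses), delay=0))
```

On the server itself you would call `wait_until_healthy(http_health_check)` instead of the fake check.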
Production Best Practices
When deploying ML models in production, consider these best practices:
- Model Versioning: Use a model registry to track versions
- Input Validation: Validate input shapes and types before inference
- Rate Limiting: Prevent abuse with rate limiting
- Logging: Track inference requests for debugging
- Metrics: Monitor latency, throughput, and errors
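To make the rate limiting item concrete, here is a minimal token bucket, one common way to cap request rates while still allowing short bursts. It is a sketch under stated assumptions: the `TokenBucket` class, its parameters, and the example numbers are illustrative, and the state lives in a single process, so each Gunicorn worker would enforce its own limit.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Consume one token if available; return whether the request may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=5, capacity=10)
results = [bucket.allow() for _ in range(12)]
print(results.count(True), "allowed,", results.count(False), "rejected")
```

In the Flask app you might keep one bucket per client IP and return HTTP 429 when `allow()` is false. For a limit shared across all workers, nginx's `limit_req` directive is an alternative worth considering.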
Cost Optimization
To minimize costs on Vultr:
- Use Spot Instances: For non-critical workloads, spot or preemptible capacity, where it is offered, can save up to 70%
- Auto-shutdown: Take a snapshot and destroy instances during off-hours, since Vultr continues to bill instances that are only powered off
- Right-size Resources: Start with smaller GPU instances and scale as needed
- Monitor Usage: Track spending with Vultr's built-in billing alerts
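A quick estimate shows how much the off-hours item above can matter. The sketch below compares an always-on instance with one that exists for 10 hours a day, using the entry price quoted earlier in this guide. The 730-hour month and the 10-hour schedule are assumptions, and the calculation assumes you are billed only for hours in which the instance exists.

```python
HOURS_PER_MONTH = 730  # average month; an approximation


def monthly_cost(hourly_rate, hours_per_day=24.0, days_per_month=HOURS_PER_MONTH / 24):
    """Estimated monthly cost in USD for an instance billed by the hour."""
    return hourly_rate * hours_per_day * days_per_month


rate = 0.018  # entry GPU price quoted earlier in this guide, USD/hour
always_on = monthly_cost(rate)
business_hours = monthly_cost(rate, hours_per_day=10)
savings = 1 - business_hours / always_on
print(f"always on: ${always_on:.2f}, 10 h/day: ${business_hours:.2f}, saved: {savings:.0%}")
```

Swap in the hourly rate of the GPU plan you actually run; the relative saving depends only on the schedule, not on the price.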
Alternative: FastAPI for Better Performance
For even better performance, consider FastAPI:
# python-multipart is required for FastAPI file uploads
pip install fastapi uvicorn python-multipart
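# app_fastapi.py
# Run with: uvicorn app_fastapi:app --host 0.0.0.0 --port 5000
import io

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI, File, UploadFile
from PIL import Image

app = FastAPI()

# Load model
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)


def preprocess_image(image_bytes):
    """Same preprocessing as the Flask example."""
    img = Image.open(io.BytesIO(image_bytes)).convert("RGB").resize((224, 224))
    img_array = np.array(img).astype(np.float32) / 255.0
    return np.expand_dims(np.transpose(img_array, (2, 0, 1)), axis=0)


@app.post("/predict")
async def predict(image: UploadFile = File(...)):
    """FastAPI prediction endpoint."""
    image_bytes = await image.read()
    input_data = preprocess_image(image_bytes)
    output = session.run(None, {"input": input_data})
    return {"prediction": int(output[0].argmax())}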
FastAPI provides automatic documentation, better async support, and faster request handling compared to Flask.
Conclusion
Deploying ML models on Vultr is straightforward when you follow these steps. From model preparation to containerized deployment, Vultr's GPU instances provide the compute power you need at competitive prices.
Key takeaways:
- Convert your model to ONNX for better performance
- Use GPU-enabled ONNX Runtime for accelerated inference
- Containerize with Docker for easy deployment
- Set up nginx for production traffic handling
- Monitor costs and scale as needed
Ready to deploy your ML model? Get started with Vultr GPU instances and start serving predictions in minutes.
For more tutorials, check out our guide on Vultr GPU Instances for AI Development.