You've trained your machine learning model. Now comes the real challenge: deploying it to production where it can serve predictions at scale. Whether you're building an API for inference or serving a real-time prediction endpoint, Vultr's GPU instances provide the compute power you need at a fraction of cloud giant prices.

This guide walks through deploying a trained ML model on Vultr—from preparing your model artifact to building a production-ready API with Flask or FastAPI, containerizing with Docker, and configuring for high availability.

Why Deploy ML Models on Vultr?

Before diving into the technical steps, let's cover why Vultr makes sense for ML deployment:

  • GPU-Powered Inference: NVIDIA A100 and H100 GPUs can accelerate inference by roughly 10-50x over CPU-only serving, depending on the model, batch size, and numeric precision
  • Per-Hour Billing: Pay only for what you use—ideal for variable traffic patterns
  • Global Locations: Deploy close to your users with 25+ data centers worldwide
  • Competitive Pricing: Starting at $0.018/hour for GPU instances—significantly cheaper than AWS or GCP

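For context, an AWS g4dn.xlarge costs approximately $0.526/hour on demand, nearly 30x Vultr's starting price. Entry-level GPU prices may correspond to a share of a GPU rather than a full card, so compare GPU model and memory like-for-like before treating the two as equivalent compute.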
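The ratio quoted above is plain division of the two hourly rates. The sketch below makes that arithmetic easy to re-run with current prices. The rates are the ones quoted in this article rather than live pricing, and the 730 hours-per-month figure and function names are assumptions for illustration.

```python
# Hourly rates as quoted in this article (not live pricing).
VULTR_GPU_HOURLY = 0.018
AWS_G4DN_XLARGE_HOURLY = 0.526

HOURS_PER_MONTH = 730  # assumption: average month with the instance always on


def price_ratio(expensive: float, cheap: float) -> float:
    """Return how many times larger one hourly rate is than another."""
    return expensive / cheap


def monthly_cost(hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Return the cost of running an instance for the given number of hours."""
    return hourly_rate * hours


ratio = price_ratio(AWS_G4DN_XLARGE_HOURLY, VULTR_GPU_HOURLY)
print(f"ratio: {ratio:.1f}x")
print(f"Vultr monthly: ${monthly_cost(VULTR_GPU_HOURLY):.2f}")
print(f"AWS monthly:   ${monthly_cost(AWS_G4DN_XLARGE_HOURLY):.2f}")
```

The ratio only says something about value when both rates buy comparable GPU capacity.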

Prerequisites

Before starting, ensure you have:

  • A trained model (PyTorch .pt, TensorFlow .h5, or ONNX format)
  • Vultr account with GPU instance deployed
  • SSH access to your Vultr instance
  • Basic Python knowledge

Step 1: Prepare Your Model for Deployment

First, prepare your trained model for serving. We recommend the ONNX format for maximum compatibility:

Convert to ONNX (PyTorch Example)

import torch
import torch.onnx as torch_onnx

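# Load your trained model (YourModelClass is a placeholder for your own nn.Module)
model = YourModelClass()
# map_location="cpu" lets the export run even if the weights were saved on a GPU
model.load_state_dict(torch.load("model.pt", map_location="cpu"))
model.eval()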

# Define dummy input matching your model's input shape
dummy_input = torch.randn(1, 3, 224, 224)

# Export to ONNX
torch_onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}}
)

An ONNX model can be served with ONNX Runtime, which is often faster than eager-mode PyTorch for inference and runs on many platforms. You can also keep your model in PyTorch format if you prefer native support.
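After an export, it is worth confirming that the ONNX model reproduces the PyTorch outputs for the same input. The helper below is a minimal sketch of that comparison using NumPy only. The name `outputs_match` and the tolerance are assumptions. In practice `reference` would come from something like `model(dummy_input).detach().numpy()` and `candidate` from the first element of an ONNX Runtime session's `run` result.

```python
import numpy as np


def outputs_match(reference, candidate, atol=1e-4):
    """Compare two model outputs; return (ok, max_abs_diff)."""
    ref = np.asarray(reference, dtype=np.float32)
    cand = np.asarray(candidate, dtype=np.float32)
    if ref.shape != cand.shape:
        # A shape mismatch usually means the export changed the output layout.
        return False, float("inf")
    max_diff = float(np.max(np.abs(ref - cand)))
    return max_diff <= atol, max_diff


# Stand-in arrays; replace with real PyTorch and ONNX Runtime outputs.
pytorch_out = np.array([[0.10, 2.30, -1.20]], dtype=np.float32)
onnx_out = pytorch_out + 1e-6
ok, diff = outputs_match(pytorch_out, onnx_out)
print(ok, diff)
```

Small differences are expected because the two runtimes use different kernels. A large difference usually points to a preprocessing or export problem.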

Step 2: Set Up Python Environment

Connect to your Vultr GPU instance via SSH and set up the Python environment:

# Update system
sudo apt update && sudo apt upgrade -y

# Install Python and virtual environment
sudo apt install python3-venv python3-pip -y

# Create virtual environment
python3 -m venv ml-env
source ml-env/bin/activate

# Install dependencies
pip install torch torchvision onnxruntime-gpu flask gunicorn pillow numpy

For CUDA acceleration, install onnxruntime-gpu instead of the CPU-only version. This leverages your Vultr GPU for inference.
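Installing the GPU package does not guarantee that inference runs on the GPU, because ONNX Runtime falls back to the CPU provider when CUDA cannot be loaded. The sketch below shows one way to decide which providers to request and to detect a silent fallback. The helper names are hypothetical. On the instance, the `available` list would come from `onnxruntime.get_available_providers()`.

```python
PREFERRED_PROVIDERS = ["CUDAExecutionProvider", "CPUExecutionProvider"]


def select_providers(available, preferred=PREFERRED_PROVIDERS):
    """Keep the preferred providers that are actually available, in order."""
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise RuntimeError(f"None of {preferred} are available; got {available}")
    return chosen


def uses_gpu(providers):
    """True when the highest-priority provider is the CUDA one."""
    return bool(providers) and providers[0] == "CUDAExecutionProvider"


# Stand-in values; on the instance use onnxruntime.get_available_providers().
cpu_only = select_providers(["CPUExecutionProvider"])
with_gpu = select_providers(["CUDAExecutionProvider", "CPUExecutionProvider"])
print(cpu_only, uses_gpu(cpu_only))
print(with_gpu, uses_gpu(with_gpu))
```

After creating the session, calling its `get_providers()` method shows which providers were actually loaded.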

Step 3: Create Flask API for Inference

Create a Flask application to serve predictions:

# Create app.py
from flask import Flask, request, jsonify
import torch
import onnxruntime as ort
import numpy as np
from PIL import Image
import io

app = Flask(__name__)

# Load model
session = ort.InferenceSession(
    "model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

def preprocess_image(image_bytes):
    """Preprocess input image for model inference."""
    img = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    img = img.resize((224, 224))
    img_array = np.array(img).astype(np.float32) / 255.0
    img_array = np.transpose(img_array, (2, 0, 1))
    img_array = np.expand_dims(img_array, axis=0)
    return img_array

@app.route("/predict", methods=["POST"])
def predict():
    """Prediction endpoint."""
    if "image" not in request.files:
        return jsonify({"error": "No image provided"}), 400
    
    image_file = request.files["image"]
    image_bytes = image_file.read()
    
    # Preprocess and predict
    input_data = preprocess_image(image_bytes)
    output = session.run(None, {"input": input_data})
    
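    # Convert raw scores to probabilities with a numerically stable softmax
    # (assumes the model outputs logits), then return the top class
    logits = output[0][0]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    prediction = probs.argmax()
    confidence = float(probs.max())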
    
    return jsonify({
        "prediction": int(prediction),
        "confidence": confidence
    })

@app.route("/health", methods=["GET"])
def health():
    """Health check endpoint."""
    return jsonify({"status": "healthy"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
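The endpoint above passes whatever bytes it receives straight to the image decoder. A small check before preprocessing can reject empty, oversized, or non-image uploads early. This is a minimal sketch using only the standard library. The 5 MiB limit, the accepted formats, and the name `validate_upload` are assumptions, not part of the original application.

```python
MAX_UPLOAD_BYTES = 5 * 1024 * 1024  # assumption: 5 MiB cap per image

# Leading bytes ("magic numbers") of the formats accepted in this sketch.
IMAGE_SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
}


def validate_upload(data: bytes, max_bytes: int = MAX_UPLOAD_BYTES):
    """Return (format, None) if the upload is accepted, else (None, error)."""
    if not data:
        return None, "Empty upload"
    if len(data) > max_bytes:
        return None, f"Upload exceeds {max_bytes} bytes"
    for fmt, signature in IMAGE_SIGNATURES.items():
        if data.startswith(signature):
            return fmt, None
    return None, "Unsupported image format"


print(validate_upload(b"\xff\xd8\xff\xe0 fake jpeg payload"))
print(validate_upload(b"not an image"))
```

In the Flask handler, this would run on `image_bytes` before `preprocess_image`, returning a 400 response with the error message when the format is `None`.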

Step 4: Test Your Deployment

Run the Flask application locally to test:

python app.py

Test with a sample image using curl:

curl -X POST -F "image=@test.jpg" http://localhost:5000/predict

You should receive a JSON response with prediction and confidence scores.

Step 5: Containerize with Docker

For production deployment, containerize your application:

# Create Dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    libgomp1 \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and application
COPY model.onnx .
COPY app.py .

# Expose port
EXPOSE 5000

# Run with Gunicorn for production
CMD ["gunicorn", "--bind", "0.0.0.0:5000", "--workers", "4", "app:app"]

# Create requirements.txt
torch
onnxruntime-gpu
flask
gunicorn
Pillow
numpy

# Build and run Docker container (--gpus all requires the NVIDIA Container Toolkit on the host)
docker build -t ml-model-serving .
docker run -d --gpus all -p 5000:5000 ml-model-serving

Step 6: Configure Production Server

For handling production traffic, set up Gunicorn with multiple workers:

# Install Gunicorn if not already installed
pip install gunicorn

# Run 4 worker processes with 2 threads each; every worker loads its own copy of the model
gunicorn --bind 0.0.0.0:5000 --workers 4 --threads 2 app:app
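Because each Gunicorn worker is a separate process with its own copy of the model, the worker count on a GPU instance is limited by GPU memory as well as by CPU cores. The sketch below combines the (2 x cores) + 1 rule of thumb from the Gunicorn documentation with a memory bound. The function name, the per-worker memory estimate, and the 1 GB reserve are assumptions to adjust for your model.

```python
import os


def worker_count(gpu_memory_gb, per_worker_gb, cpu_count=None, reserve_gb=1.0):
    """Pick a worker count bounded by both CPU cores and GPU memory."""
    cpu_count = cpu_count or os.cpu_count() or 1
    by_cpu = 2 * cpu_count + 1  # Gunicorn's suggested starting point
    # Each worker holds its own model copy, so GPU memory caps the count.
    by_gpu = int((gpu_memory_gb - reserve_gb) // per_worker_gb)
    return max(1, min(by_cpu, by_gpu))


# Example: 16 GB of GPU memory, about 3 GB per worker, 4 CPU cores.
print(worker_count(gpu_memory_gb=16, per_worker_gb=3, cpu_count=4))
```

The result can be passed to the `--workers` flag in place of the fixed value of 4.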

For high availability, consider:

  • Load Balancing: Use nginx as a reverse proxy
  • Auto-scaling: Deploy on Kubernetes for automatic scaling
  • Monitoring: Set up Prometheus metrics for inference latency
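Before wiring up a full metrics stack, inference latency can be tracked in-process. The sketch below keeps a rolling window of timings and reports nearest-rank percentiles. The class name `LatencyTracker` and the window size are assumptions, and this is a stand-in for a real Prometheus client rather than a replacement for one.

```python
import math
import time
from collections import deque


class LatencyTracker:
    """Keep the most recent latencies and report nearest-rank percentiles."""

    def __init__(self, window=1000):
        self._samples = deque(maxlen=window)

    def record(self, seconds):
        self._samples.append(seconds)

    def timed(self, func, *args, **kwargs):
        """Call func, recording how long it took even if it raises."""
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            self.record(time.perf_counter() - start)

    def percentile(self, pct):
        """Return the nearest-rank percentile, or None with no samples."""
        if not self._samples:
            return None
        ordered = sorted(self._samples)
        rank = max(1, math.ceil(pct * len(ordered) / 100))
        return ordered[rank - 1]


tracker = LatencyTracker()
for ms in range(1, 101):  # synthetic latencies of 1 ms to 100 ms
    tracker.record(ms / 1000)
print(tracker.percentile(50), tracker.percentile(95), tracker.percentile(99))
```

In the prediction handler, wrapping the inference call as `tracker.timed(session.run, None, {"input": input_data})` would record one sample per request.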

Step 7: Set Up Nginx Reverse Proxy

Configure nginx for better performance and security:

sudo apt install nginx -y

# Create nginx configuration
# Quote the heredoc delimiter so the shell does not expand nginx variables such as $host
sudo tee /etc/nginx/sites-available/ml-api << 'EOF'
server {
    listen 80;
    server_name your-domain.com;

    location / {
        proxy_pass http://localhost:5000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
EOF

sudo ln -s /etc/nginx/sites-available/ml-api /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl restart nginx

Production Best Practices

When deploying ML models in production, consider these best practices:

  • Model Versioning: Use a model registry to track versions
  • Input Validation: Validate input shapes and types before inference
  • Rate Limiting: Prevent abuse with rate limiting
  • Logging: Track inference requests for debugging
  • Metrics: Monitor latency, throughput, and errors
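Of these practices, rate limiting is the one most easily shown in a few lines. The token bucket below is a minimal single-process sketch. The class name, rate, and capacity are assumptions, and a deployment with several Gunicorn workers would need shared state (or rate limiting in nginx) for a global limit.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost=1.0):
        """Consume `cost` tokens if available; return whether the call may proceed."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that passed, never exceeding the capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# Demonstration with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
bucket = TokenBucket(rate=1.0, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(3)]  # third request exceeds the burst
fake_now[0] += 1.0                          # one second later, one token is back
after_wait = bucket.allow()
print(burst, after_wait)
```

A handler would typically keep one bucket per client key (for example, per IP address) and return HTTP 429 when `allow()` is false.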

Cost Optimization

To minimize costs on Vultr:

  • Use Spot Instances: For non-critical workloads, spot or preemptible capacity, where your plan offers it, can save up to 70%
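  • Auto-shutdown: Turn off instances during off-hours. Check whether a stopped instance is still billed on your plan; if it is, snapshot and destroy the instance instead, then redeploy from the snapshot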
  • Right-size Resources: Start with smaller GPU instances and scale as needed
  • Monitor Usage: Track spending with Vultr's built-in billing alerts
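The effect of running an instance only during working hours can be estimated with simple arithmetic. The sketch below assumes billing actually stops while the instance is off, uses the hourly rate quoted earlier in this article, and introduces hypothetical function names. The schedule of 12 hours a day, 5 days a week is only an example.

```python
HOURS_PER_WEEK = 24 * 7


def scheduled_fraction(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of the week the instance is running under the schedule."""
    return (hours_per_day * days_per_week) / HOURS_PER_WEEK


def scheduled_monthly_cost(hourly_rate: float, fraction_on: float = 1.0,
                           hours_per_month: float = 730) -> float:
    """Monthly cost when the instance runs for `fraction_on` of the time."""
    return hourly_rate * hours_per_month * fraction_on


rate = 0.018  # hourly rate quoted in this article, not live pricing
fraction = scheduled_fraction(hours_per_day=12, days_per_week=5)
always_on = scheduled_monthly_cost(rate)
scheduled = scheduled_monthly_cost(rate, fraction)
savings_pct = 100 * (1 - scheduled / always_on)
print(f"on {fraction:.1%} of the time, saving {savings_pct:.1f}%")
```

The saving depends only on the schedule, not on the hourly rate, so the same percentage applies to larger GPU instances.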

Alternative: FastAPI for Better Performance

For even better performance, consider FastAPI:

# python-multipart is required for file uploads (UploadFile)
pip install fastapi uvicorn python-multipart

# app_fastapi.py (run with: uvicorn app_fastapi:app --host 0.0.0.0 --port 5000)
from fastapi import FastAPI, UploadFile, File
import onnxruntime as ort
import numpy as np
from PIL import Image
import io

app = FastAPI()

# Load model
session = ort.InferenceSession(
    "model.onnx",
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)

@app.post("/predict")
async def predict(image: UploadFile = File(...)):
    """FastAPI prediction endpoint."""
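    image_bytes = await image.read()
    # Same preprocessing as the Flask example: RGB, 224x224, scaled to [0, 1], NCHW
    img = Image.open(io.BytesIO(image_bytes)).convert("RGB").resize((224, 224))
    input_data = np.transpose(np.asarray(img, dtype=np.float32) / 255.0, (2, 0, 1))
    input_data = np.expand_dims(input_data, axis=0)
    output = session.run(None, {"input": input_data})
    return {"prediction": int(output[0].argmax())}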

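FastAPI provides automatic OpenAPI documentation, request validation, and native async support. It is often faster than Flask at request handling, although for most models inference time dominates end-to-end latency.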
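One caveat with async handlers is that a blocking call such as `session.run` stalls the event loop while it executes. FastAPI runs endpoints declared with plain `def` in a thread pool, which avoids this. Inside an `async def` handler, the blocking call can be offloaded explicitly instead. The sketch below illustrates the second option with the standard library only. `blocking_inference` is a hypothetical stand-in for the real inference call.

```python
import asyncio
import time


def blocking_inference(x):
    """Stand-in for session.run(...); sleeps to mimic inference latency."""
    time.sleep(0.05)
    return x * 2


async def handle_request(x):
    # Offload the blocking call to a worker thread so the event loop
    # stays free to accept other requests in the meantime.
    return await asyncio.to_thread(blocking_inference, x)


async def main():
    start = time.perf_counter()
    results = await asyncio.gather(*(handle_request(i) for i in range(4)))
    return results, time.perf_counter() - start


results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```

The four simulated requests overlap instead of running back to back. With a single GPU, real inference calls may still serialize on the device, so measure before relying on this.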

Conclusion

Deploying ML models on Vultr is straightforward when you follow these steps. From model preparation to containerized deployment, Vultr's GPU instances provide the compute power you need at competitive prices.

Key takeaways:

  1. Convert your model to ONNX for better performance
  2. Use GPU-enabled ONNX Runtime for accelerated inference
  3. Containerize with Docker for easy deployment
  4. Set up nginx for production traffic handling
  5. Monitor costs and scale as needed

Ready to deploy your ML model? Get started with Vultr GPU instances and start serving predictions in minutes.

For more tutorials, check out our guide on Vultr GPU Instances for AI Development.