Proxmox-based VM with NVIDIA RTX 3060 for LLM Inference & Benchmarking

11 March 2025
Published in AGI program

This guide outlines the setup and optimization of a Proxmox-based VM using an NVIDIA RTX 3060 (12GB) to test and benchmark local LLMs such as DeepSeek, LLaMA, Manus, and Phidata. The system is configured for high-speed inference, allowing real-time interaction with models via Open WebUI or APIs.


System Specification Overview

| Component | Specification | Purpose & Optimization |
| Virtualization | Proxmox VE | Enables flexible resource allocation and VM management for LLM testing. |
| OS | Ubuntu 22.04 LTS | Stable Linux environment with broad AI/ML support. |
| CPU | AMD Ryzen 7 7800X3D / Intel i7 or better | Multi-core processing for AI inference and system stability. |
| RAM | ≥ 32GB DDR5 | Prevents bottlenecks when loading large models into memory. |
| Storage | 1TB WD Black SN850X NVMe SSD | Fast disk I/O to support large LLM files and smooth execution. |
| GPU | NVIDIA RTX 3060 (12GB VRAM) | Accelerates AI inference using CUDA/Tensor cores, ideal for quantized models. |
| Power Supply | 850W 80+ Gold (Thermaltake SG850S) | Ensures system stability when running extended AI workloads. |
| Cooling | ASUS LC 240 ARGB AIO | Keeps CPU temperatures low during intensive processing. |
| Networking | 1 Gbps LAN / Wi-Fi 6 | Ensures smooth access to models via WebUI or API calls. |

AI Workload & LLM Testing Scenarios

This system is optimized for local AI inference, allowing real-time interaction with chat models. The NVIDIA RTX 3060 (12GB) delivers acceptable token speeds for most 7B models, while the CPU and RAM headroom allow larger models to run with partial CPU offloading.

| LLM Model | Quantization | GPU Usage | Expected Token Speed (RTX 3060) |
| DeepSeek R1 1.5B | Q4 / Q6_K | Low (~3GB) | ~50-100 tokens/sec |
| LLaMA 3 7B | Q4 / Q6_K | Medium (~6-10GB) | ~30-60 tokens/sec |
| Manus 7B | Q4_K | Medium (~7GB) | ~40-60 tokens/sec |
| Phidata LLM | Q4_K | Medium (~7GB) | ~40-70 tokens/sec |

Notes:

  • Using quantized models (GGUF Q4/Q6) is recommended for optimal performance on the RTX 3060.
  • Token speeds depend on context length, n-gpu-layers settings, and model size.
  • Large models (e.g., 13B or 30B LLMs) may require CPU offloading or more aggressive quantization (e.g., Q3/Q2) to fit within VRAM limits.
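As a sanity check against these estimates, once Ollama is running (see the deployment steps below) the quantization and parameter count of a pulled model can be inspected with ollama show, and actual VRAM use compared via nvidia-smi; the model tag below is only an example:

bash
sudo docker exec ollama ollama show deepseek-r1:1.5b
nvidia-smi --query-gpu=memory.used,memory.total --format=csv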

Step-by-Step Deployment Guide

1️⃣ Install & Configure Proxmox VM

  1. Allocate 8+ CPU cores, 32GB RAM, and 100GB+ SSD storage.
  2. Pass the NVIDIA GPU through to the VM and verify that it is visible from inside the guest:
    bash
    sudo apt install -y pciutils
    lspci | grep -i nvidia
  3. Enable IOMMU in the Proxmox host's GRUB settings and attach the GPU to the VM (a host-side example follows below).
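A minimal host-side sketch, assuming an AMD CPU, a q35 VM with ID 100, and a GPU at PCI address 01:00 (all of these are placeholders; adjust them to your own hardware and VM):

bash
# On the Proxmox host: enable IOMMU (use intel_iommu=on instead on Intel CPUs)
sudo sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="quiet"/GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt"/' /etc/default/grub
sudo update-grub

# Load the VFIO modules required for PCI passthrough
echo -e "vfio\nvfio_iommu_type1\nvfio_pci" | sudo tee -a /etc/modules
sudo reboot

# After the reboot, attach the GPU to the VM (VM ID and PCI address are examples; pcie=1 assumes a q35 machine type)
qm set 100 -hostpci0 01:00,pcie=1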

2️⃣ Install NVIDIA Drivers & Docker

bash
sudo apt update && sudo apt upgrade -y
sudo apt install -y nvidia-driver-535 nvidia-cuda-toolkit
sudo reboot
  • Verify GPU availability:

    bash
    nvidia-smi
  • Install Docker & NVIDIA Container Toolkit:

    bash
    sudo apt install -y docker.io
    distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
      sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
    curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
      sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
      sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    sudo apt update && sudo apt install -y nvidia-container-toolkit
    sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker
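To confirm that containers can actually see the GPU before deploying Ollama, a quick smoke test (the CUDA image tag below is just an example; any recent CUDA base image works):

bash
sudo docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi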

3️⃣ Deploy Ollama for Local LLM Serving

bash
docker volume create ollama_models docker run -d --gpus all -v ollama_data:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
  • Pull models (DeepSeek, LLaMA, Manus, etc.):
    bash
    sudo docker exec ollama ollama pull deepseek-r1:1.5b
    sudo docker exec ollama ollama pull llama3:7b
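Once a model has been pulled, the Ollama HTTP API on port 11434 can also be called directly, which is useful for scripted benchmarking; the prompt is arbitrary and "stream": false simply returns one complete JSON response:

bash
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:1.5b",
  "prompt": "Explain quantization in one sentence.",
  "stream": false
}'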

4️⃣ Install Open WebUI for Chat Interface

bash
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
  -v openwebui_data:/app/backend/data --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main
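The chat interface should then be reachable at http://<VM-IP>:3000. If no models appear, check that the container started cleanly and that Ollama answers locally (names and ports as configured above):

bash
docker logs --tail 20 open-webui
curl http://localhost:11434/api/tags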

5️⃣ Verify GPU Acceleration

  • Run a test inference (inside the Ollama container, using one of the models pulled earlier):
    bash
    sudo docker exec -it ollama ollama run deepseek-r1:1.5b "What is machine learning?"
  • Check GPU usage:
    bash
    nvidia-smi
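For concrete tokens-per-second numbers rather than a visual GPU check, ollama run supports a --verbose flag that prints timing statistics (prompt eval rate and eval rate) after each answer; the model tag and prompt below are just examples:

bash
sudo docker exec -it ollama ollama run --verbose deepseek-r1:1.5b "Summarize the benefits of model quantization."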

Key Considerations for Performance Optimization

  1. Fine-tune n-gpu-layers

    • If VRAM is fully used, lower n-gpu-layers (e.g., from 33 → 20).
    • Example launch (the n-gpu-layers flag applies to llama.cpp-based front-ends such as text-generation-webui; for the Ollama equivalent, see the Modelfile sketch after this list):
      bash
      python server.py --listen --auto-devices --gpu-memory 75 --n-gpu-layers 33
  2. Use Quantized GGUF Models

    • Prefer Q4_K / Q6_K quantization for balance between speed and accuracy.
    • With the Docker setup above, models live in the ollama_data volume (mounted at /root/.ollama inside the container); a native Ollama install stores them under:
      /usr/share/ollama/.ollama/models
  3. Monitor Disk & RAM Usage

    • Regularly check disk space:
      bash
      df -h
    • Remove unused Docker images:
      bash
      docker system prune -a
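For Ollama itself, the number of layers offloaded to the GPU can be tuned per model with the num_gpu parameter in a custom Modelfile; a minimal sketch, assuming the llama3:7b tag pulled earlier (the derived name llama3-20layers and the values are illustrative):

bash
# Create a model variant that offloads only 20 layers to the GPU
cat > Modelfile <<'EOF'
FROM llama3:7b
PARAMETER num_gpu 20
PARAMETER num_ctx 4096
EOF
sudo docker cp Modelfile ollama:/tmp/Modelfile
sudo docker exec ollama ollama create llama3-20layers -f /tmp/Modelfile
sudo docker exec -it ollama ollama run llama3-20layers "Hello"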

Expected Outcomes

  • Seamless local LLM chat experience via Open WebUI
  • Fast token generation (roughly 30-100 tokens/sec, depending on model size and quantization)
  • Optimized NVIDIA RTX 3060 utilization
  • Scalable environment for adding more models

This guide provides a full setup for a Proxmox-based VM with GPU acceleration to run, test, and optimize LLM inference efficiently. You can now load DeepSeek, LLaMA, Manus, and Phidata into Open WebUI and benchmark token speeds.
