Proxmox-based VM with NVIDIA RTX 3060 for LLM Inference & Benchmarking

11 March 2025
Published in AGI program

This guide outlines the setup and optimization of a Proxmox-based VM using an NVIDIA RTX 3060 (12GB) to test and benchmark local LLMs such as DeepSeek, LLaMA, Manus, and Phidata. The system is configured for high-speed inference, allowing real-time interaction with models via Open WebUI or APIs.


System Specification Overview

| Component | Specification | Purpose & Optimization |
| Virtualization | Proxmox VE | Enables flexible resource allocation and VM management for LLM testing. |
| OS | Ubuntu 22.04 LTS | Stable Linux environment with broad AI/ML support. |
| CPU | AMD Ryzen 7 7800X3D / Intel i7 or better | Multi-core processing for AI inference and system stability. |
| RAM | ≥ 32GB DDR5 | Prevents bottlenecks when loading large models into memory. |
| Storage | 1TB WD Black SN850X NVMe SSD | Fast disk I/O to support large LLM files and smooth execution. |
| GPU | NVIDIA RTX 3060 (12GB VRAM) | Accelerates AI inference using CUDA/Tensor cores, ideal for quantized models. |
| Power Supply | 850W 80+ Gold (Thermaltake SG850S) | Ensures system stability when running extended AI workloads. |
| Cooling | ASUS LC 240 ARGB AIO | Keeps CPU temperatures low during intensive processing. |
| Networking | 1 Gbps LAN / Wi-Fi 6 | Ensures smooth access to models via WebUI or API calls. |

AI Workload & LLM Testing Scenarios

This system is optimized for local AI inference, allowing real-time interaction with chat models. The NVIDIA RTX 3060 (12GB) delivers acceptable token speeds for most 7B models, while the CPU and RAM headroom allow larger models to run with partial CPU offloading.

| LLM Model | Quantization | GPU Usage | Expected Token Speed (RTX 3060) |
| DeepSeek R1 1.5B | Q4 / Q6_K | Low (~3GB) | ~50-100 tokens/sec |
| LLaMA 3 7B | Q4 / Q6_K | Medium (~6-10GB) | ~30-60 tokens/sec |
| Manus 7B | Q4_K | Medium (~7GB) | ~40-60 tokens/sec |
| Phidata LLM | Q4_K | Medium (~7GB) | ~40-70 tokens/sec |

Notes:

  • Using quantized models (GGUF Q4/Q6) is recommended for optimal performance on the RTX 3060.
  • Token speeds depend on context length, n-gpu-layers settings, and model size.
  • Large models (e.g., 13B or 30B LLMs) may require CPU offloading or more aggressive quantization (e.g., Q3/Q2) to fit within VRAM limits.
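As a sanity check against these estimates, once Ollama is running (see the deployment steps below) the quantization and parameter count of a pulled model can be inspected with ollama show, and actual VRAM use compared via nvidia-smi; the model tag below is only an example:

bash
sudo docker exec ollama ollama show deepseek-r1:1.5b
nvidia-smi --query-gpu=memory.used,memory.total --format=csv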

Step-by-Step Deployment Guide

1️⃣ Install & Configure Proxmox VM

  1. Allocate 8+ CPU cores, 32GB RAM, and 100GB+ SSD storage.
  2. Pass the NVIDIA GPU through to the VM and verify that it is visible from inside the guest:
    bash
    sudo apt install -y pciutils
    lspci | grep -i nvidia
  3. Enable IOMMU in the Proxmox host's GRUB settings and attach the GPU to the VM (a host-side example follows below).
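A minimal host-side sketch, assuming an AMD CPU, a q35 VM with ID 100, and a GPU at PCI address 01:00 (all of these are placeholders; adjust them to your own hardware and VM):

bash
# On the Proxmox host: enable IOMMU (use intel_iommu=on instead on Intel CPUs)
sudo sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="quiet"/GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt"/' /etc/default/grub
sudo update-grub

# Load the VFIO modules required for PCI passthrough
echo -e "vfio\nvfio_iommu_type1\nvfio_pci" | sudo tee -a /etc/modules
sudo reboot

# After the reboot, attach the GPU to the VM (VM ID and PCI address are examples; pcie=1 assumes a q35 machine type)
qm set 100 -hostpci0 01:00,pcie=1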

2️⃣ Install NVIDIA Drivers & Docker

bash
sudo apt update && sudo apt upgrade -y
sudo apt install -y nvidia-driver-535 nvidia-cuda-toolkit
sudo reboot
  • Verify GPU availability:

    bash
    nvidia-smi
  • Install Docker & NVIDIA Container Toolkit:

    bash
    sudo apt install -y docker.io
    distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
      sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
    curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
      sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
      sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    sudo apt update && sudo apt install -y nvidia-container-toolkit
    sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker
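To confirm that containers can actually see the GPU before deploying Ollama, a quick smoke test (the CUDA image tag below is just an example; any recent CUDA base image works):

bash
sudo docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi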

3️⃣ Deploy Ollama for Local LLM Serving

bash
docker volume create ollama_models docker run -d --gpus all -v ollama_data:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
  • Pull models (DeepSeek, LLaMA, Manus, etc.):
    bash
    sudo docker exec ollama ollama pull deepseek-r1:1.5b
    sudo docker exec ollama ollama pull llama3:7b
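Once a model has been pulled, the Ollama HTTP API on port 11434 can also be called directly, which is useful for scripted benchmarking; the prompt is arbitrary and "stream": false simply returns one complete JSON response:

bash
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:1.5b",
  "prompt": "Explain quantization in one sentence.",
  "stream": false
}'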

4️⃣ Install Open WebUI for Chat Interface

bash
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
  -v openwebui_data:/app/backend/data --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main
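The chat interface should then be reachable at http://<VM-IP>:3000. If no models appear, check that the container started cleanly and that Ollama answers locally (names and ports as configured above):

bash
docker logs --tail 20 open-webui
curl http://localhost:11434/api/tags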

5️⃣ Verify GPU Acceleration

  • Run a test inference (inside the Ollama container, using one of the models pulled earlier):
    bash
    sudo docker exec -it ollama ollama run deepseek-r1:1.5b "What is machine learning?"
  • Check GPU usage:
    bash
    nvidia-smi
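For concrete tokens-per-second numbers rather than a visual GPU check, ollama run supports a --verbose flag that prints timing statistics (prompt eval rate and eval rate) after each answer; the model tag and prompt below are just examples:

bash
sudo docker exec -it ollama ollama run --verbose deepseek-r1:1.5b "Summarize the benefits of model quantization."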

Key Considerations for Performance Optimization

  1. Fine-tune n-gpu-layers

    • If VRAM is fully used, lower n-gpu-layers (e.g., from 33 → 20).
    • Example launch (the n-gpu-layers flag applies to llama.cpp-based front-ends such as text-generation-webui; for the Ollama equivalent, see the Modelfile sketch after this list):
      bash
      python server.py --listen --auto-devices --gpu-memory 75 --n-gpu-layers 33
  2. Use Quantized GGUF Models

    • Prefer Q4_K / Q6_K quantization for balance between speed and accuracy.
    • With the Docker setup above, models live in the ollama_data volume (mounted at /root/.ollama inside the container); a native Ollama install stores them under:
      /usr/share/ollama/.ollama/models
  3. Monitor Disk & RAM Usage

    • Regularly check disk space:
      bash
      df -h
    • Remove unused Docker images:
      bash
      docker system prune -a
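For Ollama itself, the number of layers offloaded to the GPU can be tuned per model with the num_gpu parameter in a custom Modelfile; a minimal sketch, assuming the llama3:7b tag pulled earlier (the derived name llama3-20layers and the values are illustrative):

bash
# Create a model variant that offloads only 20 layers to the GPU
cat > Modelfile <<'EOF'
FROM llama3:7b
PARAMETER num_gpu 20
PARAMETER num_ctx 4096
EOF
sudo docker cp Modelfile ollama:/tmp/Modelfile
sudo docker exec ollama ollama create llama3-20layers -f /tmp/Modelfile
sudo docker exec -it ollama ollama run llama3-20layers "Hello"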

Expected Outcomes

  • Seamless local LLM chat experience via Open WebUI
  • Fast token generation (roughly 30-100 tokens/sec, depending on model size and quantization)
  • Optimized NVIDIA RTX 3060 utilization
  • Scalable environment for adding more models

This guide provides a full setup for a Proxmox-based VM with GPU acceleration to run, test, and optimize LLM inference efficiently. You can now load DeepSeek, LLaMA, Manus, and Phidata into Open WebUI and benchmark token speeds.
