System Specification Overview
Component | Specification | Purpose & Optimization |
---|---|---|
Virtualization | Proxmox VE | Enables flexible resource allocation and VM management for LLM testing. |
OS | Ubuntu 22.04 LTS | Stable Linux environment with broad AI/ML support. |
CPU | AMD Ryzen 7 7800X3D / Intel i7 or better | Multi-core processing for AI inference and system stability. |
RAM | ≥ 32GB DDR5 | Prevents bottlenecks when loading large models into memory. |
Storage | 1TB WD Black SN850X NVMe SSD | Fast disk I/O to support large LLM files and smooth execution. |
GPU | NVIDIA RTX 3060 (12GB VRAM) | Accelerates AI inference using CUDA/Tensor cores, ideal for quantized models. |
Power Supply | 850W 80+ Gold (Thermaltake SG850S) | Ensures system stability when running extended AI workloads. |
Cooling | ASUS LC 240 ARGB AIO | Keeps CPU temperatures low during intensive processing. |
Networking | 1 Gbps LAN / Wi-Fi 6 | Ensures smooth access to models via WebUI or API calls. |
AI Workload & LLM Testing Scenarios
This system is optimized for local AI inference, allowing real-time interaction with chat models. The NVIDIA RTX 3060 (12GB) provides acceptable token speeds for most 7B-class models, while the CPU and RAM headroom allow larger models to run with partial CPU offloading.
LLM Model | Quantization | GPU Usage | Expected Token Speed (RTX 3060) |
---|---|---|---|
DeepSeek R1 1.5B | Q4 / Q6_K | Low (~3GB) | ~50-100 tokens/sec |
LLaMA 3 8B | Q4 / Q6_K | Medium (~6-10GB) | ~30-60 tokens/sec |
Manus 7B | Q4_K | Medium (~7GB) | ~40-60 tokens/sec |
Phidata LLM | Q4_K | Medium (~7GB) | ~40-70 tokens/sec |
Notes:
- Using quantized models (GGUF Q4/Q6) is recommended for optimal performance on the RTX 3060.
- Token speeds depend on context length, n-gpu-layers settings, and model size.
- Large models (e.g., 13B or 30B LLMs) may require CPU offloading or more aggressive quantization to fit within the 12GB VRAM limit.
Step-by-Step Deployment Guide
1️⃣ Install & Configure Proxmox VM
- Allocate 8+ CPU cores, 32GB RAM, and 100GB+ SSD storage.
- Pass the NVIDIA GPU through to the VM:
  - Enable IOMMU in the Proxmox host's GRUB settings (sketch below), then add the GPU to the VM as a PCI device.
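A minimal sketch of the host-side IOMMU setup, assuming an AMD host (use `intel_iommu=on` on Intel); the exact PCI IDs and VM settings will differ per system:

```bash
# On the Proxmox host: enable IOMMU in /etc/default/grub (AMD shown; use intel_iommu=on on Intel)
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt"
# Then apply the change and load the VFIO modules used for PCI passthrough:
update-grub
echo -e "vfio\nvfio_iommu_type1\nvfio_pci" >> /etc/modules
reboot

# After reboot, confirm IOMMU is active before adding the GPU as a PCI device to the VM
dmesg | grep -i iommu
```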
2️⃣ Install NVIDIA Drivers & Docker
- Install the NVIDIA driver inside the VM, then verify GPU availability with `nvidia-smi`.
- Install Docker & the NVIDIA Container Toolkit (see the sketch below).
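A minimal sketch of the commands inside the Ubuntu 22.04 VM; the driver version is an example, and the toolkit install assumes NVIDIA's apt repository has already been added per NVIDIA's documentation:

```bash
# Install an NVIDIA driver (version is an example; pick a current one for the RTX 3060)
sudo apt update
sudo apt install -y nvidia-driver-535
sudo reboot

# Confirm the GPU is visible inside the VM
nvidia-smi

# Install Docker via the convenience script
curl -fsSL https://get.docker.com | sh

# Install the NVIDIA Container Toolkit and wire it into Docker
# (assumes NVIDIA's container-toolkit apt repository has been added first)
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```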
3️⃣ Deploy Ollama for Local LLM Serving
- Run the Ollama container with GPU access, then pull models (DeepSeek, LLaMA, Manus, etc.); see the sketch below.
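A minimal sketch using the official Ollama Docker image; the model tags are examples, and Manus/Phidata tags depend on how those models are published:

```bash
# Run Ollama with GPU access (requires the NVIDIA Container Toolkit from step 2)
docker run -d --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama

# Pull models (tags are examples; check the Ollama library for exact names)
docker exec -it ollama ollama pull deepseek-r1:1.5b
docker exec -it ollama ollama pull llama3:8b
```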
4️⃣ Install Open WebUI for Chat Interface
- Deploy the Open WebUI container (see the sketch below).
- Access the WebUI at: http://192.168.x.x:3000
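A minimal sketch of the Open WebUI container, assuming Ollama is reachable on the Docker host at port 11434 (the image and port mapping follow Open WebUI's documented defaults):

```bash
# Open WebUI on port 3000, talking to Ollama via the host gateway
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main
```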
5️⃣ Verify GPU Acceleration
- Run a test inference against a pulled model.
- Check GPU usage with `nvidia-smi` while tokens are generating (see the sketch below).
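A minimal sketch, assuming the llama3:8b tag was pulled in step 3:

```bash
# Run a quick test prompt (model tag is an example)
docker exec -it ollama ollama run llama3:8b "Explain GPU passthrough in one sentence."

# In a second terminal, watch VRAM and utilization while the model generates
watch -n 1 nvidia-smi
```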
Key Considerations for Performance Optimization
- Fine-tune `n-gpu-layers`
  - If VRAM is fully used, lower `n-gpu-layers` (e.g., from 33 → 20).
  - Example launch: see the sketch after this list.
- Use quantized GGUF models
  - Prefer Q4_K / Q6_K quantization for a balance between speed and accuracy.
  - Store models under the Ollama data directory (by default `~/.ollama/models` on a host install, or the mounted Docker volume).
- Monitor disk & RAM usage
  - Regularly check disk space and memory (see the commands after this list).
  - Remove unused Docker images to reclaim space.
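For the `n-gpu-layers` and quantization points above, a minimal sketch using Ollama's equivalent setting (`num_gpu`, the number of layers offloaded to the GPU); the model tag and layer count are examples:

```bash
# Create a model variant that offloads fewer layers to the 12GB GPU
cat > Modelfile <<'EOF'
FROM llama3:8b-instruct-q4_K_M
PARAMETER num_gpu 20
EOF

# Copy the Modelfile into the Ollama container and build/run the variant
docker cp Modelfile ollama:/Modelfile
docker exec -it ollama ollama create llama3-q4-20layers -f /Modelfile
docker exec -it ollama ollama run llama3-q4-20layers "Hello"
```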
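For the disk and RAM monitoring points, a few standard commands:

```bash
# Check free disk space and memory
df -h
free -h

# See what Docker is consuming, then remove unused images
docker system df
docker image prune -a
```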
Expected Outcomes
- Seamless local LLM chat experience via Open WebUI
- Fast token generation (roughly 30-100 tokens/sec depending on model size and quantization)
- Optimized NVIDIA RTX 3060 utilization
- Scalable environment for adding more models
This guide provides a full setup for a Proxmox-based VM with GPU acceleration to run, test, and optimize LLM inference efficiently. You can now load DeepSeek, LLaMA, Manus, and Phidata into Open WebUI and benchmark token speeds.