# 📚 Recommended Stacks

Pre-configured combinations of launcher + engine + format + quantization for your hardware.

## 🟢 NVIDIA Stacks

For NVIDIA GPU users: the largest user base, with clear VRAM-based recommendations.

### 8GB VRAM (RTX 3060/3070, RTX 4060)

**Beginner**
- Stack: ollama + llama.cpp
- Formats: gguf
- Quantization: Q4_K_M
- Install:
  ```bash
  curl -fsSL https://ollama.com/install.sh | sh
  ollama pull llama3.2:3b
  ```
- 💡 7B models at Q4 run smoothly; 13B is challenging.
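
Once the model is pulled, you can talk to it either interactively in the terminal or over Ollama's local HTTP API, which listens on port 11434 by default. A minimal smoke test, assuming the default install above:

```bash
# Interactive chat in the terminal
ollama run llama3.2:3b

# Or query the local REST API (default port 11434)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Explain GGUF quantization in one sentence.",
  "stream": false
}'
```
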
**GUI**
- Stack: lm-studio + llama.cpp
- Formats: gguf
- Quantization: Q4_K_M
- Install: download from [lmstudio.ai](https://lmstudio.ai)
- 💡 Easy GUI with simple model management.
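
LM Studio also ships a local server mode that exposes an OpenAI-compatible API (port 1234 by default in current builds), so the same GGUF model can serve scripts as well as the GUI. A hedged sketch; the port and the "model" field depend on your LM Studio configuration:

```bash
# Assumes LM Studio's local server is enabled and a model is already loaded in the GUI
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [{"role": "user", "content": "Hello from the command line"}]
  }'
```
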
**Power**
- Stack: text-generation-webui + llama.cpp
- Formats: gguf, gptq
- Quantization: Q4_K_M or GPTQ 4-bit
- 💡 For users who need fine-grained control (install sketch below).
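
No install command is given for this stack above; the usual route is the project's one-click installer scripts. A rough sketch, assuming the oobabooga/text-generation-webui repository layout at the time of writing (script names may change between releases):

```bash
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
# The one-click installer sets up its own environment; pick the NVIDIA option when prompted
./start_linux.sh        # use start_windows.bat on Windows
# Then open http://localhost:7860 and load a GGUF or GPTQ model from the Model tab
```
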

### 12GB VRAM (RTX 3060 12GB, RTX 4070)

**Beginner**
- Stack: ollama + llama.cpp
- Formats: gguf
- Quantization: Q4_K_M to Q5_K_M
- Install:
  ```bash
  curl -fsSL https://ollama.com/install.sh | sh
  ollama pull llama3.1:8b
  ```
- 💡 7-8B models at Q5/Q6 run smoothly; 13B is possible.
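
To confirm that the quantization you picked actually stays inside 12 GB instead of spilling into system RAM, check where Ollama placed the loaded model. A quick check, assuming the install above and a recent Ollama CLI:

```bash
# Load the model once, then inspect the placement
ollama run llama3.1:8b "hi" > /dev/null
ollama ps        # the PROCESSOR column should read "100% GPU"
nvidia-smi       # VRAM usage should sit comfortably under 12 GB
```
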
**GUI**
- Stack: lm-studio + llama.cpp
- Formats: gguf
- Quantization: Q5_K_M
- 💡 13B models work well.

**Power**
- Stack: open-webui + ollama
- Formats: gguf
- Quantization: Q5_K_M
- 💡 ChatGPT-style UI on top of an Ollama backend (setup sketch below).
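
The table gives no install command for this pairing; the route documented by the Open WebUI project is to run it in Docker next to an existing Ollama install. A sketch under those assumptions (Docker available, Ollama already listening on its default port 11434):

```bash
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main
# Then browse to http://localhost:3000 and pick any pulled Ollama model from the dropdown
```
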

### 16GB VRAM (RTX 4080, RTX 4070 Ti Super)

**Beginner**
- Stack: ollama + llama.cpp
- Formats: gguf
- Quantization: Q5_K_M to Q6_K
- Install:
  ```bash
  curl -fsSL https://ollama.com/install.sh | sh
  ollama pull llama3.1:8b
  ```
- 💡 8B-13B models run smoothly; the extra VRAM leaves room for higher-quality quants.
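
The default llama3.1:8b tag is a mid-range Q4 quant; with 16 GB you can afford a higher-quality one. The tag below is taken from the Ollama model library and may change over time, so check ollama.com/library/llama3.1 for the current list:

```bash
# Pull an explicitly higher-quality quant instead of the default tag
ollama pull llama3.1:8b-instruct-q6_K
ollama run llama3.1:8b-instruct-q6_K
```
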
**GUI**
- Stack: lm-studio + llama.cpp
- Formats: gguf
- Quantization: Q6_K
- 💡 Runs 13B models with high-quality quantization.

**Power**
- Stack: vllm + vllm
- Formats: safetensors, awq
- Quantization: AWQ 4-bit
- Install:
  ```bash
  pip install vllm
  vllm serve meta-llama/Llama-3.1-8B-Instruct
  ```
- 💡 For production server use; supports batching.
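
vllm serve exposes an OpenAI-compatible API (port 8000 by default), so any OpenAI client or plain curl can talk to it; note that Meta's Llama repositories on Hugging Face are gated, so log in with huggingface-cli before the first download. A minimal request against the server started above:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Give one good use case for a local LLM."}],
    "max_tokens": 128
  }'
```
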

### 24GB+ VRAM (RTX 3090, RTX 4090, A5000)

**Beginner**
- Stack: ollama + llama.cpp
- Formats: gguf
- Quantization: Q6_K to Q8_0
- Install:
  ```bash
  curl -fsSL https://ollama.com/install.sh | sh
  ollama pull llama3.1:70b-instruct-q4_K_M
  ```
- 💡 70B at Q4 runs, though partially offloaded to CPU and correspondingly slower; everything smaller fits fully in VRAM. The ultimate local setup.
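
A rough size check explains the offload: 70B weights at roughly 4.7 bits each (Q4_K_M) come to about 40 GB before the KV cache, well past a single 24 GB card, so llama.cpp keeps part of the model in system RAM. A hedged way to inspect what you pulled, assuming current Ollama CLI behavior:

```bash
# Show parameter count, quantization level, and context length of the pulled model
ollama show llama3.1:70b-instruct-q4_K_M

# Back-of-envelope weight size: 70e9 params * ~4.7 bits / 8 bits-per-byte ≈ 41 GB
```
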
**GUI**
- Stack: lm-studio + llama.cpp
- Formats: gguf
- Quantization: Q8_0
- 💡 High-quality quants (Q6/Q8) for 13B-20B-class models; 70B still needs 4-bit quantization and partial CPU offload.

**Power**
- Stack: vllm + vllm
- Formats: safetensors, awq, gptq
- Quantization: FP16 for models up to ~13B; AWQ/GPTQ 4-bit for larger ones
- Install:
  ```bash
  pip install vllm
  # FP16 70B weights need ~140 GB of VRAM; on 24 GB cards, point vllm serve at an
  # AWQ/GPTQ-quantized 70B checkpoint and spread it across GPUs with --tensor-parallel-size
  vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 2
  ```
- 💡 Production-ready server with batching. FP16 is realistic up to ~13B on one card; 70B requires 4-bit weights and multiple GPUs.
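
Whichever model you end up serving, the same OpenAI-compatible endpoints apply as in the 16GB tier. A quick way to confirm the server is up and to find the exact model name to pass in requests (default port 8000):

```bash
# List the model(s) the running vLLM server exposes
curl http://localhost:8000/v1/models

# Watch VRAM headroom while the weights are loading
watch -n 1 nvidia-smi
```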