Last Updated: 3/19/2026
Installation
This guide covers how to install vLLM on different hardware platforms and operating systems.
Prerequisites
Before installing vLLM, ensure your system meets the following requirements:
- Operating System: Linux (Ubuntu 20.04 or later recommended)
- Python: 3.10, 3.11, 3.12, or 3.13
- Hardware: Compatible GPU or TPU (see platform-specific sections below)
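As a quick sanity check before installing, you can verify that your interpreter falls in the supported range. A minimal stdlib sketch; the version bounds mirror the prerequisites above:

```python
import sys

# Supported range from the prerequisites above: Python 3.10 through 3.13.
MIN_VERSION, MAX_VERSION = (3, 10), (3, 13)

def python_supported(version=sys.version_info[:2]) -> bool:
    """Return True if the given (major, minor) pair is in the supported range."""
    return MIN_VERSION <= version <= MAX_VERSION

print(python_supported())
```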
Installation Methods
NVIDIA CUDA
If you are using NVIDIA GPUs, you can install vLLM directly using pip or uv.
Using uv (Recommended)
uv is a fast Python package and environment manager that can automatically select the appropriate PyTorch backend based on your CUDA driver version.
First, install uv by following the official documentation. Then create a Python environment and install vLLM:
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
The --torch-backend=auto flag (or the UV_TORCH_BACKEND=auto environment variable) automatically detects your CUDA version and selects the appropriate PyTorch index. To pin a particular CUDA version, use, for example, --torch-backend=cu126 for CUDA 12.6.
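If you prefer to pin the backend yourself, the CUDA version to target is the one reported in the nvidia-smi banner. The mapping can be sketched with a small helper (hypothetical, stdlib-only; not part of vLLM or uv):

```python
import re

def torch_backend_flag(nvidia_smi_output: str) -> str:
    """Derive a --torch-backend value (e.g. 'cu126') from nvidia-smi banner text.

    Falls back to 'auto' when no CUDA version is found. Hypothetical helper,
    not part of vLLM or uv.
    """
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if not match:
        return "auto"
    return f"cu{match.group(1)}{match.group(2)}"

banner = "| NVIDIA-SMI 560.35.03  Driver Version: 560.35.03  CUDA Version: 12.6 |"
print(torch_backend_flag(banner))  # → cu126
```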
You can also run vLLM commands without creating a permanent environment using uv run:
uv run --with vllm vllm --help
Using pip
For a standard pip installation:
pip install vllm
Using conda
You can also use conda to manage your Python environment:
conda create -n myenv python=3.12 -y
conda activate myenv
pip install --upgrade uv
uv pip install vllm --torch-backend=auto
AMD ROCm
For AMD GPUs, use uv to install vLLM with the ROCm-specific wheel:
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/
Note: Currently supports Python 3.12, ROCm 7.0, and glibc >= 2.35.
Google TPU
To run vLLM on Google Cloud TPUs, install the vllm-tpu package:
uv pip install vllm-tpu
For detailed instructions, including Docker setup, building from source, and troubleshooting, refer to the vLLM on TPU documentation.
Building from Source
For development or to build custom CUDA kernels, you can build vLLM from source.
Prerequisites for Building
Install build dependencies:
pip install --upgrade pip
pip install "cmake>=3.26.1" ninja "packaging>=24.2" "setuptools>=77.0.3" wheel torch
Clone and Build
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .
This will compile the CUDA kernels and install vLLM in editable mode.
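To confirm that the editable install resolves to your checkout rather than a site-packages copy, you can ask Python where it finds the package. A stdlib sketch that works for any package name:

```python
import importlib.util

def package_location(name: str):
    """Return the file a package resolves to, or None if it is not importable."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec and spec.origin else None

# For an editable install this path should point inside your vllm git clone.
print(package_location("vllm"))
```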
Build Configuration
You can customize the build with environment variables:
- TORCH_CUDA_ARCH_LIST: Specify target GPU architectures (e.g., "7.0 7.5 8.0 8.9 9.0")
- MAX_JOBS: Control parallel compilation jobs (default: 2)
- NVCC_THREADS: Number of threads for nvcc (default: 8)
Example:
TORCH_CUDA_ARCH_LIST="8.0 9.0" MAX_JOBS=4 pip install -e .
Hardware Requirements
NVIDIA GPUs
- Compute Capability: 7.0 or higher (Volta, Turing, Ampere, Ada, Hopper, Blackwell architectures)
- CUDA: 12.1 or later recommended
- VRAM: Depends on model size; at least 16GB recommended for 7B models
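The 16GB figure for 7B models follows from the weight footprint alone: parameters times bytes per parameter, before any KV cache or activations. A back-of-the-envelope helper (illustrative, not a vLLM API):

```python
def weight_memory_gib(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate model weight footprint in GiB (fp16/bf16 = 2 bytes per param)."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

print(round(weight_memory_gib(7), 1))     # 7B at fp16: ~13.0 GiB of weights alone
print(round(weight_memory_gib(7, 1), 1))  # the same model quantized to 8-bit: ~6.5 GiB
```

This is why 16GB is a floor rather than a comfortable target: the remaining headroom above the weights is what vLLM uses for the KV cache.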
AMD GPUs
- ROCm: 7.0 or later
- Supported GPUs: MI200 series, MI300 series
Google TPUs
- TPU Version: v5e, v5p, or later
- Environment: Google Cloud TPU VMs
Verifying Installation
After installation, verify that vLLM is working correctly:
python -c "import vllm; print(vllm.__version__)"
You can also run a simple inference test:
python -c "from vllm import LLM; llm = LLM('facebook/opt-125m'); print(llm.generate('Hello, my name is'))"
Docker Installation
vLLM provides official Docker images for easy deployment. Pull the latest image:
docker pull vllm/vllm-openai:latest
Run the container:
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model facebook/opt-125m
Troubleshooting
CUDA Out of Memory
If you encounter CUDA out-of-memory errors, try:
- Reducing --gpu-memory-utilization (default: 0.9)
- Decreasing --max-num-seqs to limit concurrent requests
- Using a smaller model or a quantized version
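To see why lowering --max-num-seqs helps, consider the worst-case KV-cache footprint, which grows linearly with the number of concurrent sequences. A rough sizing sketch; the model dimensions below are illustrative, Llama-7B-like assumptions:

```python
def kv_cache_gib(max_num_seqs: int, max_seq_len: int, num_layers: int = 32,
                 num_heads: int = 32, head_dim: int = 128, dtype_bytes: int = 2) -> float:
    """Worst-case KV-cache size in GiB: 2 (K and V) x seqs x tokens x per-layer dims."""
    per_token = 2 * num_layers * num_heads * head_dim * dtype_bytes  # bytes per token
    return max_num_seqs * max_seq_len * per_token / 1024**3

print(kv_cache_gib(64, 2048))  # → 64.0 GiB worst case at 64 concurrent sequences
print(kv_cache_gib(16, 2048))  # → 16.0 GiB after lowering max_num_seqs
```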
Import Errors
If you see import errors after installation:
- Ensure your Python version is 3.10-3.13
- Verify CUDA is properly installed: nvidia-smi
- Check that PyTorch can access CUDA: python -c "import torch; print(torch.cuda.is_available())"
Build Failures
When building from source:
- Ensure you have sufficient disk space (at least 10GB free)
- Install required system packages: apt-get install -y build-essential python3-dev
- Check that the CUDA toolkit is installed and nvcc is in your PATH
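The disk-space check is easy to script; a stdlib sketch that mirrors the 10GB guidance above:

```python
import shutil

def enough_disk_space(path: str = ".", required_gib: float = 10.0) -> bool:
    """Return True if the filesystem containing `path` has the required free space."""
    free_gib = shutil.disk_usage(path).free / 1024**3
    return free_gib >= required_gib

print(enough_disk_space())
```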
What’s Next
- Quick Start: Run your first offline batch inference and start the OpenAI-compatible server in minutes.
- Supported Models: Browse the full list of model families vLLM supports, including Llama, Mistral, Qwen, and multimodal models.