
Last Updated: 3/19/2026


Installation

This guide covers how to install vLLM on different hardware platforms and operating systems.

Prerequisites

Before installing vLLM, ensure your system meets the following requirements:

  • Operating System: Linux (Ubuntu 20.04 or later recommended)
  • Python: 3.10, 3.11, 3.12, or 3.13
  • Hardware: Compatible GPU or TPU (see platform-specific sections below)

Installation Methods

NVIDIA CUDA

If you are using NVIDIA GPUs, you can install vLLM directly using pip or uv.

uv is an extremely fast Python package and environment manager that can automatically select the appropriate PyTorch backend based on your CUDA driver version.

First, install uv by following the official documentation. Then create a Python environment and install vLLM:

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto

The --torch-backend=auto flag (or UV_TORCH_BACKEND=auto environment variable) automatically detects your CUDA version and selects the appropriate PyTorch index. To specify a particular CUDA version, use --torch-backend=cu126 for CUDA 12.6, for example.
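The naming of these backend tags follows directly from the CUDA version. As an illustration only (torch_backend_tag is a hypothetical helper; uv performs this selection internally when --torch-backend=auto is used):

```python
def torch_backend_tag(cuda_version: str) -> str:
    """Map a CUDA version string to a PyTorch index tag, e.g. "12.6" -> "cu126".

    Hypothetical helper for illustration; uv does this selection itself
    based on the installed CUDA driver when --torch-backend=auto is set.
    """
    major, minor = cuda_version.split(".")[:2]
    return f"cu{major}{minor}"
```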

You can also run vLLM commands without creating a permanent environment using uv run:

uv run --with vllm vllm --help

Using pip

For a standard pip installation:

pip install vllm

Using conda

You can also use conda to manage your Python environment:

conda create -n myenv python=3.12 -y
conda activate myenv
pip install --upgrade uv
uv pip install vllm --torch-backend=auto

AMD ROCm

For AMD GPUs, use uv to install vLLM with the ROCm-specific wheel:

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/

Note: Currently supports Python 3.12, ROCm 7.0, and glibc >= 2.35.

Google TPU

To run vLLM on Google Cloud TPUs, install the vllm-tpu package:

uv pip install vllm-tpu

For detailed instructions including Docker setup, building from source, and troubleshooting, refer to the vLLM on TPU documentation.

Building from Source

For development or to build custom CUDA kernels, you can build vLLM from source.

Prerequisites for Building

Install build dependencies:

pip install --upgrade pip
pip install "cmake>=3.26.1" ninja "packaging>=24.2" "setuptools>=77.0.3" wheel torch

Clone and Build

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .

This will compile the CUDA kernels and install vLLM in editable mode.

Build Configuration

You can customize the build with environment variables:

  • TORCH_CUDA_ARCH_LIST: Specify target GPU architectures (e.g., "7.0 7.5 8.0 8.9 9.0")
  • MAX_JOBS: Control parallel compilation jobs (default: 2)
  • NVCC_THREADS: Number of threads for nvcc (default: 8)

Example:

TORCH_CUDA_ARCH_LIST="8.0 9.0" MAX_JOBS=4 pip install -e .

Hardware Requirements

NVIDIA GPUs

  • Compute Capability: 7.0 or higher (Volta, Turing, Ampere, Ada, Hopper, Blackwell architectures)
  • CUDA: 12.1 or later recommended
  • VRAM: Depends on model size; at least 16GB recommended for 7B models
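The compute-capability requirement can be checked from Python. A minimal sketch, assuming a machine with PyTorch installed (meets_min_compute_capability is a hypothetical helper name; the 7.0 minimum comes from the list above):

```python
def meets_min_compute_capability(major: int, minor: int,
                                 required: tuple[int, int] = (7, 0)) -> bool:
    """Return True if (major, minor) meets vLLM's minimum of 7.0 (Volta).

    On a CUDA machine, obtain the pair with:
        import torch
        major, minor = torch.cuda.get_device_capability()
    """
    return (major, minor) >= required
```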

AMD GPUs

  • ROCm: 7.0 or later
  • Supported GPUs: MI200 series, MI300 series

Google TPUs

  • TPU Version: v5e, v5p, or later
  • Environment: Google Cloud TPU VMs

Verifying Installation

After installation, verify that vLLM is working correctly:

python -c "import vllm; print(vllm.__version__)"

You can also run a simple inference test:

python -c "from vllm import LLM; llm = LLM('facebook/opt-125m'); print(llm.generate('Hello, my name is'))"

Docker Installation

vLLM provides official Docker images for easy deployment. Pull the latest image:

docker pull vllm/vllm-openai:latest

Run the container:

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model facebook/opt-125m
</docker run>
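Once the container is up, it serves an OpenAI-compatible HTTP API on port 8000. A minimal stdlib sketch of a completions request (the URL and payload assume the default port and the model launched above):

```python
import json
from urllib import request

def build_completion_request(prompt: str,
                             model: str = "facebook/opt-125m",
                             base_url: str = "http://localhost:8000") -> request.Request:
    """Build (but do not send) an OpenAI-style /v1/completions request."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 32}
    return request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually send it (requires the running container above):
# with request.urlopen(build_completion_request("Hello, my name is")) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```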

Troubleshooting

CUDA Out of Memory

If you encounter CUDA out-of-memory errors, try:

  • Reducing --gpu-memory-utilization (default: 0.9)
  • Decreasing --max-num-seqs to limit concurrent requests
  • Using a smaller model or quantized version

Import Errors

If you see import errors after installation:

  • Ensure your Python version is 3.10-3.13
  • Verify CUDA is properly installed: nvidia-smi
  • Check that PyTorch can access CUDA: python -c "import torch; print(torch.cuda.is_available())"
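The first and third checks above can be scripted. A sketch, assuming nothing beyond the standard library (python_version_supported is a hypothetical helper encoding the 3.10-3.13 range from the prerequisites):

```python
import sys

def python_version_supported(major: int = sys.version_info.major,
                             minor: int = sys.version_info.minor) -> bool:
    """True if the interpreter falls in vLLM's supported 3.10-3.13 range."""
    return (3, 10) <= (major, minor) <= (3, 13)

def cuda_available() -> bool:
    """The PyTorch CUDA check from the list above, guarded so it also
    runs on machines where PyTorch is not installed."""
    try:
        import torch
        return torch.cuda.is_available()
    except ImportError:
        return False
```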

Build Failures

When building from source:

  • Ensure you have sufficient disk space (at least 10GB free)
  • Install required system packages: apt-get install -y build-essential python3-dev
  • Check that CUDA toolkit is installed and nvcc is in your PATH

What’s Next

  • Quick Start: Run your first offline batch inference and start the OpenAI-compatible server in minutes.
  • Supported Models: Browse the full list of model families vLLM supports, including Llama, Mistral, Qwen, and multimodal models.