
Last Updated: 3/19/2026


Installation

This guide covers how to install vLLM on different hardware platforms and operating systems.

Prerequisites

Before installing vLLM, ensure your system meets the following requirements:

  • Operating System: Linux (Ubuntu 20.04 or later recommended)
  • Python: 3.10, 3.11, 3.12, or 3.13
  • Hardware: Compatible GPU or TPU (see platform-specific sections below)

Installation Methods

NVIDIA CUDA

If you are using NVIDIA GPUs, you can install vLLM directly using pip or uv.

uv is an extremely fast Python package and environment manager that can automatically select the appropriate PyTorch backend based on your CUDA driver version.

First, install uv by following the official documentation. Then create a Python environment and install vLLM:

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto

The --torch-backend=auto flag (or UV_TORCH_BACKEND=auto environment variable) automatically detects your CUDA version and selects the appropriate PyTorch index. To specify a particular CUDA version, use --torch-backend=cu126 for CUDA 12.6, for example.
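The naming of these backend tags follows directly from the CUDA version. As an illustration only (torch_backend_tag is a hypothetical helper; uv performs this selection internally when --torch-backend=auto is used):

```python
def torch_backend_tag(cuda_version: str) -> str:
    """Map a CUDA version string to a PyTorch index tag, e.g. "12.6" -> "cu126".

    Hypothetical helper for illustration; uv does this selection itself
    based on the installed CUDA driver when --torch-backend=auto is set.
    """
    major, minor = cuda_version.split(".")[:2]
    return f"cu{major}{minor}"
```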

You can also run vLLM commands without creating a permanent environment using uv run:

uv run --with vllm vllm --help

Using pip

For a standard pip installation:

pip install vllm

Using conda

You can also use conda to manage your Python environment:

conda create -n myenv python=3.12 -y
conda activate myenv
pip install --upgrade uv
uv pip install vllm --torch-backend=auto

AMD ROCm

For AMD GPUs, use uv to install vLLM with the ROCm-specific wheel:

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/

Note: Currently supports Python 3.12, ROCm 7.0, and glibc >= 2.35.

Google TPU

To run vLLM on Google Cloud TPUs, install the vllm-tpu package:

uv pip install vllm-tpu

For detailed instructions including Docker setup, building from source, and troubleshooting, refer to the vLLM on TPU documentation.

Building from Source

For development or to build custom CUDA kernels, you can build vLLM from source.

Prerequisites for Building

Install build dependencies:

pip install --upgrade pip
pip install "cmake>=3.26.1" ninja "packaging>=24.2" "setuptools>=77.0.3" wheel torch

Clone and Build

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .

This will compile the CUDA kernels and install vLLM in editable mode.

Build Configuration

You can customize the build with environment variables:

  • TORCH_CUDA_ARCH_LIST: Specify target GPU architectures (e.g., "7.0 7.5 8.0 8.9 9.0")
  • MAX_JOBS: Control parallel compilation jobs (default: 2)
  • NVCC_THREADS: Number of threads for nvcc (default: 8)

Example:

TORCH_CUDA_ARCH_LIST="8.0 9.0" MAX_JOBS=4 pip install -e .

Hardware Requirements

NVIDIA GPUs

  • Compute Capability: 7.0 or higher (Volta, Turing, Ampere, Ada, Hopper, Blackwell architectures)
  • CUDA: 12.1 or later recommended
  • VRAM: Depends on model size; at least 16GB recommended for 7B models
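The compute-capability requirement can be checked from Python. A minimal sketch, assuming a machine with PyTorch installed (meets_min_compute_capability is a hypothetical helper name; the 7.0 minimum comes from the list above):

```python
def meets_min_compute_capability(major: int, minor: int,
                                 required: tuple[int, int] = (7, 0)) -> bool:
    """Return True if (major, minor) meets vLLM's minimum of 7.0 (Volta).

    On a CUDA machine, obtain the pair with:
        import torch
        major, minor = torch.cuda.get_device_capability()
    """
    return (major, minor) >= required
```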

AMD GPUs

  • ROCm: 7.0 or later
  • Supported GPUs: MI200 series, MI300 series

Google TPUs

  • TPU Version: v5e, v5p, or later
  • Environment: Google Cloud TPU VMs

Verifying Installation

After installation, verify that vLLM is working correctly:

python -c "import vllm; print(vllm.__version__)"

You can also run a simple inference test:

python -c "from vllm import LLM; llm = LLM('facebook/opt-125m'); print(llm.generate('Hello, my name is'))"

Docker Installation

vLLM provides official Docker images for easy deployment. Pull the latest image:

docker pull vllm/vllm-openai:latest

Run the container:

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model facebook/opt-125m
</docker run>
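Once the container is up, it serves an OpenAI-compatible HTTP API on port 8000. A minimal stdlib sketch of a completions request (the URL and payload assume the default port and the model launched above):

```python
import json
from urllib import request

def build_completion_request(prompt: str,
                             model: str = "facebook/opt-125m",
                             base_url: str = "http://localhost:8000") -> request.Request:
    """Build (but do not send) an OpenAI-style /v1/completions request."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 32}
    return request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually send it (requires the running container above):
# with request.urlopen(build_completion_request("Hello, my name is")) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```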

Troubleshooting

CUDA Out of Memory

If you encounter CUDA out-of-memory errors, try:

  • Reducing --gpu-memory-utilization (default: 0.9)
  • Decreasing --max-num-seqs to limit concurrent requests
  • Using a smaller model or quantized version

Import Errors

If you see import errors after installation:

  • Ensure your Python version is 3.10-3.13
  • Verify CUDA is properly installed: nvidia-smi
  • Check that PyTorch can access CUDA: python -c "import torch; print(torch.cuda.is_available())"
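The first and third checks above can be scripted. A sketch, assuming nothing beyond the standard library (python_version_supported is a hypothetical helper encoding the 3.10-3.13 range from the prerequisites):

```python
import sys

def python_version_supported(major: int = sys.version_info.major,
                             minor: int = sys.version_info.minor) -> bool:
    """True if the interpreter falls in vLLM's supported 3.10-3.13 range."""
    return (3, 10) <= (major, minor) <= (3, 13)

def cuda_available() -> bool:
    """The PyTorch CUDA check from the list above, guarded so it also
    runs on machines where PyTorch is not installed."""
    try:
        import torch
        return torch.cuda.is_available()
    except ImportError:
        return False
```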

Build Failures

When building from source:

  • Ensure you have sufficient disk space (at least 10GB free)
  • Install required system packages: apt-get install -y build-essential python3-dev
  • Check that CUDA toolkit is installed and nvcc is in your PATH

What’s Next

  • Quick Start: Run your first offline batch inference and start the OpenAI-compatible server in minutes.
  • Supported Models: Browse the full list of model families vLLM supports, including Llama, Mistral, Qwen, and multimodal models.