
Last Updated: 3/19/2026


Engine Configuration

Engine arguments control the behavior of the vLLM engine. These arguments are used both for offline inference (via the LLM class) and online serving (via vllm serve).

Common Configuration Options

Model Loading

--model: Model name or path

vllm serve meta-llama/Llama-2-7b-hf

--dtype: Data type for model weights

  • auto: Automatically detect from model config (default)
  • float16: Half precision
  • bfloat16: Brain floating point
  • float32: Full precision

vllm serve meta-llama/Llama-2-7b-hf --dtype bfloat16

--max-model-len: Maximum sequence length

vllm serve meta-llama/Llama-2-7b-hf --max-model-len 4096

--trust-remote-code: Allow executing remote code from model repositories

vllm serve Qwen/Qwen-7B --trust-remote-code

Memory Management

--gpu-memory-utilization: Fraction of GPU memory to use for the model (default: 0.9)

vllm serve meta-llama/Llama-2-7b-hf --gpu-memory-utilization 0.85

Lower values leave more memory for other processes but may reduce throughput.
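As a rough illustration of the trade-off (the GPU size and model size below are assumptions, not measurements), the utilization fraction caps the total of weights plus KV cache, so whatever the weights do not use is left for the cache:

```python
# Back-of-the-envelope GPU memory budget (illustrative numbers only).
GPU_MEMORY_GB = 80   # assume an 80 GB accelerator
UTILIZATION = 0.85   # --gpu-memory-utilization 0.85

# A 7B-parameter model in fp16/bf16 needs ~2 bytes per parameter.
weights_gb = 7e9 * 2 / 1e9                # ~14 GB of weights

budget_gb = GPU_MEMORY_GB * UTILIZATION   # memory vLLM may claim
kv_cache_gb = budget_gb - weights_gb      # what remains for the KV cache

print(f"budget: {budget_gb:.1f} GB, KV cache: {kv_cache_gb:.1f} GB")
```

A smaller KV cache budget means fewer concurrent sequences can be held in flight, which is why lowering the fraction tends to reduce throughput.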

--max-num-seqs: Maximum number of sequences to process in a single batch (default: 256)

vllm serve meta-llama/Llama-2-7b-hf --max-num-seqs 128

--max-num-batched-tokens: Maximum number of tokens to process in a single batch (default: varies by hardware)

vllm serve meta-llama/Llama-2-7b-hf --max-num-batched-tokens 8192
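To get a feel for what this budget means for long prompts, a quick illustrative calculation (the prompt length is an assumption): each scheduler step processes at most this many tokens, so prefilling a long prompt takes several steps.

```python
import math

# Illustrative only: how many scheduler steps it takes to prefill
# one long prompt under a --max-num-batched-tokens budget.
prompt_tokens = 30_000
max_num_batched_tokens = 8_192

prefill_steps = math.ceil(prompt_tokens / max_num_batched_tokens)
print(prefill_steps)  # 4
```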

Serving Options

--host: Server host address (default: localhost)

vllm serve meta-llama/Llama-2-7b-hf --host 0.0.0.0

--port: Server port (default: 8000)

vllm serve meta-llama/Llama-2-7b-hf --port 8080

--api-key: API key for authentication (can be specified multiple times)

vllm serve meta-llama/Llama-2-7b-hf --api-key secret-key-1 --api-key secret-key-2
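Clients present one of the configured keys as a bearer token. A minimal sketch of what an OpenAI-compatible request to the server would contain (the key and model name are placeholders, and no request is actually sent here):

```python
import json

# Hypothetical values for illustration; substitute your own deployment's.
api_key = "secret-key-1"

headers = {
    "Authorization": f"Bearer {api_key}",  # how an --api-key value is presented
    "Content-Type": "application/json",
}
payload = {
    "model": "meta-llama/Llama-2-7b-hf",
    "messages": [{"role": "user", "content": "Hello!"}],
}

# This body would be POSTed to the server's /v1/chat/completions endpoint.
body = json.dumps(payload)
print(headers["Authorization"])  # Bearer secret-key-1
```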

Distributed Inference

Tensor Parallelism

Split the model across multiple GPUs:

vllm serve meta-llama/Llama-2-70b-hf --tensor-parallel-size 4

This shards each layer’s weights across 4 GPUs, which all cooperate on every forward pass.
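A rough calculation of why a 70B model needs this (the byte counts are the usual fp16/bf16 assumption, ignoring activations and KV cache):

```python
# Illustrative: per-GPU weight memory under tensor parallelism.
params = 70e9
bytes_per_param = 2          # fp16/bf16 weights
tensor_parallel_size = 4

total_weights_gb = params * bytes_per_param / 1e9             # ~140 GB total
per_gpu_weights_gb = total_weights_gb / tensor_parallel_size  # ~35 GB per GPU
print(per_gpu_weights_gb)  # 35.0
```

At ~140 GB of weights alone, the model cannot fit on a single common accelerator, but a 4-way split brings each shard within reach of an 80 GB GPU.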

Pipeline Parallelism

Split the model into stages across multiple GPUs:

vllm serve meta-llama/Llama-2-70b-hf --pipeline-parallel-size 2

Data Parallelism

Run multiple replicas of the model for higher throughput:

vllm serve meta-llama/Llama-2-7b-hf --data-parallel-size 2

Performance Tuning

Chunked Prefill

Enable chunked prefill to process long prompts in chunks:

vllm serve meta-llama/Llama-2-7b-hf --enable-chunked-prefill

This is enabled by default for most models.

Prefix Caching

Enable prefix caching to reuse KV cache for common prompt prefixes:

vllm serve meta-llama/Llama-2-7b-hf --enable-prefix-caching

This can significantly improve throughput for requests with shared prefixes.

Quantization

Use quantized models to reduce memory usage:

vllm serve TheBloke/Llama-2-7B-AWQ --quantization awq

Supported quantization methods:

  • awq: Activation-aware Weight Quantization
  • gptq: GPTQ quantization
  • fp8: 8-bit floating point
  • bitsandbytes: BitsAndBytes quantization
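To see the scale of the savings, a back-of-the-envelope comparison for weight memory (illustrative numbers; real quantized checkpoints carry some extra overhead for scales and zero points):

```python
# Rough weight-memory comparison for a 7B-parameter model.
params = 7e9

fp16_gb = params * 2 / 1e9    # 16-bit weights: ~14 GB
int4_gb = params * 0.5 / 1e9  # 4-bit AWQ/GPTQ weights: ~3.5 GB
print(fp16_gb, int4_gb)  # 14.0 3.5
```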

Offline Inference Configuration

When using the LLM class for offline inference, pass engine arguments as parameters:

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    dtype="bfloat16",
    gpu_memory_utilization=0.85,
    max_model_len=4096,
    tensor_parallel_size=2,
)

Advanced Options

KV Cache Configuration

--kv-cache-dtype: Data type for KV cache

  • auto: Match model dtype (default)
  • fp8: 8-bit floating point for reduced memory

vllm serve meta-llama/Llama-2-7b-hf --kv-cache-dtype fp8

Scheduling Policy

--scheduling-policy: Request scheduling policy

  • fcfs: First-come, first-served (default)
  • priority: Priority-based scheduling

vllm serve meta-llama/Llama-2-7b-hf --scheduling-policy priority

Attention Backend

--attention-backend: Attention implementation to use

  • FLASH_ATTN: FlashAttention (default on NVIDIA GPUs)
  • FLASHINFER: FlashInfer for better performance on some models

vllm serve meta-llama/Llama-2-7b-hf --attention-backend FLASHINFER

Environment Variables

Some configuration can be set via environment variables:

VLLM_USE_MODELSCOPE: Use ModelScope instead of Hugging Face

export VLLM_USE_MODELSCOPE=True
vllm serve qwen/Qwen-7B

VLLM_API_KEY: API key for authentication

export VLLM_API_KEY=your-secret-key
vllm serve meta-llama/Llama-2-7b-hf

VLLM_LOGGING_LEVEL: Set logging level

export VLLM_LOGGING_LEVEL=DEBUG
vllm serve meta-llama/Llama-2-7b-hf

Configuration Examples

High Throughput Setup

Optimize for maximum throughput:

vllm serve meta-llama/Llama-2-7b-hf \
  --max-num-seqs 512 \
  --max-num-batched-tokens 16384 \
  --enable-prefix-caching \
  --enable-chunked-prefill

Low Latency Setup

Optimize for minimum latency:

vllm serve meta-llama/Llama-2-7b-hf \
  --max-num-seqs 64 \
  --max-num-batched-tokens 2048 \
  --gpu-memory-utilization 0.8

Memory-Constrained Setup

Reduce memory usage:

vllm serve meta-llama/Llama-2-7b-hf \
  --gpu-memory-utilization 0.7 \
  --max-num-seqs 128 \
  --kv-cache-dtype fp8

What’s Next

  • Supported Models: Browse the model families vLLM supports and how to check if your model is compatible.
  • LoRA Adapters: Serve multiple fine-tuned adapters on top of a base model with minimal overhead.
  • OpenAI-Compatible Server: Full reference for the HTTP server’s supported APIs and options.