
Last Updated: 3/19/2026


Engine Configuration

Engine arguments control the behavior of the vLLM engine. These arguments are used both for offline inference (via the LLM class) and online serving (via vllm serve).

Common Configuration Options

Model Loading

--model: Model name or path

vllm serve meta-llama/Llama-2-7b-hf

--dtype: Data type for model weights

  • auto: Automatically detect from model config (default)
  • float16: Half precision
  • bfloat16: Brain floating point
  • float32: Full precision

vllm serve meta-llama/Llama-2-7b-hf --dtype bfloat16

--max-model-len: Maximum sequence length

vllm serve meta-llama/Llama-2-7b-hf --max-model-len 4096

--trust-remote-code: Allow executing remote code from model repositories

vllm serve Qwen/Qwen-7B --trust-remote-code

Memory Management

--gpu-memory-utilization: Fraction of GPU memory to use for the model (default: 0.9)

vllm serve meta-llama/Llama-2-7b-hf --gpu-memory-utilization 0.85

Lower values leave more memory for other processes but may reduce throughput.
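As a rough illustration of the trade-off (the GPU size and model size below are assumptions, not measurements), the utilization fraction caps the total of weights plus KV cache, so whatever the weights do not use is left for the cache:

```python
# Back-of-the-envelope GPU memory budget (illustrative numbers only).
GPU_MEMORY_GB = 80   # assume an 80 GB accelerator
UTILIZATION = 0.85   # --gpu-memory-utilization 0.85

# A 7B-parameter model in fp16/bf16 needs ~2 bytes per parameter.
weights_gb = 7e9 * 2 / 1e9                # ~14 GB of weights

budget_gb = GPU_MEMORY_GB * UTILIZATION   # memory vLLM may claim
kv_cache_gb = budget_gb - weights_gb      # what remains for the KV cache

print(f"budget: {budget_gb:.1f} GB, KV cache: {kv_cache_gb:.1f} GB")
```

A smaller KV cache budget means fewer concurrent sequences can be held in flight, which is why lowering the fraction tends to reduce throughput.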

--max-num-seqs: Maximum number of sequences to process in a single batch (default: 256)

vllm serve meta-llama/Llama-2-7b-hf --max-num-seqs 128

--max-num-batched-tokens: Maximum number of tokens to process in a single batch (default: varies by hardware)

vllm serve meta-llama/Llama-2-7b-hf --max-num-batched-tokens 8192
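To get a feel for what this budget means for long prompts, a quick illustrative calculation (the prompt length is an assumption): each scheduler step processes at most this many tokens, so prefilling a long prompt takes several steps.

```python
import math

# Illustrative only: how many scheduler steps it takes to prefill
# one long prompt under a --max-num-batched-tokens budget.
prompt_tokens = 30_000
max_num_batched_tokens = 8_192

prefill_steps = math.ceil(prompt_tokens / max_num_batched_tokens)
print(prefill_steps)  # 4
```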

Serving Options

--host: Server host address (default: localhost)

vllm serve meta-llama/Llama-2-7b-hf --host 0.0.0.0

--port: Server port (default: 8000)

vllm serve meta-llama/Llama-2-7b-hf --port 8080

--api-key: API key for authentication (can be specified multiple times)

vllm serve meta-llama/Llama-2-7b-hf --api-key secret-key-1 --api-key secret-key-2
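Clients present one of the configured keys as a bearer token. A minimal sketch of what an OpenAI-compatible request to the server would contain (the key and model name are placeholders, and no request is actually sent here):

```python
import json

# Hypothetical values for illustration; substitute your own deployment's.
api_key = "secret-key-1"

headers = {
    "Authorization": f"Bearer {api_key}",  # how an --api-key value is presented
    "Content-Type": "application/json",
}
payload = {
    "model": "meta-llama/Llama-2-7b-hf",
    "messages": [{"role": "user", "content": "Hello!"}],
}

# This body would be POSTed to the server's /v1/chat/completions endpoint.
body = json.dumps(payload)
print(headers["Authorization"])  # Bearer secret-key-1
```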

Distributed Inference

Tensor Parallelism

Split the model across multiple GPUs:

vllm serve meta-llama/Llama-2-70b-hf --tensor-parallel-size 4

This shards each layer’s weights across 4 GPUs, which all cooperate on every forward pass.
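A rough calculation of why a 70B model needs this (the byte counts are the usual fp16/bf16 assumption, ignoring activations and KV cache):

```python
# Illustrative: per-GPU weight memory under tensor parallelism.
params = 70e9
bytes_per_param = 2          # fp16/bf16 weights
tensor_parallel_size = 4

total_weights_gb = params * bytes_per_param / 1e9             # ~140 GB total
per_gpu_weights_gb = total_weights_gb / tensor_parallel_size  # ~35 GB per GPU
print(per_gpu_weights_gb)  # 35.0
```

At ~140 GB of weights alone, the model cannot fit on a single common accelerator, but a 4-way split brings each shard within reach of an 80 GB GPU.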

Pipeline Parallelism

Split the model into stages across multiple GPUs:

vllm serve meta-llama/Llama-2-70b-hf --pipeline-parallel-size 2

Data Parallelism

Run multiple replicas of the model for higher throughput:

vllm serve meta-llama/Llama-2-7b-hf --data-parallel-size 2

Performance Tuning

Chunked Prefill

Enable chunked prefill to process long prompts in chunks:

vllm serve meta-llama/Llama-2-7b-hf --enable-chunked-prefill

This is enabled by default for most models.

Prefix Caching

Enable prefix caching to reuse KV cache for common prompt prefixes:

vllm serve meta-llama/Llama-2-7b-hf --enable-prefix-caching

This can significantly improve throughput for requests with shared prefixes.

Quantization

Use quantized models to reduce memory usage:

vllm serve TheBloke/Llama-2-7B-AWQ --quantization awq

Supported quantization methods:

  • awq: Activation-aware Weight Quantization
  • gptq: GPTQ quantization
  • fp8: 8-bit floating point
  • bitsandbytes: BitsAndBytes quantization
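To see the scale of the savings, a back-of-the-envelope comparison for weight memory (illustrative numbers; real quantized checkpoints carry some extra overhead for scales and zero points):

```python
# Rough weight-memory comparison for a 7B-parameter model.
params = 7e9

fp16_gb = params * 2 / 1e9    # 16-bit weights: ~14 GB
int4_gb = params * 0.5 / 1e9  # 4-bit AWQ/GPTQ weights: ~3.5 GB
print(fp16_gb, int4_gb)  # 14.0 3.5
```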

Offline Inference Configuration

When using the LLM class for offline inference, pass engine arguments as parameters:

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    dtype="bfloat16",
    gpu_memory_utilization=0.85,
    max_model_len=4096,
    tensor_parallel_size=2,
)

Advanced Options

KV Cache Configuration

--kv-cache-dtype: Data type for KV cache

  • auto: Match model dtype (default)
  • fp8: 8-bit floating point for reduced memory

vllm serve meta-llama/Llama-2-7b-hf --kv-cache-dtype fp8

Scheduling Policy

--scheduling-policy: Request scheduling policy

  • fcfs: First-come, first-served (default)
  • priority: Priority-based scheduling

vllm serve meta-llama/Llama-2-7b-hf --scheduling-policy priority

Attention Backend

--attention-backend: Attention implementation to use

  • FLASH_ATTN: FlashAttention (default on NVIDIA GPUs)
  • FLASHINFER: FlashInfer for better performance on some models

vllm serve meta-llama/Llama-2-7b-hf --attention-backend FLASHINFER

Environment Variables

Some configuration can be set via environment variables:

VLLM_USE_MODELSCOPE: Use ModelScope instead of Hugging Face

export VLLM_USE_MODELSCOPE=True
vllm serve qwen/Qwen-7B

VLLM_API_KEY: API key for authentication

export VLLM_API_KEY=your-secret-key
vllm serve meta-llama/Llama-2-7b-hf

VLLM_LOGGING_LEVEL: Set logging level

export VLLM_LOGGING_LEVEL=DEBUG
vllm serve meta-llama/Llama-2-7b-hf

Configuration Examples

High Throughput Setup

Optimize for maximum throughput:

vllm serve meta-llama/Llama-2-7b-hf \
  --max-num-seqs 512 \
  --max-num-batched-tokens 16384 \
  --enable-prefix-caching \
  --enable-chunked-prefill

Low Latency Setup

Optimize for minimum latency:

vllm serve meta-llama/Llama-2-7b-hf \
  --max-num-seqs 64 \
  --max-num-batched-tokens 2048 \
  --gpu-memory-utilization 0.8

Memory-Constrained Setup

Reduce memory usage:

vllm serve meta-llama/Llama-2-7b-hf \
  --gpu-memory-utilization 0.7 \
  --max-num-seqs 128 \
  --kv-cache-dtype fp8

What’s Next

  • Supported Models: Browse the model families vLLM supports and how to check if your model is compatible.
  • LoRA Adapters: Serve multiple fine-tuned adapters on top of a base model with minimal overhead.
  • OpenAI-Compatible Server: Full reference for the HTTP server’s supported APIs and options.