Last Updated: 3/19/2026
# Engine Configuration

Engine arguments control the behavior of the vLLM engine. The same arguments are used for both offline inference (via the `LLM` class) and online serving (via `vllm serve`).
## Common Configuration Options

### Model Loading

`--model`: Model name or path.

```bash
vllm serve meta-llama/Llama-2-7b-hf
```

`--dtype`: Data type for model weights:

- `auto`: Automatically detect from model config (default)
- `float16`: Half precision
- `bfloat16`: Brain floating point
- `float32`: Full precision
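The dtype choice directly sets the weight footprint. A quick back-of-envelope, assuming a 7B-parameter model and ignoring activations, KV cache, and framework overhead:

```python
# Rough weight-memory footprint of a 7B-parameter model per dtype.
# Back-of-envelope only: real usage adds KV cache and runtime overhead.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}

def weight_gib(num_params: float, dtype: str) -> float:
    """Approximate weight memory in GiB for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

for dtype in ("float32", "bfloat16", "float16"):
    print(f"{dtype:>8}: {weight_gib(7e9, dtype):.1f} GiB")
# float32 lands around 26 GiB; the two 16-bit dtypes around 13 GiB.
```

This is why 7B models at `float32` overflow a 24 GiB GPU while `bfloat16` fits comfortably.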
```bash
vllm serve meta-llama/Llama-2-7b-hf --dtype bfloat16
```

`--max-model-len`: Maximum sequence length.

```bash
vllm serve meta-llama/Llama-2-7b-hf --max-model-len 4096
```

`--trust-remote-code`: Allow executing remote code from model repositories.

```bash
vllm serve Qwen/Qwen-7B --trust-remote-code
```

### Memory Management

`--gpu-memory-utilization`: Fraction of GPU memory to use for the model (default: 0.9).

```bash
vllm serve meta-llama/Llama-2-7b-hf --gpu-memory-utilization 0.85
```

Lower values leave more memory for other processes but may reduce throughput.
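Throughput drops at lower utilization because less of the reservation is left over for the KV cache. A simplified sketch of the budget (the fixed `overhead_gib` standing in for activations and runtime overhead is an assumption for illustration, not a vLLM constant):

```python
# How --gpu-memory-utilization roughly partitions GPU memory: vLLM reserves
# utilization * total memory, loads the weights into it, and most of the
# remainder of that reservation becomes KV-cache space.
def kv_cache_budget_gib(total_gib: float, utilization: float,
                        weights_gib: float, overhead_gib: float = 1.0) -> float:
    """Rough KV-cache budget in GiB; clamped at zero if weights don't fit."""
    reserved = total_gib * utilization
    return max(0.0, round(reserved - weights_gib - overhead_gib, 2))

# A ~13 GiB bfloat16 7B model on an 80 GiB GPU at 85% utilization:
print(kv_cache_budget_gib(80, 0.85, 13.0))  # 54.0
```

Dropping utilization from 0.85 to 0.5 on the same GPU shrinks the cache budget from roughly 54 GiB to 26 GiB, which means fewer concurrent sequences.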
`--max-num-seqs`: Maximum number of sequences to process in a single batch (default: 256).

```bash
vllm serve meta-llama/Llama-2-7b-hf --max-num-seqs 128
```

`--max-num-batched-tokens`: Maximum number of tokens to process in a single batch (default: varies by hardware).
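The two limits interact: a batch closes when it hits either the sequence cap or the token cap. A toy greedy model of that interaction (a simplification; vLLM's real scheduler also distinguishes prefill from decode and handles preemption):

```python
# Toy sketch: admit pending sequences to a batch until either
# --max-num-seqs or --max-num-batched-tokens would be exceeded.
def pack_batch(pending: list[int], max_seqs: int, max_tokens: int) -> list[int]:
    """pending: per-sequence token counts -> token counts admitted this step."""
    batch: list[int] = []
    tokens = 0
    for seq_tokens in pending:
        if len(batch) >= max_seqs or tokens + seq_tokens > max_tokens:
            break
        batch.append(seq_tokens)
        tokens += seq_tokens
    return batch

print(pack_batch([3000, 2500, 2000, 1500], max_seqs=128, max_tokens=8192))
# Admits [3000, 2500, 2000]; adding 1500 would exceed the 8192-token cap.
```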
```bash
vllm serve meta-llama/Llama-2-7b-hf --max-num-batched-tokens 8192
```

### Serving Options
`--host`: Server host address (default: `localhost`).

```bash
vllm serve meta-llama/Llama-2-7b-hf --host 0.0.0.0
```

`--port`: Server port (default: 8000).

```bash
vllm serve meta-llama/Llama-2-7b-hf --port 8080
```

`--api-key`: API key for authentication (can be specified multiple times).

```bash
vllm serve meta-llama/Llama-2-7b-hf --api-key secret-key-1 --api-key secret-key-2
```

## Distributed Inference
### Tensor Parallelism

Split the model across multiple GPUs:

```bash
vllm serve meta-llama/Llama-2-70b-hf --tensor-parallel-size 4
```

This shards each layer's weight matrices across 4 GPUs, so all GPUs cooperate on every layer.
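The core idea can be sketched with a plain matrix-vector product: each rank holds a slice of the weight matrix, computes its partial output, and the slices are concatenated (the all-gather step in a real implementation). This is a pure-Python illustration, not vLLM's implementation:

```python
# Tensor-parallel sketch: shard the output rows of a weight matrix across
# `tp` ranks, compute per-rank partials, then concatenate ("all-gather").
def matvec(w: list[list[float]], x: list[float]) -> list[float]:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def sharded_matvec(w: list[list[float]], x: list[float], tp: int) -> list[float]:
    shard = len(w) // tp  # rows per rank (assumes len(w) divisible by tp)
    partials = [matvec(w[r * shard:(r + 1) * shard], x) for r in range(tp)]
    return [y for part in partials for y in part]

w = [[1, 0], [0, 1], [2, 0], [0, 2]]  # 4x2 weight, split across 2 "GPUs"
x = [3.0, 4.0]
assert sharded_matvec(w, x, tp=2) == matvec(w, x)  # matches single-GPU result
```

Each rank only stores `1/tp` of the weights, which is what lets a 70B model fit on GPUs that could not hold it individually.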
### Pipeline Parallelism

Split the model into sequential stages across multiple GPUs:

```bash
vllm serve meta-llama/Llama-2-70b-hf --pipeline-parallel-size 2
```

### Data Parallelism
Run multiple replicas of the model for higher throughput:

```bash
vllm serve meta-llama/Llama-2-7b-hf --data-parallel-size 2
```

## Performance Tuning
### Chunked Prefill

Enable chunked prefill to process long prompts in chunks:

```bash
vllm serve meta-llama/Llama-2-7b-hf --enable-chunked-prefill
```

This is enabled by default for most models.
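The benefit is scheduling granularity: instead of one long prefill step monopolizing the GPU, the prompt is split into fixed-size pieces, and decode steps from other requests can be interleaved between them. A minimal sketch (the chunk size here is an arbitrary assumption for illustration):

```python
# Chunked-prefill sketch: split a long prompt into fixed-size prefill steps.
# Illustrative scheduling model only, not vLLM internals.
def prefill_chunks(prompt_len: int, chunk: int = 512) -> list[int]:
    """Token count processed at each prefill step for a prompt_len-token prompt."""
    return [min(chunk, prompt_len - start) for start in range(0, prompt_len, chunk)]

print(prefill_chunks(1300))  # [512, 512, 276]
```

A 1300-token prompt becomes three short steps rather than one long one, which keeps inter-token latency for concurrently decoding requests low.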
### Prefix Caching

Enable prefix caching to reuse KV cache for common prompt prefixes:

```bash
vllm serve meta-llama/Llama-2-7b-hf --enable-prefix-caching
```

This can significantly improve throughput for requests with shared prefixes.
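The mechanism can be pictured as keying KV-cache blocks by the token prefix they cover: two prompts that share a system prompt hit the same block keys and skip recomputation. This is a simplified model of block reuse (tiny block size chosen for the example), not vLLM's paged KV cache:

```python
# Prefix-caching sketch: KV blocks are keyed by the full token prefix they
# cover, so shared prefixes map to already-cached blocks.
BLOCK = 4  # tokens per KV block; real block sizes are larger

def cached_blocks(tokens: list[str], cache: dict) -> int:
    """Populate cache with this prompt's blocks; return count of cache hits."""
    hits = 0
    for start in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        key = tuple(tokens[:start + BLOCK])  # block identity includes its prefix
        if key in cache:
            hits += 1
        else:
            cache[key] = object()  # stand-in for a stored KV block
    return hits

cache: dict = {}
shared = ["You", "are", "a", "helpful", "assistant", ".", "Hi", "!"]
cached_blocks(shared, cache)                        # first request: cold cache
print(cached_blocks(shared[:6] + ["Bye", "!"], cache))  # prints 1: first block reused
```

Only fully matching leading blocks are reused; as soon as the prompts diverge, later blocks get new keys and are recomputed.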
### Quantization

Use quantized models to reduce memory usage:

```bash
vllm serve TheBloke/Llama-2-7B-AWQ --quantization awq
```

Supported quantization methods:

- `awq`: Activation-aware Weight Quantization
- `gptq`: GPTQ quantization
- `fp8`: 8-bit floating point
- `bitsandbytes`: BitsAndBytes quantization
## Offline Inference Configuration

When using the `LLM` class for offline inference, pass engine arguments as parameters:

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    dtype="bfloat16",
    gpu_memory_utilization=0.85,
    max_model_len=4096,
    tensor_parallel_size=2,
)
```

## Advanced Options
### KV Cache Configuration

`--kv-cache-dtype`: Data type for the KV cache:

- `auto`: Match model dtype (default)
- `fp8`: 8-bit floating point for reduced memory

```bash
vllm serve meta-llama/Llama-2-7b-hf --kv-cache-dtype fp8
```
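Halving the element size doubles how many tokens fit in the same cache budget. The per-token cost is `2 (K and V) x layers x kv_heads x head_dim x bytes_per_element`; the shapes below assume a Llama-2-7B-like architecture for illustration:

```python
# Per-token KV-cache cost: keys and values for every layer and head.
# Llama-2-7B-like shapes assumed (32 layers, 32 KV heads, head dim 128).
def kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

fp16 = kv_bytes_per_token(bytes_per_elem=2)  # 524288 bytes = 512 KiB per token
fp8 = kv_bytes_per_token(bytes_per_elem=1)   # 262144 bytes = 256 KiB per token
print(fp16 // fp8)  # prints 2: twice the context fits in the same budget
```

The trade-off is a small accuracy cost from quantizing the cache, which is often acceptable in exchange for longer contexts or more concurrent requests.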
### Scheduling Policy

`--scheduling-policy`: Request scheduling policy:

- `fcfs`: First-come, first-served (default)
- `priority`: Priority-based scheduling

```bash
vllm serve meta-llama/Llama-2-7b-hf --scheduling-policy priority
```
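Under priority scheduling, requests carry a priority value and lower values are served first, with arrival order breaking ties. A toy queue model of that ordering (an illustration of the policy, not vLLM's scheduler):

```python
import heapq

# Priority-scheduling sketch: lower priority value is served first;
# arrival order (fcfs) breaks ties between equal priorities.
def serve_order(requests: list[tuple[int, str]]) -> list[str]:
    """requests: (priority, request_id) in arrival order -> ids in serve order."""
    heap = [(prio, arrival, rid) for arrival, (prio, rid) in enumerate(requests)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

print(serve_order([(1, "a"), (0, "b"), (1, "c"), (0, "d")]))
# ['b', 'd', 'a', 'c']: both priority-0 requests jump ahead, fcfs within a tier
```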
### Attention Backend

`--attention-backend`: Attention implementation to use:

- `FLASH_ATTN`: FlashAttention (default on NVIDIA GPUs)
- `FLASHINFER`: FlashInfer, for better performance on some models

```bash
vllm serve meta-llama/Llama-2-7b-hf --attention-backend FLASHINFER
```

## Environment Variables
Some configuration can be set via environment variables:

`VLLM_USE_MODELSCOPE`: Use ModelScope instead of Hugging Face:

```bash
export VLLM_USE_MODELSCOPE=True
vllm serve qwen/Qwen-7B
```

`VLLM_API_KEY`: API key for authentication:

```bash
export VLLM_API_KEY=your-secret-key
vllm serve meta-llama/Llama-2-7b-hf
```

`VLLM_LOGGING_LEVEL`: Set logging level:

```bash
export VLLM_LOGGING_LEVEL=DEBUG
vllm serve meta-llama/Llama-2-7b-hf
```

## Configuration Examples
### High Throughput Setup

Optimize for maximum throughput:

```bash
vllm serve meta-llama/Llama-2-7b-hf \
  --max-num-seqs 512 \
  --max-num-batched-tokens 16384 \
  --enable-prefix-caching \
  --enable-chunked-prefill
```

### Low Latency Setup
Optimize for minimum latency:

```bash
vllm serve meta-llama/Llama-2-7b-hf \
  --max-num-seqs 64 \
  --max-num-batched-tokens 2048 \
  --gpu-memory-utilization 0.8
```

### Memory-Constrained Setup
Reduce memory usage:

```bash
vllm serve meta-llama/Llama-2-7b-hf \
  --gpu-memory-utilization 0.7 \
  --max-num-seqs 128 \
  --kv-cache-dtype fp8
```

## What's Next
- Supported Models: Browse the model families vLLM supports and how to check if your model is compatible.
- LoRA Adapters: Serve multiple fine-tuned adapters on top of a base model with minimal overhead.
- OpenAI-Compatible Server: Full reference for the HTTP server’s supported APIs and options.