Last Updated: 3/19/2026
Supported Models
vLLM supports a wide range of generative and embedding models across various model families and architectures.
Checking Model Support
The easiest way to check if your model is supported is to try loading it:
```python
from vllm import LLM

# For generative models
llm = LLM(model="meta-llama/Llama-2-7b-hf")
output = llm.generate("Hello, my name is")
print(output)

# For embedding models
llm = LLM(model="BAAI/bge-large-en-v1.5", task="embed")
output = llm.encode("Hello, world!")
print(output)
```

If vLLM successfully returns output, your model is supported.
Model Families
Generative Models
vLLM natively supports the following popular model families for text generation:
Llama Family:
- Llama 3.1, Llama 3, Llama 2, LLaMA
- Examples:
meta-llama/Meta-Llama-3.1-70B-Instruct, meta-llama/Llama-2-7b-hf
Qwen Family:
- Qwen2, Qwen2.5, Qwen3
- Examples:
Qwen/Qwen2-7B-Instruct, Qwen/Qwen2.5-7B-Instruct
Mistral Family:
- Mistral, Mixtral
- Examples:
mistralai/Mistral-7B-Instruct-v0.1, mistralai/Mixtral-8x7B-Instruct-v0.1
Gemma Family:
- Gemma, Gemma 2
- Examples:
google/gemma-2-9b-it, google/gemma-7b-it
DeepSeek Family:
- DeepSeek, DeepSeek-V2, DeepSeek-V3
- Examples:
deepseek-ai/DeepSeek-V3, deepseek-ai/DeepSeek-R1
Other Popular Models:
- GPT-2, GPT-J, GPT-NeoX
- OPT, BLOOM
- Falcon, MPT
- Phi, Phi-3
- Yi, Baichuan
- InternLM, ChatGLM
For a complete list of supported architectures, see the vLLM documentation.
Embedding Models
vLLM supports embedding models for semantic search and retrieval:
Popular Embedding Models:
- BAAI/bge-large-en-v1.5
- BAAI/bge-base-en-v1.5
- intfloat/e5-mistral-7b-instruct
- Snowflake/snowflake-arctic-embed-m-v1.5
To use an embedding model, specify task="embed":
```python
from vllm import LLM

llm = LLM(model="BAAI/bge-large-en-v1.5", task="embed")
embeddings = llm.encode(["Hello, world!", "How are you?"])
```

Multimodal Models
vLLM supports vision-language models that can process both text and images:
Popular Multimodal Models:
- LLaVA: llava-hf/llava-1.5-7b-hf
- Qwen2-VL: Qwen/Qwen2-VL-7B-Instruct
- InternVL: OpenGVLab/InternVL2-8B
- Phi-3-Vision: microsoft/Phi-3-vision-128k-instruct
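For these models, image inputs are passed alongside the text prompt. As a sketch of the request shape that vLLM's generate() accepts for vision-language models (the `<image>` placeholder and chat format below assume LLaVA-1.5's template; other model families use different templates):

```python
def build_llava_request(question, image):
    """Assemble a vLLM multimodal request for a LLaVA-1.5-style model.

    In practice `image` would be a PIL.Image; passing the resulting dict
    to llm.generate() runs vision-language inference on it.
    """
    return {
        "prompt": f"USER: <image>\n{question} ASSISTANT:",
        "multi_modal_data": {"image": image},
    }

request = build_llava_request("What is in this picture?", image=None)  # None as a stand-in
print(request["prompt"])
```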
Loading Models
From Hugging Face Hub
By default, vLLM downloads models from Hugging Face:
```python
from vllm import LLM

llm = LLM(model="meta-llama/Llama-2-7b-hf")
```

Models are automatically cached in ~/.cache/huggingface/hub/.
From ModelScope
To use models from ModelScope instead:
```shell
export VLLM_USE_MODELSCOPE=True
```

```python
from vllm import LLM

llm = LLM(model="qwen/Qwen-7B", trust_remote_code=True)
```

From Local Path
Load a model from a local directory:
```python
from vllm import LLM

llm = LLM(model="/path/to/model")
```

Model Architectures
Decoder-Only Models
Most generative models use a decoder-only architecture (like GPT):
- Llama, Mistral, Qwen, Gemma
- Optimized for text generation
- Support for various decoding strategies
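In vLLM, the knobs behind these decoding strategies (temperature, top-p, and so on) are set via SamplingParams. As a toy, pure-Python illustration of what temperature does to the next-token distribution (the logit values here are made up):

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # shift by the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up next-token logits
print(softmax(logits, temperature=1.0))  # moderately spread distribution
print(softmax(logits, temperature=0.1))  # near-greedy: mass piles onto the argmax
```

At temperature near zero, sampling degenerates into greedy decoding; higher temperatures flatten the distribution and increase output diversity.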
Encoder-Decoder Models
Some models use an encoder-decoder architecture:
- BART, T5
- Supported via plugins
- Useful for translation and summarization
Mixture-of-Experts (MoE)
vLLM supports MoE models for efficient scaling:
- Mixtral: mistralai/Mixtral-8x7B-Instruct-v0.1
- DeepSeek-V2: deepseek-ai/DeepSeek-V2
- Qwen2-MoE: Qwen/Qwen1.5-MoE-A2.7B
MoE models activate only a subset of parameters per token, enabling larger models with lower compute requirements.
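The saving can be sketched with back-of-the-envelope, Mixtral-8x7B-style numbers; the parameter counts below are illustrative assumptions, not exact checkpoint sizes:

```python
# Rough MoE active-parameter count (illustrative numbers).
num_experts = 8          # experts per MoE layer
top_k = 2                # experts routed per token
expert_params = 45.0e9   # assumed: parameters across all experts combined
shared_params = 1.7e9    # assumed: attention, embeddings, router, norms

total = shared_params + expert_params
active = shared_params + expert_params * top_k / num_experts  # only top-k experts run
print(f"total: {total / 1e9:.1f}B, active per token: {active / 1e9:.2f}B")
```

With top-2 routing over 8 experts, only a quarter of the expert weights participate in any one forward pass, so per-token compute tracks the much smaller active count.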
Model Compatibility
Quantization
vLLM supports various quantization methods to reduce memory usage:
Supported Quantization Methods:
- AWQ: TheBloke/Llama-2-7B-AWQ
- GPTQ: TheBloke/Llama-2-7B-GPTQ
- FP8: Native FP8 models
- BitsAndBytes: 4-bit and 8-bit quantization
Example with quantized model:
```python
from vllm import LLM

llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")
```

Trust Remote Code
Some models require executing remote code from the model repository:
```python
from vllm import LLM

llm = LLM(model="Qwen/Qwen-7B", trust_remote_code=True)
```

Use this flag only with models from trusted sources.
Checking Model Architecture
To determine if a model is supported, check its config.json file on Hugging Face. Look for the "architectures" field:
```json
{
  "architectures": ["LlamaForCausalLM"],
  ...
}
```

If the architecture is listed in vLLM’s supported models, it should work.
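This check is easy to script against a downloaded config.json. A small helper, where the architecture set is an abbreviated, illustrative subset (consult the vLLM documentation for the authoritative list):

```python
import json

# Illustrative subset of architectures vLLM supports natively.
KNOWN_ARCHITECTURES = {
    "LlamaForCausalLM",
    "Qwen2ForCausalLM",
    "MistralForCausalLM",
    "GemmaForCausalLM",
}

def is_supported(config):
    """Return True if any entry in config['architectures'] is known to vLLM."""
    return any(arch in KNOWN_ARCHITECTURES for arch in config.get("architectures", []))

config = json.loads('{"architectures": ["LlamaForCausalLM"]}')
print(is_supported(config))  # True
```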
Transformers Backend
vLLM also supports models through the Transformers modeling backend. This allows running models that aren’t natively implemented in vLLM:
```python
from vllm import LLM

llm = LLM(model="your-model", model_impl="transformers")
```

The Transformers backend provides:
- Support for encoder-only, decoder-only, and MoE architectures
- Compatibility with most Transformers features
- Performance within 5% of native vLLM implementations
Model Support Policy
vLLM follows a community-driven approach to model support:
- Community Contributions: We welcome PRs for new models
- Best-Effort Consistency: We aim for functional models with sensible outputs
- Issue Resolution: Report bugs via GitHub issues
- Monitoring: Track changes in the vllm/model_executor/models/ directory
Models are tested at different levels:
- Strict Consistency: Output matches Hugging Face Transformers
- Output Sensibility: Output is coherent and reasonable
- Runtime Functionality: Model loads and runs without errors
- Community Feedback: Relies on user reports
What’s Next
- Installation: Install vLLM on your platform of choice.
- Quick Start: Run your first inference with a supported model in under 5 minutes.
- LoRA Adapters: Fine-tune model behavior with LoRA adapters on top of any supported base model.