Last Updated: 3/19/2026
Supported Models
vLLM supports a wide range of generative and embedding models across various model families and architectures.
Checking Model Support
The easiest way to check if your model is supported is to try loading it:
```python
from vllm import LLM

# For generative models
llm = LLM(model="meta-llama/Llama-2-7b-hf")
output = llm.generate("Hello, my name is")
print(output)

# For embedding models
llm = LLM(model="BAAI/bge-large-en-v1.5", task="embed")
output = llm.encode("Hello, world!")
print(output)
```

If vLLM successfully returns output, your model is supported.
Model Families
Generative Models
vLLM natively supports the following popular model families for text generation:
Llama Family:
- Llama 3.1, Llama 3, Llama 2, LLaMA
- Examples:
meta-llama/Meta-Llama-3.1-70B-Instruct, meta-llama/Llama-2-7b-hf
Qwen Family:
- Qwen2, Qwen2.5, Qwen3
- Examples:
Qwen/Qwen2-7B-Instruct, Qwen/Qwen2.5-7B-Instruct
Mistral Family:
- Mistral, Mixtral
- Examples:
mistralai/Mistral-7B-Instruct-v0.1, mistralai/Mixtral-8x7B-Instruct-v0.1
Gemma Family:
- Gemma, Gemma 2
- Examples:
google/gemma-2-9b-it, google/gemma-7b-it
DeepSeek Family:
- DeepSeek, DeepSeek-V2, DeepSeek-V3
- Examples:
deepseek-ai/DeepSeek-V3, deepseek-ai/DeepSeek-R1
Other Popular Models:
- GPT-2, GPT-J, GPT-NeoX
- OPT, BLOOM
- Falcon, MPT
- Phi, Phi-3
- Yi, Baichuan
- InternLM, ChatGLM
For a complete list of supported architectures, see the vLLM documentation.
Embedding Models
vLLM supports embedding models for semantic search and retrieval:
Popular Embedding Models:
- BAAI/bge-large-en-v1.5
- BAAI/bge-base-en-v1.5
- intfloat/e5-mistral-7b-instruct
- Snowflake/snowflake-arctic-embed-m-v1.5
To use an embedding model, specify task="embed":
```python
from vllm import LLM

llm = LLM(model="BAAI/bge-large-en-v1.5", task="embed")
embeddings = llm.encode(["Hello, world!", "How are you?"])
```

Multimodal Models
vLLM supports vision-language models that can process both text and images:
Popular Multimodal Models:
- LLaVA: llava-hf/llava-1.5-7b-hf
- Qwen2-VL: Qwen/Qwen2-VL-7B-Instruct
- InternVL: OpenGVLab/InternVL2-8B
- Phi-3-Vision: microsoft/Phi-3-vision-128k-instruct
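For these models, image inputs are passed alongside the text prompt. As a sketch of the request shape that vLLM's generate() accepts for vision-language models (the `<image>` placeholder and chat format below assume LLaVA-1.5's template; other model families use different templates):

```python
def build_llava_request(question, image):
    """Assemble a vLLM multimodal request for a LLaVA-1.5-style model.

    In practice `image` would be a PIL.Image; passing the resulting dict
    to llm.generate() runs vision-language inference on it.
    """
    return {
        "prompt": f"USER: <image>\n{question} ASSISTANT:",
        "multi_modal_data": {"image": image},
    }

request = build_llava_request("What is in this picture?", image=None)  # None as a stand-in
print(request["prompt"])
```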
Loading Models
From Hugging Face Hub
By default, vLLM downloads models from Hugging Face:
```python
from vllm import LLM

llm = LLM(model="meta-llama/Llama-2-7b-hf")
```

Models are automatically cached in ~/.cache/huggingface/hub/.
From ModelScope
To use models from ModelScope instead:
```shell
export VLLM_USE_MODELSCOPE=True
```

```python
from vllm import LLM

llm = LLM(model="qwen/Qwen-7B", trust_remote_code=True)
```

From Local Path
Load a model from a local directory:
```python
from vllm import LLM

llm = LLM(model="/path/to/model")
```

Model Architectures
Decoder-Only Models
Most generative models use a decoder-only architecture (like GPT):
- Llama, Mistral, Qwen, Gemma
- Optimized for text generation
- Support for various decoding strategies
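In vLLM, the knobs behind these decoding strategies (temperature, top-p, and so on) are set via SamplingParams. As a toy, pure-Python illustration of what temperature does to the next-token distribution (the logit values here are made up):

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # shift by the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up next-token logits
print(softmax(logits, temperature=1.0))  # moderately spread distribution
print(softmax(logits, temperature=0.1))  # near-greedy: mass piles onto the argmax
```

At temperature near zero, sampling degenerates into greedy decoding; higher temperatures flatten the distribution and increase output diversity.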
Encoder-Decoder Models
Some models use an encoder-decoder architecture:
- BART, T5
- Supported via plugins
- Useful for translation and summarization
Mixture-of-Experts (MoE)
vLLM supports MoE models for efficient scaling:
- Mixtral: mistralai/Mixtral-8x7B-Instruct-v0.1
- DeepSeek-V2: deepseek-ai/DeepSeek-V2
- Qwen2-MoE: Qwen/Qwen1.5-MoE-A2.7B
MoE models activate only a subset of parameters per token, enabling larger models with lower compute requirements.
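The saving can be sketched with back-of-the-envelope, Mixtral-8x7B-style numbers; the parameter counts below are illustrative assumptions, not exact checkpoint sizes:

```python
# Rough MoE active-parameter count (illustrative numbers).
num_experts = 8          # experts per MoE layer
top_k = 2                # experts routed per token
expert_params = 45.0e9   # assumed: parameters across all experts combined
shared_params = 1.7e9    # assumed: attention, embeddings, router, norms

total = shared_params + expert_params
active = shared_params + expert_params * top_k / num_experts  # only top-k experts run
print(f"total: {total / 1e9:.1f}B, active per token: {active / 1e9:.2f}B")
```

With top-2 routing over 8 experts, only a quarter of the expert weights participate in any one forward pass, so per-token compute tracks the much smaller active count.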
Model Compatibility
Quantization
vLLM supports various quantization methods to reduce memory usage:
Supported Quantization Methods:
- AWQ: TheBloke/Llama-2-7B-AWQ
- GPTQ: TheBloke/Llama-2-7B-GPTQ
- FP8: Native FP8 models
- BitsAndBytes: 4-bit and 8-bit quantization
Example with quantized model:
```python
from vllm import LLM

llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")
```

Trust Remote Code
Some models require executing remote code from the model repository:
```python
from vllm import LLM

llm = LLM(model="Qwen/Qwen-7B", trust_remote_code=True)
```

Use this flag only with models from trusted sources.
Checking Model Architecture
To determine if a model is supported, check its config.json file on Hugging Face. Look for the "architectures" field:
```json
{
  "architectures": ["LlamaForCausalLM"],
  ...
}
```

If the architecture is listed in vLLM’s supported models, it should work.
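This check is easy to script against a downloaded config.json. A small helper, where the architecture set is an abbreviated, illustrative subset (consult the vLLM documentation for the authoritative list):

```python
import json

# Illustrative subset of architectures vLLM supports natively.
KNOWN_ARCHITECTURES = {
    "LlamaForCausalLM",
    "Qwen2ForCausalLM",
    "MistralForCausalLM",
    "GemmaForCausalLM",
}

def is_supported(config):
    """Return True if any entry in config['architectures'] is known to vLLM."""
    return any(arch in KNOWN_ARCHITECTURES for arch in config.get("architectures", []))

config = json.loads('{"architectures": ["LlamaForCausalLM"]}')
print(is_supported(config))  # True
```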
Transformers Backend
vLLM also supports models through the Transformers modeling backend. This allows running models that aren’t natively implemented in vLLM:
```python
from vllm import LLM

llm = LLM(model="your-model", model_impl="transformers")
```

The Transformers backend provides:
- Support for encoder-only, decoder-only, and MoE architectures
- Compatibility with most Transformers features
- Performance within 5% of native vLLM implementations
Model Support Policy
vLLM follows a community-driven approach to model support:
- Community Contributions: We welcome PRs for new models
- Best-Effort Consistency: We aim for functional models with sensible outputs
- Issue Resolution: Report bugs via GitHub issues
- Monitoring: Track changes in the vllm/model_executor/models/ directory
Models are tested at different levels:
- Strict Consistency: Output matches Hugging Face Transformers
- Output Sensibility: Output is coherent and reasonable
- Runtime Functionality: Model loads and runs without errors
- Community Feedback: Relies on user reports
What’s Next
- Installation: Install vLLM on your platform of choice.
- Quick Start: Run your first inference with a supported model in under 5 minutes.
- LoRA Adapters: Fine-tune model behavior with LoRA adapters on top of any supported base model.