
Last Updated: 3/19/2026


Supported Models

vLLM supports a wide range of generative and embedding models across various model families and architectures.

Checking Model Support

The easiest way to check if your model is supported is to try loading it:

```python
from vllm import LLM

# For generative models
llm = LLM(model="meta-llama/Llama-2-7b-hf")
output = llm.generate("Hello, my name is")
print(output)

# For embedding models
llm = LLM(model="BAAI/bge-large-en-v1.5", task="embed")
output = llm.encode("Hello, world!")
print(output)
```

If vLLM successfully returns output, your model is supported.

Model Families

Generative Models

vLLM natively supports the following popular model families for text generation:

Llama Family:

  • Llama 3.1, Llama 3, Llama 2, LLaMA
  • Examples: meta-llama/Meta-Llama-3.1-70B-Instruct, meta-llama/Llama-2-7b-hf

Qwen Family:

  • Qwen2, Qwen2.5, Qwen3
  • Examples: Qwen/Qwen2-7B-Instruct, Qwen/Qwen2.5-7B-Instruct

Mistral Family:

  • Mistral, Mixtral
  • Examples: mistralai/Mistral-7B-Instruct-v0.1, mistralai/Mixtral-8x7B-Instruct-v0.1

Gemma Family:

  • Gemma, Gemma 2
  • Examples: google/gemma-2-9b-it, google/gemma-7b-it

DeepSeek Family:

  • DeepSeek, DeepSeek-V2, DeepSeek-V3
  • Examples: deepseek-ai/DeepSeek-V3, deepseek-ai/DeepSeek-R1

Other Popular Models:

  • GPT-2, GPT-J, GPT-NeoX
  • OPT, BLOOM
  • Falcon, MPT
  • Phi, Phi-3
  • Yi, Baichuan
  • InternLM, ChatGLM

For a complete list of supported architectures, see the vLLM documentation.

Embedding Models

vLLM supports embedding models for semantic search and retrieval:

Popular Embedding Models:

  • BAAI/bge-large-en-v1.5
  • BAAI/bge-base-en-v1.5
  • intfloat/e5-mistral-7b-instruct
  • Snowflake/snowflake-arctic-embed-m-v1.5

To use an embedding model, specify task="embed":

```python
from vllm import LLM

llm = LLM(model="BAAI/bge-large-en-v1.5", task="embed")
embeddings = llm.encode(["Hello, world!", "How are you?"])
```
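Embeddings returned by `encode` are typically compared with cosine similarity for search and retrieval. A minimal sketch in plain Python, assuming each embedding is a flat list of floats (the exact return type of `encode` varies by vLLM version):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

In practice you would rank documents by similarity to a query embedding.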

Multimodal Models

vLLM supports vision-language models that can process both text and images:

Popular Multimodal Models:

  • LLaVA: llava-hf/llava-1.5-7b-hf
  • Qwen2-VL: Qwen/Qwen2-VL-7B-Instruct
  • InternVL: OpenGVLab/InternVL2-8B
  • Phi-3-Vision: microsoft/Phi-3-vision-128k-instruct

Loading Models

From Hugging Face Hub

By default, vLLM downloads models from Hugging Face:

```python
from vllm import LLM

llm = LLM(model="meta-llama/Llama-2-7b-hf")
```

Models are automatically cached in ~/.cache/huggingface/hub/.
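The cache stores each repo as a directory named `models--<org>--<name>` (a Hugging Face convention, not a vLLM API). A small sketch to list which models are already downloaded:

```python
from pathlib import Path

def list_cached_models(cache_dir=None):
    """List model repos present in the local Hugging Face hub cache."""
    cache = Path(cache_dir) if cache_dir else Path.home() / ".cache" / "huggingface" / "hub"
    if not cache.exists():
        return []
    # Each cached repo is a directory named models--<org>--<name>
    return sorted(
        d.name.removeprefix("models--").replace("--", "/")
        for d in cache.iterdir()
        if d.is_dir() and d.name.startswith("models--")
    )

print(list_cached_models())
```

Deleting a repo's directory from the cache frees its disk space; vLLM will re-download it on the next load.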

From ModelScope

To use models from ModelScope instead:

```shell
export VLLM_USE_MODELSCOPE=True
```

```python
from vllm import LLM

llm = LLM(model="qwen/Qwen-7B", trust_remote_code=True)
```

From Local Path

Load a model from a local directory:

```python
from vllm import LLM

llm = LLM(model="/path/to/model")
```

Model Architectures

Decoder-Only Models

Most generative models use a decoder-only architecture (like GPT):

  • Llama, Mistral, Qwen, Gemma
  • Optimized for text generation
  • Support for various decoding strategies

Encoder-Decoder Models

Some models use an encoder-decoder architecture:

  • BART, T5
  • Supported via plugins
  • Useful for translation and summarization

Mixture-of-Experts (MoE)

vLLM supports MoE models for efficient scaling:

  • Mixtral: mistralai/Mixtral-8x7B-Instruct-v0.1
  • DeepSeek-V2: deepseek-ai/DeepSeek-V2
  • Qwen2-MoE: Qwen/Qwen1.5-MoE-A2.7B

MoE models activate only a subset of parameters per token, enabling larger models with lower compute requirements.
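To make the active-parameter point concrete, here is a back-of-envelope sketch. The shared and per-expert parameter counts below are hypothetical round numbers for illustration, not any real model's figures:

```python
def moe_active_params(shared, expert, num_experts, experts_per_token):
    """Return (total, active) parameter counts for a simple MoE layer stack."""
    total = shared + num_experts * expert
    active = shared + experts_per_token * expert
    return total, active

# Hypothetical model: 2B shared params, 5B per expert, 8 experts, 2 routed per token
total, active = moe_active_params(2e9, 5e9, num_experts=8, experts_per_token=2)
print(f"total={total / 1e9:.0f}B  active={active / 1e9:.0f}B  ({active / total:.0%} active)")
```

The model holds 42B parameters in memory but only computes with 12B per token, which is where the compute savings come from.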

Model Compatibility

Quantization

vLLM supports various quantization methods to reduce memory usage:

Supported Quantization Methods:

  • AWQ: TheBloke/Llama-2-7B-AWQ
  • GPTQ: TheBloke/Llama-2-7B-GPTQ
  • FP8: Native FP8 models
  • BitsAndBytes: 4-bit and 8-bit quantization
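Quantization mainly shrinks weight memory in proportion to bits per weight. A rough estimate that ignores KV cache, activations, and per-method overhead (plain arithmetic, not a vLLM API):

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight-only memory in GiB."""
    return num_params * bits_per_weight / 8 / 2**30

# A 7B-parameter model: FP16 vs 4-bit (e.g. AWQ/GPTQ)
print(f"FP16:  {weight_memory_gib(7e9, 16):.1f} GiB")
print(f"4-bit: {weight_memory_gib(7e9, 4):.1f} GiB")
```

Going from 16-bit to 4-bit weights cuts the weight footprint by 4x, which is often the difference between fitting on a single GPU or not.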

Example with quantized model:

```python
from vllm import LLM

llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")
```

Trust Remote Code

Some models require executing remote code from the model repository:

```python
from vllm import LLM

llm = LLM(model="Qwen/Qwen-7B", trust_remote_code=True)
```

Use this flag only with models from trusted sources.

Checking Model Architecture

To determine if a model is supported, check its config.json file on Hugging Face. Look for the "architectures" field:

```json
{
  "architectures": ["LlamaForCausalLM"],
  ...
}
```

If the architecture is listed in vLLM’s supported models, it should work.
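Reading the field programmatically is just a JSON parse. A minimal sketch over the text of a downloaded `config.json` (the inline example string here is illustrative):

```python
import json

def architectures_from_config(config_text: str) -> list:
    """Extract the "architectures" list from a model's config.json contents."""
    return json.loads(config_text).get("architectures", [])

config = '{"architectures": ["LlamaForCausalLM"], "model_type": "llama"}'
print(architectures_from_config(config))  # ['LlamaForCausalLM']
```

You could then compare the result against the architecture names in vLLM's supported-models list.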

Transformers Backend

vLLM also supports models through the Transformers modeling backend. This allows running models that aren’t natively implemented in vLLM:

```python
from vllm import LLM

llm = LLM(model="your-model", model_impl="transformers")
```

The Transformers backend provides:

  • Support for encoder-only, decoder-only, and MoE architectures
  • Compatibility with most Transformers features
  • Performance within 5% of native vLLM implementations

Model Support Policy

vLLM follows a community-driven approach to model support:

  1. Community Contributions: We welcome PRs for new models
  2. Best-Effort Consistency: We aim for functional models with sensible outputs
  3. Issue Resolution: Report bugs via GitHub issues
  4. Monitoring: Track changes in the vllm/model_executor/models/ directory

Models are tested at different levels:

  • Strict Consistency: Output matches Hugging Face Transformers
  • Output Sensibility: Output is coherent and reasonable
  • Runtime Functionality: Model loads and runs without errors
  • Community Feedback: Relies on user reports

What’s Next

  • Installation: Install vLLM on your platform of choice.
  • Quick Start: Run your first inference with a supported model in under 5 minutes.
  • LoRA Adapters: Fine-tune model behavior with LoRA adapters on top of any supported base model.