Last Updated: 3/19/2026
LoRA Adapters
LoRA (Low-Rank Adaptation) allows you to fine-tune large language models efficiently by training only a small number of additional parameters. vLLM supports serving multiple LoRA adapters on top of a base model with minimal overhead.
What are LoRA Adapters?
LoRA adapters are small, task-specific parameter updates that can be applied to a base model. Instead of fine-tuning the entire model, LoRA trains low-rank matrices that modify the model’s behavior for specific tasks while keeping the base model frozen.
Benefits of LoRA with vLLM:
- Efficient serving: Multiple adapters can be served simultaneously
- Low memory overhead: Adapters are much smaller than full models
- Per-request selection: Different requests can use different adapters
- Fast switching: Minimal latency when switching between adapters
Offline Inference with LoRA
Basic Usage
First, download a LoRA adapter from Hugging Face:
from huggingface_hub import snapshot_download
sql_lora_path = snapshot_download(repo_id="jeeejeee/llama32-3b-text2sql-spider")
Initialize the base model with LoRA support enabled:
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct", enable_lora=True)
Generate text using the LoRA adapter:
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=256,
    stop=["[/assistant]"],
)
prompts = [
    "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
]
outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest("sql_adapter", 1, sql_lora_path),
)
for output in outputs:
    print(output.outputs[0].text)
LoRARequest Parameters
The LoRARequest class takes three required parameters:
- lora_name: Human-readable identifier for the adapter
- lora_int_id: Globally unique integer ID for the adapter (must be > 0)
- lora_path: Path to the LoRA adapter directory
lora_request = LoRARequest(
    lora_name="sql_adapter",
    lora_int_id=1,
    lora_path="/path/to/adapter",
)
Multiple LoRA Adapters
You can use different LoRA adapters for different requests:
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    enable_lora=True,
    max_loras=2,  # Allow up to 2 LoRAs simultaneously
)
# Use adapter 1 for SQL generation
sql_outputs = llm.generate(
    ["Generate SQL for..."],
    SamplingParams(temperature=0),
    lora_request=LoRARequest("sql_adapter", 1, "/path/to/sql-lora"),
)
# Use adapter 2 for code generation
code_outputs = llm.generate(
    ["Write Python code for..."],
    SamplingParams(temperature=0),
    lora_request=LoRARequest("code_adapter", 2, "/path/to/code-lora"),
)
Serving LoRA Adapters
Starting the Server
Start the vLLM server with LoRA support and pre-load adapters:
vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --enable-lora \
    --lora-modules sql-lora=jeeejeee/llama32-3b-text2sql-spider
You can pre-load multiple adapters:
vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --enable-lora \
    --lora-modules sql-lora=jeeejeee/llama32-3b-text2sql-spider \
    --lora-modules code-lora=/path/to/code-adapter
Listing Available Models
Query the /v1/models endpoint to see available models and adapters:
curl http://localhost:8000/v1/models | jq .
Response:
{
  "object": "list",
  "data": [
    {
      "id": "meta-llama/Llama-3.2-3B-Instruct",
      "object": "model",
      ...
    },
    {
      "id": "sql-lora",
      "object": "model",
      ...
    }
  ]
}
Making Requests
Use the LoRA adapter by specifying its name in the model field:
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "sql-lora",
        "prompt": "Write a SQL query to...",
        "max_tokens": 100,
        "temperature": 0
    }'
Or with the Python client:
from openai import OpenAI
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)
completion = client.completions.create(
    model="sql-lora",
    prompt="Write a SQL query to...",
    max_tokens=100,
    temperature=0,
)
print(completion.choices[0].text)
Dynamic LoRA Loading
vLLM supports loading and unloading LoRA adapters at runtime without restarting the server.
Enable Dynamic Loading
Set the environment variable:
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
Warning: This feature has security implications and should only be used in trusted environments.
Load an Adapter
curl -X POST http://localhost:8000/v1/load_lora_adapter \
    -H "Content-Type: application/json" \
    -d '{
        "lora_name": "new_adapter",
        "lora_path": "/path/to/new-adapter"
    }'
Unload an Adapter
curl -X POST http://localhost:8000/v1/unload_lora_adapter \
    -H "Content-Type: application/json" \
    -d '{
        "lora_name": "new_adapter"
    }'
In-Place Reloading
Replace an existing adapter with updated weights:
curl -X POST http://localhost:8000/v1/load_lora_adapter \
    -H "Content-Type: application/json" \
    -d '{
        "lora_name": "my-adapter",
        "lora_path": "/path/to/adapter/v2",
        "load_inplace": true
    }'
This is useful for continuous training scenarios where adapters are updated frequently.
Configuration Options
Maximum Number of LoRAs
Control how many LoRA adapters can be active simultaneously:
vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --enable-lora \
    --max-loras 4
Maximum LoRA Rank
Set the maximum rank for LoRA adapters. This should match the highest rank among your adapters:
vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --enable-lora \
    --max-lora-rank 64
Important: Setting this too high wastes memory. Use the actual maximum rank of your adapters.
Target Modules
Restrict LoRA to specific model modules for better performance:
vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --enable-lora \
    --lora-target-modules o_proj qkv_proj
This applies LoRA only to the specified modules (e.g., output projection, query-key-value projection).
CPU Offloading
Offload inactive LoRA adapters to CPU memory:
vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --enable-lora \
    --max-loras 4 \
    --max-cpu-loras 8
This allows serving more adapters than fit in GPU memory by swapping them as needed.
Advanced Usage
Specifying Base Model
When loading adapters, you can specify the base model explicitly:
vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --enable-lora \
    --lora-modules '{"name": "sql-lora", "path": "jeeejeee/llama32-3b-text2sql-spider", "base_model_name": "meta-llama/Llama-3.2-3B-Instruct"}'
This creates a proper lineage in the model card, showing the relationship between base model and adapter.
LoRA for Multimodal Models
vLLM experimentally supports LoRA for vision-language models:
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    enable_lora=True,
    max_lora_rank=64,
)
# Use LoRA adapter for multimodal model
outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest("vision_adapter", 1, "/path/to/vision-lora"),
)
Best Practices
- Match LoRA rank: Set --max-lora-rank to the actual maximum rank of your adapters
- Limit active LoRAs: Use --max-loras to control memory usage
- Use CPU offloading: Enable --max-cpu-loras for serving many adapters
- Target specific modules: Use --lora-target-modules to reduce overhead
- Unique IDs: Ensure each LoRA adapter has a unique lora_int_id
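Putting these practices together, a tuned multi-adapter launch might look like the following; the flag values and adapter paths are illustrative and depend on your workload and hardware:

```shell
vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --enable-lora \
    --max-loras 4 \
    --max-lora-rank 64 \
    --max-cpu-loras 8 \
    --lora-modules sql-lora=/path/to/sql-lora \
    --lora-modules code-lora=/path/to/code-lora
```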
What’s Next
- Supported Models: Verify which base model architectures support LoRA in vLLM.
- Engine Configuration: Tune memory and concurrency settings when serving multiple LoRA adapters.
- OpenAI-Compatible Server: Full reference for the HTTP server where LoRA adapters are served alongside the base model.