Last Updated: 3/19/2026
LoRA Adapters
LoRA (Low-Rank Adaptation) allows you to fine-tune large language models efficiently by training only a small number of additional parameters. vLLM supports serving multiple LoRA adapters on top of a base model with minimal overhead.
What are LoRA Adapters?
LoRA adapters are small, task-specific parameter updates that can be applied to a base model. Instead of fine-tuning the entire model, LoRA trains low-rank matrices that modify the model’s behavior for specific tasks while keeping the base model frozen.
Benefits of LoRA with vLLM:
- Efficient serving: Multiple adapters can be served simultaneously
- Low memory overhead: Adapters are much smaller than full models
- Per-request selection: Different requests can use different adapters
- Fast switching: Minimal latency when switching between adapters
Offline Inference with LoRA
Basic Usage
First, download a LoRA adapter from Hugging Face:
from huggingface_hub import snapshot_download
sql_lora_path = snapshot_download(repo_id="jeeejeee/llama32-3b-text2sql-spider")
Initialize the base model with LoRA support enabled:
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct", enable_lora=True)
Generate text using the LoRA adapter:
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=256,
    stop=["[/assistant]"],
)
prompts = [
    "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
]
outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest("sql_adapter", 1, sql_lora_path),
)
for output in outputs:
    print(output.outputs[0].text)
LoRARequest Parameters
The LoRARequest class takes three required parameters:
- lora_name: Human-readable identifier for the adapter
- lora_int_id: Globally unique integer ID for the adapter (must be > 0)
- lora_path: Path to the LoRA adapter directory
lora_request = LoRARequest(
    lora_name="sql_adapter",
    lora_int_id=1,
    lora_path="/path/to/adapter",
)
Multiple LoRA Adapters
You can use different LoRA adapters for different requests:
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    enable_lora=True,
    max_loras=2,  # Allow up to 2 LoRAs simultaneously
)
# Use adapter 1 for SQL generation
sql_outputs = llm.generate(
    ["Generate SQL for..."],
    SamplingParams(temperature=0),
    lora_request=LoRARequest("sql_adapter", 1, "/path/to/sql-lora"),
)
# Use adapter 2 for code generation
code_outputs = llm.generate(
    ["Write Python code for..."],
    SamplingParams(temperature=0),
    lora_request=LoRARequest("code_adapter", 2, "/path/to/code-lora"),
)
Serving LoRA Adapters
Starting the Server
Start the vLLM server with LoRA support and pre-load adapters:
vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --enable-lora \
    --lora-modules sql-lora=jeeejeee/llama32-3b-text2sql-spider
You can pre-load multiple adapters:
vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --enable-lora \
    --lora-modules sql-lora=jeeejeee/llama32-3b-text2sql-spider \
    --lora-modules code-lora=/path/to/code-adapter
Listing Available Models
Query the /v1/models endpoint to see available models and adapters:
curl http://localhost:8000/v1/models | jq .
Response:
{
  "object": "list",
  "data": [
    {
      "id": "meta-llama/Llama-3.2-3B-Instruct",
      "object": "model",
      ...
    },
    {
      "id": "sql-lora",
      "object": "model",
      ...
    }
  ]
}
Making Requests
Use the LoRA adapter by specifying its name in the model field:
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "sql-lora",
        "prompt": "Write a SQL query to...",
        "max_tokens": 100,
        "temperature": 0
    }'
Or with the Python client:
from openai import OpenAI
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)
completion = client.completions.create(
    model="sql-lora",
    prompt="Write a SQL query to...",
    max_tokens=100,
    temperature=0,
)
print(completion.choices[0].text)
Dynamic LoRA Loading
vLLM supports loading and unloading LoRA adapters at runtime without restarting the server.
Enable Dynamic Loading
Set the environment variable:
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True
Warning: This feature has security implications and should only be used in trusted environments.
Load an Adapter
curl -X POST http://localhost:8000/v1/load_lora_adapter \
    -H "Content-Type: application/json" \
    -d '{
        "lora_name": "new_adapter",
        "lora_path": "/path/to/new-adapter"
    }'
Unload an Adapter
curl -X POST http://localhost:8000/v1/unload_lora_adapter \
    -H "Content-Type: application/json" \
    -d '{
        "lora_name": "new_adapter"
    }'
In-Place Reloading
Replace an existing adapter with updated weights:
curl -X POST http://localhost:8000/v1/load_lora_adapter \
    -H "Content-Type: application/json" \
    -d '{
        "lora_name": "my-adapter",
        "lora_path": "/path/to/adapter/v2",
        "load_inplace": true
    }'
This is useful for continuous training scenarios where adapters are updated frequently.
Configuration Options
Maximum Number of LoRAs
Control how many LoRA adapters can be active simultaneously:
vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --enable-lora \
    --max-loras 4
Maximum LoRA Rank
Set the maximum rank for LoRA adapters. This should match the highest rank among your adapters:
vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --enable-lora \
    --max-lora-rank 64
Important: Setting this too high wastes memory. Use the actual maximum rank of your adapters.
Target Modules
Restrict LoRA to specific model modules for better performance:
vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --enable-lora \
    --lora-target-modules o_proj qkv_proj
This applies LoRA only to the specified modules (e.g., output projection, query-key-value projection).
CPU Offloading
Offload inactive LoRA adapters to CPU memory:
vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --enable-lora \
    --max-loras 4 \
    --max-cpu-loras 8
This allows serving more adapters than fit in GPU memory by swapping them as needed.
Advanced Usage
Specifying Base Model
When loading adapters, you can specify the base model explicitly:
vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --enable-lora \
    --lora-modules '{"name": "sql-lora", "path": "jeeejeee/llama32-3b-text2sql-spider", "base_model_name": "meta-llama/Llama-3.2-3B-Instruct"}'
This creates a proper lineage in the model card, showing the relationship between base model and adapter.
LoRA for Multimodal Models
vLLM experimentally supports LoRA for vision-language models:
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    enable_lora=True,
    max_lora_rank=64,
)
# Use LoRA adapter for multimodal model
outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest("vision_adapter", 1, "/path/to/vision-lora"),
)
Best Practices
- Match LoRA rank: Set --max-lora-rank to the actual maximum rank of your adapters
- Limit active LoRAs: Use --max-loras to control memory usage
- Use CPU offloading: Enable --max-cpu-loras for serving many adapters
- Target specific modules: Use --lora-target-modules to reduce overhead
- Unique IDs: Ensure each LoRA adapter has a unique lora_int_id
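Putting these practices together, a tuned multi-adapter launch might look like the following; the flag values and adapter paths are illustrative and depend on your workload and hardware:

```shell
vllm serve meta-llama/Llama-3.2-3B-Instruct \
    --enable-lora \
    --max-loras 4 \
    --max-lora-rank 64 \
    --max-cpu-loras 8 \
    --lora-modules sql-lora=/path/to/sql-lora \
    --lora-modules code-lora=/path/to/code-lora
```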
What’s Next
- Supported Models: Verify which base model architectures support LoRA in vLLM.
- Engine Configuration: Tune memory and concurrency settings when serving multiple LoRA adapters.
- OpenAI-Compatible Server: Full reference for the HTTP server where LoRA adapters are served alongside the base model.