
Last Updated: 3/19/2026


LoRA Adapters

LoRA (Low-Rank Adaptation) allows you to fine-tune large language models efficiently by training only a small number of additional parameters. vLLM supports serving multiple LoRA adapters on top of a base model with minimal overhead.

What are LoRA Adapters?

LoRA adapters are small, task-specific parameter updates that can be applied to a base model. Instead of fine-tuning the entire model, LoRA trains low-rank matrices that modify the model’s behavior for specific tasks while keeping the base model frozen.
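As an illustration of why this is cheap, here is a minimal arithmetic sketch (hypothetical dimensions, not vLLM code): instead of learning a full d x k weight update for a layer, LoRA learns two low-rank factors B (d x r) and A (r x k), and the effective weight at inference is W + B @ A.

```python
# Minimal sketch of the LoRA parameter savings (dimensions are hypothetical).
d, k, r = 4096, 4096, 16          # layer input/output dims, LoRA rank

full_update_params = d * k        # a full fine-tune touches every entry of the d x k update
lora_params = d * r + r * k       # LoRA trains only the B (d x r) and A (r x k) factors

print(f"full update: {full_update_params:,} params")
print(f"LoRA (rank {r}): {lora_params:,} params "
      f"({lora_params / full_update_params:.2%} of full)")
```

At rank 16 on a 4096-wide layer, the adapter trains well under 1% of the parameters a full fine-tune would touch, which is why many adapters fit alongside one frozen base model.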

Benefits of LoRA with vLLM:

  • Efficient serving: Multiple adapters can be served simultaneously
  • Low memory overhead: Adapters are much smaller than full models
  • Per-request selection: Different requests can use different adapters
  • Fast switching: Minimal latency when switching between adapters

Offline Inference with LoRA

Basic Usage

First, download a LoRA adapter from Hugging Face:

from huggingface_hub import snapshot_download

sql_lora_path = snapshot_download(repo_id="jeeejeee/llama32-3b-text2sql-spider")

Initialize the base model with LoRA support enabled:

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct", enable_lora=True)

Generate text using the LoRA adapter:

sampling_params = SamplingParams(
    temperature=0,
    max_tokens=256,
    stop=["[/assistant]"],
)

prompts = [
    "[user] Write a SQL query to answer the question based on the table schema.\n\n context: CREATE TABLE table_name_74 (icao VARCHAR, airport VARCHAR)\n\n question: Name the ICAO for lilongwe international airport [/user] [assistant]",
]

outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest("sql_adapter", 1, sql_lora_path),
)

for output in outputs:
    print(output.outputs[0].text)

LoRARequest Parameters

The LoRARequest class takes three required parameters:

  • lora_name: Human-readable identifier for the adapter
  • lora_int_id: Globally unique integer ID for the adapter (must be > 0)
  • lora_path: Path to the LoRA adapter directory

lora_request = LoRARequest(
    lora_name="sql_adapter",
    lora_int_id=1,
    lora_path="/path/to/adapter",
)

Multiple LoRA Adapters

You can use different LoRA adapters for different requests:

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    enable_lora=True,
    max_loras=2,  # Allow up to 2 LoRAs simultaneously
)

# Use adapter 1 for SQL generation
sql_outputs = llm.generate(
    ["Generate SQL for..."],
    SamplingParams(temperature=0),
    lora_request=LoRARequest("sql_adapter", 1, "/path/to/sql-lora"),
)

# Use adapter 2 for code generation
code_outputs = llm.generate(
    ["Write Python code for..."],
    SamplingParams(temperature=0),
    lora_request=LoRARequest("code_adapter", 2, "/path/to/code-lora"),
)

Serving LoRA Adapters

Starting the Server

Start the vLLM server with LoRA support and pre-load adapters:

vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --enable-lora \
  --lora-modules sql-lora=jeeejeee/llama32-3b-text2sql-spider

You can pre-load multiple adapters:

vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --enable-lora \
  --lora-modules sql-lora=jeeejeee/llama32-3b-text2sql-spider \
                 code-lora=/path/to/code-adapter

Listing Available Models

Query the /v1/models endpoint to see available models and adapters:

curl http://localhost:8000/v1/models | jq .

Response:

{
  "object": "list",
  "data": [
    {
      "id": "meta-llama/Llama-3.2-3B-Instruct",
      "object": "model",
      ...
    },
    {
      "id": "sql-lora",
      "object": "model",
      ...
    }
  ]
}
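A client can discover which adapters are currently loaded by filtering this response. A minimal sketch using only the standard library (`model_ids` is a hypothetical helper, shown here against a sample payload shaped like the response above, with the elided fields omitted):

```python
import json

def model_ids(models_response: dict) -> list:
    """Extract the ids (base model plus any LoRA adapters) from a /v1/models response."""
    return [entry["id"] for entry in models_response.get("data", [])]

# Sample payload shaped like the /v1/models response above
sample = json.loads("""
{"object": "list",
 "data": [{"id": "meta-llama/Llama-3.2-3B-Instruct", "object": "model"},
          {"id": "sql-lora", "object": "model"}]}
""")
print(model_ids(sample))  # ['meta-llama/Llama-3.2-3B-Instruct', 'sql-lora']
```

In a real client you would fetch the JSON from `http://localhost:8000/v1/models` first; any id other than the base model's is an adapter name you can pass in the `model` field of a request.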

Making Requests

Use the LoRA adapter by specifying its name in the model field:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sql-lora",
    "prompt": "Write a SQL query to...",
    "max_tokens": 100,
    "temperature": 0
  }'

Or with the Python client:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

completion = client.completions.create(
    model="sql-lora",
    prompt="Write a SQL query to...",
    max_tokens=100,
    temperature=0,
)
print(completion.choices[0].text)

Dynamic LoRA Loading

vLLM supports loading and unloading LoRA adapters at runtime without restarting the server.

Enable Dynamic Loading

Set the environment variable:

export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True

Warning: This feature has security implications and should only be used in trusted environments.

Load an Adapter

curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "new_adapter",
    "lora_path": "/path/to/new-adapter"
  }'

Unload an Adapter

curl -X POST http://localhost:8000/v1/unload_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "new_adapter"
  }'

In-Place Reloading

Replace an existing adapter with updated weights:

curl -X POST http://localhost:8000/v1/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{
    "lora_name": "my-adapter",
    "lora_path": "/path/to/adapter/v2",
    "load_inplace": true
  }'

This is useful for continuous training scenarios where adapters are updated frequently.
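In such a pipeline, the reload call can be scripted so each new checkpoint is pushed to the server as soon as it lands. A minimal sketch using only the standard library (`reload_payload` and `reload_adapter` are hypothetical helpers wrapping the endpoint above; the server URL and paths are assumptions):

```python
import json
import urllib.request

def reload_payload(name: str, path: str) -> dict:
    """Build the JSON body for an in-place reload of an existing adapter."""
    return {"lora_name": name, "lora_path": path, "load_inplace": True}

def reload_adapter(base_url: str, name: str, path: str) -> None:
    """POST the reload request to a running vLLM server."""
    req = urllib.request.Request(
        f"{base_url}/v1/load_lora_adapter",
        data=json.dumps(reload_payload(name, path)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)  # raises on HTTP errors

# After each training checkpoint, point the server at the new weights, e.g.:
# reload_adapter("http://localhost:8000", "my-adapter", "/path/to/adapter/v3")
print(reload_payload("my-adapter", "/path/to/adapter/v2"))
```

Because the adapter name stays the same, clients keep sending requests with `"model": "my-adapter"` and transparently pick up the new weights.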

Configuration Options

Maximum Number of LoRAs

Control how many LoRA adapters can be active simultaneously:

vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --enable-lora \
  --max-loras 4

Maximum LoRA Rank

Set the maximum rank for LoRA adapters. This should match the highest rank among your adapters:

vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --enable-lora \
  --max-lora-rank 64

Important: Setting this too high wastes memory. Use the actual maximum rank of your adapters.
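To get a feel for the cost, here is a rough back-of-the-envelope sketch (not vLLM's actual allocator; it assumes each adapted module holds two rank x hidden factors, four adapted projection modules per layer, and fp16 weights, all of which are simplifying assumptions):

```python
def lora_memory_bytes(rank, hidden, num_layers,
                      modules_per_layer=4, bytes_per_param=2):
    """Rough upper bound: each adapted module stores two rank x hidden factors."""
    params_per_module = 2 * rank * hidden
    return params_per_module * modules_per_layer * num_layers * bytes_per_param

# Hypothetical 3B-class model: hidden size 3072, 28 layers, fp16
for r in (16, 64, 256):
    mib = lora_memory_bytes(r, 3072, 28) / 2**20
    print(f"rank {r:>3}: ~{mib:.0f} MiB per adapter slot")
```

The estimate scales linearly with rank, so sizing slots for rank 256 when your adapters are rank 16 reserves roughly 16x more memory than needed.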

Target Modules

Restrict LoRA to specific model modules for better performance:

vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --enable-lora \
  --lora-target-modules o_proj qkv_proj

This applies LoRA only to the specified modules (e.g., output projection, query-key-value projection).

CPU Offloading

Offload inactive LoRA adapters to CPU memory:

vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --enable-lora \
  --max-loras 4 \
  --max-cpu-loras 8

This allows serving more adapters than fit in GPU memory by swapping them as needed.

Advanced Usage

Specifying Base Model

When loading adapters, you can specify the base model explicitly:

vllm serve meta-llama/Llama-3.2-3B-Instruct \
  --enable-lora \
  --lora-modules '{"name": "sql-lora", "path": "jeeejeee/llama32-3b-text2sql-spider", "base_model_name": "meta-llama/Llama-3.2-3B-Instruct"}'

This creates a proper lineage in the model card, showing the relationship between base model and adapter.

LoRA for Multimodal Models

vLLM experimentally supports LoRA for vision-language models:

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    enable_lora=True,
    max_lora_rank=64,
)

# Use LoRA adapter for multimodal model
outputs = llm.generate(
    prompts,
    sampling_params,
    lora_request=LoRARequest("vision_adapter", 1, "/path/to/vision-lora"),
)

Best Practices

  1. Match LoRA rank: Set --max-lora-rank to the actual maximum rank of your adapters
  2. Limit active LoRAs: Use --max-loras to control memory usage
  3. Use CPU offloading: Enable --max-cpu-loras for serving many adapters
  4. Target specific modules: Use --lora-target-modules to reduce overhead
  5. Unique IDs: Ensure each LoRA adapter has a unique lora_int_id

What’s Next

  • Supported Models: Verify which base model architectures support LoRA in vLLM.
  • Engine Configuration: Tune memory and concurrency settings when serving multiple LoRA adapters.
  • OpenAI-Compatible Server: Full reference for the HTTP server where LoRA adapters are served alongside the base model.