Last Updated: 3/19/2026
Offline Inference
Offline inference allows you to run batch inference in your own code without starting a server. This is ideal for processing large datasets, running evaluations, or experimenting with models.
Basic Usage
The LLM class is the main interface for offline inference. Here’s a simple example:
from vllm import LLM, SamplingParams
# Initialize the LLM
llm = LLM(model="facebook/opt-125m")
# Define prompts
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
# Configure sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Generate outputs
outputs = llm.generate(prompts, sampling_params)
# Process results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated: {generated_text!r}")
LLM Class Initialization
The LLM class accepts various parameters to configure the model and engine:
from vllm import LLM
llm = LLM(
    model="facebook/opt-125m",
    dtype="auto",                 # Data type: auto, float16, bfloat16, float32
    gpu_memory_utilization=0.9,   # Fraction of GPU memory to use
    max_model_len=2048,           # Maximum sequence length
    tensor_parallel_size=1,       # Number of GPUs for tensor parallelism
    trust_remote_code=False,      # Whether to trust remote code
)
Common Initialization Parameters
- model: Model name or path (from Hugging Face or local directory)
- dtype: Data type for model weights (auto, float16, bfloat16, float32)
- gpu_memory_utilization: Fraction of GPU memory to use (default: 0.9)
- max_model_len: Maximum sequence length the model can handle
- tensor_parallel_size: Number of GPUs for tensor parallelism
- pipeline_parallel_size: Number of pipeline stages for pipeline parallelism
- trust_remote_code: Whether to trust and execute remote code from model repos
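To build intuition for what gpu_memory_utilization controls: vLLM pre-allocates that fraction of each GPU's memory for weights, activations, and the KV cache. The numbers below are purely illustrative, not measured values:

```python
# Hypothetical example of the memory budget implied by
# gpu_memory_utilization. Numbers are illustrative only.
total_gib = 24.0              # e.g. a GPU with 24 GiB of memory
gpu_memory_utilization = 0.9  # the default fraction

budget_gib = total_gib * gpu_memory_utilization
print(f"vLLM memory budget: {budget_gib:.1f} GiB")  # 21.6 GiB
```

Lowering this value leaves headroom for other processes on the same GPU; raising it gives vLLM a larger KV cache at the risk of out-of-memory errors.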
The generate() Method
The generate() method processes a list of prompts and returns completions:
outputs = llm.generate(prompts, sampling_params)
Input Format
prompts can be:
- A single string: "Hello, world!"
- A list of strings: ["Hello", "Hi there"]
- A list of token IDs: [[1, 2, 3], [4, 5, 6]]
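The three shapes above can all be viewed as lists of prompts. The helper below is an illustrative plain-Python sketch of that normalization (normalize_prompts is a made-up name, not part of the vLLM API):

```python
# Illustrative sketch: fold the three accepted prompt shapes
# into a uniform list. Not part of the vLLM API.
def normalize_prompts(prompts):
    if isinstance(prompts, str):
        return [prompts]   # a single string is one prompt
    if prompts and isinstance(prompts[0], int):
        return [prompts]   # a single token-ID list is one prompt
    return list(prompts)   # already a list of strings or token-ID lists

assert normalize_prompts("Hello, world!") == ["Hello, world!"]
assert normalize_prompts(["Hello", "Hi there"]) == ["Hello", "Hi there"]
assert normalize_prompts([[1, 2, 3], [4, 5, 6]]) == [[1, 2, 3], [4, 5, 6]]
```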
Output Format
The method returns a list of RequestOutput objects. Each RequestOutput contains:
- request_id: Unique identifier for the request
- prompt: The original prompt string
- prompt_token_ids: Token IDs of the prompt
- outputs: List of CompletionOutput objects (one per generated sequence)
- finished: Whether generation is complete
Each CompletionOutput contains:
- index: Index of this output in the request
- text: Generated text
- token_ids: Token IDs of generated text
- cumulative_logprob: Cumulative log probability of the sequence
- logprobs: Log probabilities of tokens (if requested)
- finish_reason: Why generation stopped (e.g., “length”, “stop”)
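The field lists above can be mirrored with plain dataclasses to show how you would typically walk the results; these are stand-ins for illustration only, not vLLM's real RequestOutput and CompletionOutput classes:

```python
from dataclasses import dataclass
from typing import Optional

# Stand-ins mirroring the fields listed above (not vLLM's real classes).
@dataclass
class CompletionOutput:
    index: int
    text: str
    token_ids: list
    cumulative_logprob: float
    finish_reason: Optional[str] = None

@dataclass
class RequestOutput:
    request_id: str
    prompt: str
    prompt_token_ids: list
    outputs: list
    finished: bool = True

# Walk the structure the same way you would with real vLLM outputs.
out = RequestOutput(
    request_id="req-0",
    prompt="The capital of France is",
    prompt_token_ids=[464, 3139, 286],  # made-up token IDs
    outputs=[CompletionOutput(0, " Paris.", [6342, 13], -1.2, "stop")],
)
best = out.outputs[0]
print(best.text, best.finish_reason)
```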
The chat() Method
For instruction-tuned or chat models, use the chat() method with message-based input:
from vllm import LLM, SamplingParams
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
sampling_params = SamplingParams(temperature=0.7, top_p=0.9)
messages_list = [
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    [
        {"role": "user", "content": "Explain quantum computing in simple terms."},
    ],
]
outputs = llm.chat(messages_list, sampling_params)
for output in outputs:
    generated_text = output.outputs[0].text
    print(f"Response: {generated_text}")
The chat() method automatically applies the model’s chat template, ensuring proper formatting for instruction-following models.
The encode() Method
For embedding models, use the encode() method to generate embeddings:
from vllm import LLM
llm = LLM(model="BAAI/bge-large-en-v1.5", task="embed")
texts = [
"Hello, world!",
"How are you?",
"Machine learning is fascinating.",
]
outputs = llm.encode(texts)
for output in outputs:
    embedding = output.outputs.embedding
    print(f"Embedding dimension: {len(embedding)}")
SamplingParams Configuration
The SamplingParams class controls text generation behavior:
from vllm import SamplingParams
sampling_params = SamplingParams(
    n=1,                     # Number of sequences to generate per prompt
    temperature=0.8,         # Sampling temperature (0.0 = greedy)
    top_p=0.95,              # Nucleus sampling threshold
    top_k=50,                # Top-k sampling (0 = disabled)
    max_tokens=100,          # Maximum tokens to generate
    stop=[".", "!", "?"],    # Stop strings
    presence_penalty=0.0,    # Penalize tokens that have appeared
    frequency_penalty=0.0,   # Penalize tokens based on frequency
    repetition_penalty=1.0,  # Penalize repeated tokens
    logprobs=None,           # Number of log probabilities to return
)
Key Sampling Parameters
Temperature: Controls randomness in sampling
- 0.0: Greedy decoding (deterministic)
- 0.1-0.7: More focused and coherent
- 0.8-1.0: Balanced creativity
- >1.0: More random and diverse
Top-p (nucleus sampling): Cumulative probability threshold
- 0.9-0.95: Good balance for most tasks
- 1.0: Consider all tokens
Top-k: Limit sampling to the top-k most likely tokens
- 0 or -1: Disabled
- 50-100: Common values for creative tasks
Max tokens: Maximum number of tokens to generate
- Default: 16
- Set higher for longer outputs
Stop sequences: Strings that stop generation when encountered
- Example: stop=[".", "\n\n", "END"]
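Temperature's effect is easiest to see in the softmax it rescales: dividing the logits by T before normalizing sharpens the distribution when T < 1 and flattens it when T > 1. A small self-contained sketch (the logit values are arbitrary):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, scaled by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for t in (0.1, 0.8, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
# Low T concentrates mass on the top token (approaching greedy);
# high T spreads probability more evenly across tokens.
```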
Multiple Outputs per Prompt
Generate multiple completions for each prompt by setting n > 1:
from vllm import LLM, SamplingParams
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(
    n=3,              # Generate 3 completions per prompt
    temperature=0.9,
    max_tokens=50,
)
prompts = ["Once upon a time"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt}")
    for i, completion in enumerate(output.outputs):
        print(f"  Completion {i + 1}: {completion.text}")
Log Probabilities
Request log probabilities for generated tokens:
from vllm import LLM, SamplingParams
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(
    temperature=0.8,
    logprobs=5,         # Return top 5 log probabilities per token
    prompt_logprobs=5,  # Return log probabilities for prompt tokens
)
outputs = llm.generate(["The capital of France is"], sampling_params)
for output in outputs:
    # Prompt log probabilities
    if output.prompt_logprobs:
        print("Prompt logprobs:", output.prompt_logprobs)
    # Generated token log probabilities
    for completion in output.outputs:
        if completion.logprobs:
            print("Token logprobs:", completion.logprobs)
Loading Models from ModelScope
By default, vLLM downloads models from Hugging Face. To use ModelScope instead:
import os
os.environ["VLLM_USE_MODELSCOPE"] = "True"
from vllm import LLM
llm = LLM(model="qwen/Qwen-7B")
Distributed Inference
For large models that don’t fit on a single GPU, use tensor parallelism:
from vllm import LLM
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,       # Use 4 GPUs
    gpu_memory_utilization=0.95,
)
Or pipeline parallelism:
from vllm import LLM
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    pipeline_parallel_size=4,  # Use 4 pipeline stages
)
What’s Next
- OpenAI-Compatible Server: Serve models over HTTP with an OpenAI-compatible API for multi-client access.
- Engine Configuration: Control memory, concurrency, and model loading with EngineArgs.
- Supported Models: See the full list of model families vLLM supports.