Last Updated: 3/19/2026
Offline Inference
Offline inference allows you to run batch inference in your own code without starting a server. This is ideal for processing large datasets, running evaluations, or experimenting with models.
Basic Usage
The LLM class is the main interface for offline inference. Here’s a simple example:
from vllm import LLM, SamplingParams
# Initialize the LLM
llm = LLM(model="facebook/opt-125m")
# Define prompts
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]
# Configure sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Generate outputs
outputs = llm.generate(prompts, sampling_params)
# Process results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated: {generated_text!r}")
LLM Class Initialization
The LLM class accepts various parameters to configure the model and engine:
from vllm import LLM
llm = LLM(
    model="facebook/opt-125m",
    dtype="auto",                 # Data type: auto, float16, bfloat16, float32
    gpu_memory_utilization=0.9,   # Fraction of GPU memory to use
    max_model_len=2048,           # Maximum sequence length
    tensor_parallel_size=1,       # Number of GPUs for tensor parallelism
    trust_remote_code=False,      # Whether to trust remote code
)
Common Initialization Parameters
- model: Model name or path (from Hugging Face or local directory)
- dtype: Data type for model weights (auto, float16, bfloat16, float32)
- gpu_memory_utilization: Fraction of GPU memory to use (default: 0.9)
- max_model_len: Maximum sequence length the model can handle
- tensor_parallel_size: Number of GPUs for tensor parallelism
- pipeline_parallel_size: Number of pipeline stages for pipeline parallelism
- trust_remote_code: Whether to trust and execute remote code from model repos
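To build intuition for what gpu_memory_utilization controls: vLLM pre-allocates that fraction of each GPU's memory for weights, activations, and the KV cache. The numbers below are purely illustrative, not measured values:

```python
# Hypothetical example of the memory budget implied by
# gpu_memory_utilization. Numbers are illustrative only.
total_gib = 24.0              # e.g. a GPU with 24 GiB of memory
gpu_memory_utilization = 0.9  # the default fraction

budget_gib = total_gib * gpu_memory_utilization
print(f"vLLM memory budget: {budget_gib:.1f} GiB")  # 21.6 GiB
```

Lowering this value leaves headroom for other processes on the same GPU; raising it gives vLLM a larger KV cache at the risk of out-of-memory errors.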
The generate() Method
The generate() method processes a list of prompts and returns completions:
outputs = llm.generate(prompts, sampling_params)
Input Format
prompts can be:
- A single string: "Hello, world!"
- A list of strings: ["Hello", "Hi there"]
- A list of token IDs: [[1, 2, 3], [4, 5, 6]]
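The three shapes above can all be viewed as lists of prompts. The helper below is an illustrative plain-Python sketch of that normalization (normalize_prompts is a made-up name, not part of the vLLM API):

```python
# Illustrative sketch: fold the three accepted prompt shapes
# into a uniform list. Not part of the vLLM API.
def normalize_prompts(prompts):
    if isinstance(prompts, str):
        return [prompts]   # a single string is one prompt
    if prompts and isinstance(prompts[0], int):
        return [prompts]   # a single token-ID list is one prompt
    return list(prompts)   # already a list of strings or token-ID lists

assert normalize_prompts("Hello, world!") == ["Hello, world!"]
assert normalize_prompts(["Hello", "Hi there"]) == ["Hello", "Hi there"]
assert normalize_prompts([[1, 2, 3], [4, 5, 6]]) == [[1, 2, 3], [4, 5, 6]]
```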
Output Format
The method returns a list of RequestOutput objects. Each RequestOutput contains:
- request_id: Unique identifier for the request
- prompt: The original prompt string
- prompt_token_ids: Token IDs of the prompt
- outputs: List of CompletionOutput objects (one per generated sequence)
- finished: Whether generation is complete
Each CompletionOutput contains:
- index: Index of this output in the request
- text: Generated text
- token_ids: Token IDs of generated text
- cumulative_logprob: Cumulative log probability of the sequence
- logprobs: Log probabilities of tokens (if requested)
- finish_reason: Why generation stopped (e.g., “length”, “stop”)
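The field lists above can be mirrored with plain dataclasses to show how you would typically walk the results; these are stand-ins for illustration only, not vLLM's real RequestOutput and CompletionOutput classes:

```python
from dataclasses import dataclass
from typing import Optional

# Stand-ins mirroring the fields listed above (not vLLM's real classes).
@dataclass
class CompletionOutput:
    index: int
    text: str
    token_ids: list
    cumulative_logprob: float
    finish_reason: Optional[str] = None

@dataclass
class RequestOutput:
    request_id: str
    prompt: str
    prompt_token_ids: list
    outputs: list
    finished: bool = True

# Walk the structure the same way you would with real vLLM outputs.
out = RequestOutput(
    request_id="req-0",
    prompt="The capital of France is",
    prompt_token_ids=[464, 3139, 286],  # made-up token IDs
    outputs=[CompletionOutput(0, " Paris.", [6342, 13], -1.2, "stop")],
)
best = out.outputs[0]
print(best.text, best.finish_reason)
```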
The chat() Method
For instruction-tuned or chat models, use the chat() method with message-based input:
from vllm import LLM, SamplingParams
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
sampling_params = SamplingParams(temperature=0.7, top_p=0.9)
messages_list = [
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    [
        {"role": "user", "content": "Explain quantum computing in simple terms."},
    ],
]
outputs = llm.chat(messages_list, sampling_params)
for output in outputs:
    generated_text = output.outputs[0].text
    print(f"Response: {generated_text}")
The chat() method automatically applies the model’s chat template, ensuring proper formatting for instruction-following models.
The encode() Method
For embedding models, use the encode() method to generate embeddings:
from vllm import LLM
llm = LLM(model="BAAI/bge-large-en-v1.5", task="embed")
texts = [
"Hello, world!",
"How are you?",
"Machine learning is fascinating.",
]
outputs = llm.encode(texts)
for output in outputs:
    embedding = output.outputs.embedding
    print(f"Embedding dimension: {len(embedding)}")
SamplingParams Configuration
The SamplingParams class controls text generation behavior:
from vllm import SamplingParams
sampling_params = SamplingParams(
    n=1,                     # Number of sequences to generate per prompt
    temperature=0.8,         # Sampling temperature (0.0 = greedy)
    top_p=0.95,              # Nucleus sampling threshold
    top_k=50,                # Top-k sampling (0 = disabled)
    max_tokens=100,          # Maximum tokens to generate
    stop=[".", "!", "?"],    # Stop strings
    presence_penalty=0.0,    # Penalize tokens that have appeared
    frequency_penalty=0.0,   # Penalize tokens based on frequency
    repetition_penalty=1.0,  # Penalize repeated tokens
    logprobs=None,           # Number of log probabilities to return
)
Key Sampling Parameters
Temperature: Controls randomness in sampling
- 0.0: Greedy decoding (deterministic)
- 0.1-0.7: More focused and coherent
- 0.8-1.0: Balanced creativity
- >1.0: More random and diverse
Top-p (nucleus sampling): Cumulative probability threshold
- 0.9-0.95: Good balance for most tasks
- 1.0: Consider all tokens
Top-k: Limit sampling to the top-k most likely tokens
- 0 or -1: Disabled
- 50-100: Common values for creative tasks
Max tokens: Maximum number of tokens to generate
- Default: 16
- Set higher for longer outputs
Stop sequences: Strings that stop generation when encountered
- Example: stop=[".", "\n\n", "END"]
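Temperature's effect is easiest to see in the softmax it rescales: dividing the logits by T before normalizing sharpens the distribution when T < 1 and flattens it when T > 1. A small self-contained sketch (the logit values are arbitrary):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, scaled by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for t in (0.1, 0.8, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
# Low T concentrates mass on the top token (approaching greedy);
# high T spreads probability more evenly across tokens.
```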
Multiple Outputs per Prompt
Generate multiple completions for each prompt by setting n > 1:
from vllm import LLM, SamplingParams
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(
    n=3,              # Generate 3 completions per prompt
    temperature=0.9,
    max_tokens=50,
)
prompts = ["Once upon a time"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt}")
    for i, completion in enumerate(output.outputs):
        print(f"  Completion {i + 1}: {completion.text}")
Log Probabilities
Request log probabilities for generated tokens:
from vllm import LLM, SamplingParams
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(
    temperature=0.8,
    logprobs=5,         # Return top 5 log probabilities per token
    prompt_logprobs=5,  # Return log probabilities for prompt tokens
)
outputs = llm.generate(["The capital of France is"], sampling_params)
for output in outputs:
    # Prompt log probabilities
    if output.prompt_logprobs:
        print("Prompt logprobs:", output.prompt_logprobs)
    # Generated token log probabilities
    for completion in output.outputs:
        if completion.logprobs:
            print("Token logprobs:", completion.logprobs)
Loading Models from ModelScope
By default, vLLM downloads models from Hugging Face. To use ModelScope instead:
import os
os.environ["VLLM_USE_MODELSCOPE"] = "True"
from vllm import LLM
llm = LLM(model="qwen/Qwen-7B")
Distributed Inference
For large models that don’t fit on a single GPU, use tensor parallelism:
from vllm import LLM
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,       # Use 4 GPUs
    gpu_memory_utilization=0.95,
)
Or pipeline parallelism:
from vllm import LLM
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    pipeline_parallel_size=4,  # Use 4 pipeline stages
)
What’s Next
- OpenAI-Compatible Server: Serve models over HTTP with an OpenAI-compatible API for multi-client access.
- Engine Configuration: Control memory, concurrency, and model loading with EngineArgs.
- Supported Models: See the full list of model families vLLM supports.