
Last Updated: 3/19/2026


Offline Inference

Offline inference allows you to run batch inference in your own code without starting a server. This is ideal for processing large datasets, running evaluations, or experimenting with models.

Basic Usage

The LLM class is the main interface for offline inference. Here’s a simple example:

```python
from vllm import LLM, SamplingParams

# Initialize the LLM
llm = LLM(model="facebook/opt-125m")

# Define prompts
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

# Configure sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Generate outputs
outputs = llm.generate(prompts, sampling_params)

# Process results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated: {generated_text!r}")
```

LLM Class Initialization

The LLM class accepts various parameters to configure the model and engine:

```python
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",
    dtype="auto",                # Data type: auto, float16, bfloat16, float32
    gpu_memory_utilization=0.9,  # Fraction of GPU memory to use
    max_model_len=2048,          # Maximum sequence length
    tensor_parallel_size=1,      # Number of GPUs for tensor parallelism
    trust_remote_code=False,     # Whether to trust remote code
)
```

Common Initialization Parameters

  • model: Model name or path (from Hugging Face or local directory)
  • dtype: Data type for model weights (auto, float16, bfloat16, float32)
  • gpu_memory_utilization: Fraction of GPU memory to use (default: 0.9)
  • max_model_len: Maximum sequence length the model can handle
  • tensor_parallel_size: Number of GPUs for tensor parallelism
  • pipeline_parallel_size: Number of pipeline stages for pipeline parallelism
  • trust_remote_code: Whether to trust and execute remote code from model repos

The generate() Method

The generate() method processes a list of prompts and returns completions:

```python
outputs = llm.generate(prompts, sampling_params)
```

Input Format

The prompts argument can be:

  • A single string: "Hello, world!"
  • A list of strings: ["Hello", "Hi there"]
  • A list of token IDs: [[1, 2, 3], [4, 5, 6]]
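As a rough sketch of the accepted shapes (the helper name normalize_prompts is hypothetical, not part of the vLLM API), every form can be normalized to a batch, which is how it is convenient to think about generate()'s input:

```python
# Illustration only: normalize the three documented prompt shapes
# (a single string, a list of strings, a list of token-ID lists)
# into a uniform batch.

def normalize_prompts(prompts):
    """Return a list of prompts regardless of which accepted shape was given."""
    if isinstance(prompts, str):
        return [prompts]      # single string -> a batch of one
    return list(prompts)      # already a batch (strings or token-ID lists)

print(normalize_prompts("Hello, world!"))         # ['Hello, world!']
print(normalize_prompts(["Hello", "Hi there"]))   # ['Hello', 'Hi there']
print(normalize_prompts([[1, 2, 3], [4, 5, 6]]))  # [[1, 2, 3], [4, 5, 6]]
```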

Output Format

The method returns a list of RequestOutput objects. Each RequestOutput contains:

  • request_id: Unique identifier for the request
  • prompt: The original prompt string
  • prompt_token_ids: Token IDs of the prompt
  • outputs: List of CompletionOutput objects (one per generated sequence)
  • finished: Whether generation is complete

Each CompletionOutput contains:

  • index: Index of this output in the request
  • text: Generated text
  • token_ids: Token IDs of generated text
  • cumulative_logprob: Cumulative log probability of the sequence
  • logprobs: Log probabilities of tokens (if requested)
  • finish_reason: Why generation stopped (e.g., “length”, “stop”)
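These fields are straightforward to post-process once generation finishes. The sketch below uses simplified stand-in dataclasses (carrying only a few of the fields listed above, not vLLM's actual classes) to show one common pattern: flattening nested results into per-completion rows:

```python
from dataclasses import dataclass, field

# Simplified stand-ins for vLLM's RequestOutput / CompletionOutput,
# with only the fields used below.
@dataclass
class CompletionOutput:
    index: int
    text: str
    finish_reason: str

@dataclass
class RequestOutput:
    request_id: str
    prompt: str
    outputs: list = field(default_factory=list)

def flatten_results(request_outputs):
    """Turn nested request outputs into flat (prompt, text, finish_reason) rows."""
    rows = []
    for req in request_outputs:
        for comp in req.outputs:
            rows.append((req.prompt, comp.text, comp.finish_reason))
    return rows

demo = [
    RequestOutput("req-0", "Hello, my name is", [
        CompletionOutput(0, " Alice.", "stop"),
        CompletionOutput(1, " Bob and I", "length"),
    ]),
]
print(flatten_results(demo))
```

The same loop works on real outputs from llm.generate(), since the attribute names match.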

The chat() Method

For instruction-tuned or chat models, use the chat() method with message-based input:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
sampling_params = SamplingParams(temperature=0.7, top_p=0.9)

messages_list = [
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    [
        {"role": "user", "content": "Explain quantum computing in simple terms."},
    ],
]

outputs = llm.chat(messages_list, sampling_params)

for output in outputs:
    generated_text = output.outputs[0].text
    print(f"Response: {generated_text}")
```

The chat() method automatically applies the model’s chat template, ensuring proper formatting for instruction-following models.
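To make "applying a chat template" concrete: the model's tokenizer ships a template that serializes the message list into the exact prompt string the model was trained on. The toy function below is only an illustration using a ChatML-style layout (the format Qwen models use); real templates vary by model and are read from the tokenizer config, not hand-written like this:

```python
# Toy illustration of chat templating, NOT vLLM's implementation.
# Uses a ChatML-style layout similar to Qwen's.

def apply_chat_template(messages, add_generation_prompt=True):
    """Serialize role/content messages into a single prompt string."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>")
    if add_generation_prompt:
        # Open an assistant turn so the model generates the reply.
        parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]
print(apply_chat_template(messages))
```

Skipping this step (e.g. passing a raw question to generate() on a chat model) often produces rambling output, which is why chat() handles it for you.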

The encode() Method

For embedding models, use the encode() method to generate embeddings:

```python
from vllm import LLM

llm = LLM(model="BAAI/bge-large-en-v1.5", task="embed")

texts = [
    "Hello, world!",
    "How are you?",
    "Machine learning is fascinating.",
]

outputs = llm.encode(texts)

for output in outputs:
    embedding = output.outputs.embedding
    print(f"Embedding dimension: {len(embedding)}")
```

SamplingParams Configuration

The SamplingParams class controls text generation behavior:

```python
from vllm import SamplingParams

sampling_params = SamplingParams(
    n=1,                     # Number of sequences to generate per prompt
    temperature=0.8,         # Sampling temperature (0.0 = greedy)
    top_p=0.95,              # Nucleus sampling threshold
    top_k=50,                # Top-k sampling (0 = disabled)
    max_tokens=100,          # Maximum tokens to generate
    stop=[".", "!", "?"],    # Stop strings
    presence_penalty=0.0,    # Penalize tokens that have appeared
    frequency_penalty=0.0,   # Penalize tokens based on frequency
    repetition_penalty=1.0,  # Penalize repeated tokens
    logprobs=None,           # Number of log probabilities to return
)
```

Key Sampling Parameters

Temperature: Controls randomness in sampling

  • 0.0: Greedy decoding (deterministic)
  • 0.1-0.7: More focused and coherent
  • 0.8-1.0: Balanced creativity
  • >1.0: More random and diverse

Top-p (nucleus sampling): Cumulative probability threshold

  • 0.9-0.95: Good balance for most tasks
  • 1.0: Consider all tokens

Top-k: Limit sampling to top-k most likely tokens

  • 0 or -1: Disabled
  • 50-100: Common values for creative tasks

Max tokens: Maximum number of tokens to generate

  • Default: 16
  • Set higher for longer outputs

Stop sequences: Strings that stop generation when encountered

  • Example: stop=[".", "\n\n", "END"]
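The interaction of these parameters can be sketched in plain Python. This is a conceptual illustration of the standard definitions applied to toy logits, not vLLM's implementation (which operates on GPU tensors and renormalizes before sampling):

```python
import math

def softmax(logits):
    """Convert logits to probabilities, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_filter(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return the (token_index, probability) candidates left after filtering."""
    if temperature == 0.0:
        # Greedy decoding: only the single most likely token survives.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [(best, 1.0)]
    # Temperature scales the logits before softmax: <1 sharpens, >1 flattens.
    probs = softmax([x / temperature for x in logits])
    ranked = sorted(enumerate(probs), key=lambda p: p[1], reverse=True)
    if top_k > 0:
        # Top-k: keep only the k most likely tokens.
        ranked = ranked[:top_k]
    kept, cum = [], 0.0
    for idx, p in ranked:
        # Top-p: keep tokens until their cumulative probability reaches top_p.
        kept.append((idx, p))
        cum += p
        if cum >= top_p:
            break
    return kept

logits = [2.0, 1.0, 0.5, -1.0]
print(sample_filter(logits, temperature=0.0))           # greedy -> token 0 only
print(sample_filter(logits, temperature=1.0, top_k=2))  # two candidates survive
print(sample_filter(logits, temperature=1.0, top_p=0.5))
```

The surviving candidates would then be renormalized and sampled from; max_tokens and stop simply bound how many such sampling steps run.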

Multiple Outputs per Prompt

Generate multiple completions for each prompt by setting n > 1:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")

sampling_params = SamplingParams(
    n=3,  # Generate 3 completions per prompt
    temperature=0.9,
    max_tokens=50,
)

prompts = ["Once upon a time"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    for i, completion in enumerate(output.outputs):
        print(f"  Completion {i + 1}: {completion.text}")
```

Log Probabilities

Request log probabilities for generated tokens:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")

sampling_params = SamplingParams(
    temperature=0.8,
    logprobs=5,         # Return top 5 log probabilities per generated token
    prompt_logprobs=5,  # Return log probabilities for prompt tokens
)

outputs = llm.generate(["The capital of France is"], sampling_params)

for output in outputs:
    # Prompt log probabilities
    if output.prompt_logprobs:
        print("Prompt logprobs:", output.prompt_logprobs)
    # Generated token log probabilities
    for completion in output.outputs:
        if completion.logprobs:
            print("Token logprobs:", completion.logprobs)
```

Loading Models from ModelScope

By default, vLLM downloads models from Hugging Face. To use ModelScope instead:

```python
import os

os.environ["VLLM_USE_MODELSCOPE"] = "True"

from vllm import LLM

llm = LLM(model="qwen/Qwen-7B")
```

Distributed Inference

For large models that don’t fit on a single GPU, use tensor parallelism:

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,  # Use 4 GPUs
    gpu_memory_utilization=0.95,
)
```

Or pipeline parallelism:

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    pipeline_parallel_size=4,  # Use 4 pipeline stages
)
```

What’s Next

  • OpenAI-Compatible Server: Serve models over HTTP with an OpenAI-compatible API for multi-client access.
  • Engine Configuration: Control memory, concurrency, and model loading with EngineArgs.
  • Supported Models: See the full list of model families vLLM supports.