Last Updated: 3/19/2026
Quick Start
This guide will help you run your first inference with vLLM in under 5 minutes. We’ll cover both offline batch inference and the OpenAI-compatible server.
Prerequisites
Before starting, ensure you have:
- vLLM installed (see Installation)
- A Linux system with Python 3.10-3.13
- Access to a GPU (NVIDIA, AMD, or Google TPU)
Offline Batch Inference
The simplest way to use vLLM is through the LLM class for offline batch inference. This is ideal for processing multiple prompts without setting up a server.
Basic Example
Here’s a complete example that generates text for multiple prompts:
```python
from vllm import LLM, SamplingParams

# Define your prompts
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Configure sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Initialize the LLM
llm = LLM(model="facebook/opt-125m")

# Generate outputs
outputs = llm.generate(prompts, sampling_params)

# Print results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Understanding the Code
LLM Class: The LLM class is the main interface for running offline inference with vLLM. It initializes the engine and loads the model.
SamplingParams: This class specifies parameters for text generation:
- temperature: Controls randomness (0.0 = deterministic, higher = more random)
- top_p: Nucleus sampling threshold (0.0-1.0)
- top_k: Limits sampling to the top-k most likely tokens
- max_tokens: Maximum number of tokens to generate (default: 16)
- stop: Stop sequences that end generation
- presence_penalty: Penalizes tokens that have already appeared
- frequency_penalty: Penalizes tokens based on how often they have appeared
Model Loading: By default, vLLM downloads models from Hugging Face. The model is automatically cached locally after the first download.
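Downloads land in the standard Hugging Face cache (~/.cache/huggingface by default). If you want the cache elsewhere, the usual Hugging Face environment variable applies; this is a general Hugging Face setting rather than a vLLM flag, and the path below is just an example:

```bash
# Relocate the Hugging Face model cache (example path)
export HF_HOME=/data/hf-cache
```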
Using Chat Models
For instruction-tuned or chat models, use the llm.chat() method with properly formatted messages:
```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
sampling_params = SamplingParams(temperature=0.7, top_p=0.9)

messages_list = [
    [{"role": "user", "content": "What is the capital of France?"}],
    [{"role": "user", "content": "Explain quantum computing in simple terms."}],
]

outputs = llm.chat(messages_list, sampling_params)

for idx, output in enumerate(outputs):
    generated_text = output.outputs[0].text
    print(f"Response {idx + 1}: {generated_text}")
```

The chat() method automatically applies the model's chat template, ensuring correct formatting for instruction-following models.
Default Sampling Parameters
By default, vLLM uses sampling parameters recommended by the model creator from generation_config.json in the Hugging Face repository. This typically provides the best results without manual tuning.
To use vLLM’s default sampling parameters instead, set generation_config="vllm" when creating the LLM instance:
```python
llm = LLM(model="facebook/opt-125m", generation_config="vllm")
```

OpenAI-Compatible Server
vLLM provides an HTTP server that implements the OpenAI API, allowing you to use it as a drop-in replacement for OpenAI’s API.
Starting the Server
Start the server with a single command:
```bash
vllm serve Qwen/Qwen2.5-1.5B-Instruct
```

By default, the server listens on http://localhost:8000. You can customize the address:

```bash
vllm serve Qwen/Qwen2.5-1.5B-Instruct --host 0.0.0.0 --port 8080
```

Using the Completions API
Once the server is running, you can query it using curl:
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 50,
    "temperature": 0.7
  }'
```

Or use the OpenAI Python client:
```python
from openai import OpenAI

# Point to the vLLM server
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

completion = client.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    prompt="San Francisco is a",
    max_tokens=50,
    temperature=0.7,
)

print("Completion:", completion.choices[0].text)
```

Using the Chat Completions API
For chat-based interactions:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7
  }'
```

With the Python client:
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0.7,
)

print("Response:", chat_response.choices[0].message.content)
```

API Authentication
To enable API key authentication, pass the --api-key flag when starting the server:
```bash
vllm serve Qwen/Qwen2.5-1.5B-Instruct --api-key your-secret-key
```

You can specify multiple keys for key rotation:

```bash
vllm serve Qwen/Qwen2.5-1.5B-Instruct --api-key key1 --api-key key2
```

Alternatively, set the VLLM_API_KEY environment variable:

```bash
export VLLM_API_KEY=your-secret-key
vllm serve Qwen/Qwen2.5-1.5B-Instruct
```

Listing Available Models
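Once a key is configured, clients must send it as a standard bearer token in the Authorization header, exactly as with OpenAI's API. A minimal stdlib sketch that builds (but does not send) such a request, using the hypothetical key from the examples above:

```python
import urllib.request

# "your-secret-key" must match a key the server was started with
# (hypothetical value here, matching the examples above).
req = urllib.request.Request(
    "http://localhost:8000/v1/models",
    headers={"Authorization": "Bearer your-secret-key"},
)
# urllib.request.urlopen(req) would perform the call once the server is up
print(req.get_header("Authorization"))  # Bearer your-secret-key
```

With the OpenAI Python client, simply pass the key as api_key instead of "EMPTY".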
Query the models endpoint to see what’s available:
```bash
curl http://localhost:8000/v1/models
```

Loading Models from ModelScope
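The response follows the OpenAI model-list schema: an object with a data array of model entries. A short parsing sketch against a hand-written sample payload (the payload below is illustrative, not a captured server response, and real entries carry additional fields):

```python
import json

# Illustrative payload in the OpenAI model-list shape
sample = '{"object": "list", "data": [{"id": "Qwen/Qwen2.5-1.5B-Instruct", "object": "model"}]}'

# Collect the model IDs you can pass as the "model" field in requests
model_ids = [entry["id"] for entry in json.loads(sample)["data"]]
print(model_ids)  # ['Qwen/Qwen2.5-1.5B-Instruct']
```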
By default, vLLM downloads models from Hugging Face. To use ModelScope instead, set the VLLM_USE_MODELSCOPE environment variable:
```bash
export VLLM_USE_MODELSCOPE=True
```

Then initialize the LLM or start the server as usual.
Next Steps
Now that you’ve run your first inference, explore these topics:
- OpenAI-Compatible Server: Learn about all supported APIs, chat templates, authentication, and extra parameters for the HTTP server.
- Offline Inference: Deep dive into the LLM class: generate(), chat(), encode(), and all SamplingParams options.
- Engine Configuration: Tune memory, concurrency, and model loading with EngineArgs.