Last Updated: 3/19/2026
OpenAI-Compatible Server
vLLM provides an HTTP server that implements the OpenAI API specification, making it a drop-in replacement for OpenAI’s API. This allows you to use existing OpenAI client libraries and tools with your self-hosted models.
Starting the Server
Start the server using the vllm serve command:
```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct
```

By default, the server starts on http://localhost:8000. You can customize the host and port:
```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
    --host 0.0.0.0 \
    --port 8080
```

Common Server Options
Model Configuration:
- --dtype: Data type for model weights (auto, float16, bfloat16, float32)
- --max-model-len: Maximum sequence length the model can handle
- --gpu-memory-utilization: Fraction of GPU memory to use (default: 0.9)
Server Settings:
- --host: Server host address (default: localhost)
- --port: Server port (default: 8000)
- --api-key: API key(s) for authentication (can be specified multiple times)
Example with options:
```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
    --dtype auto \
    --api-key token-abc123 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.85
```

API Authentication
Enable API key authentication by passing the --api-key flag:
```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --api-key your-secret-key
```

You can specify multiple keys for key rotation:
```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
    --api-key key1 \
    --api-key key2
```

Alternatively, use the VLLM_API_KEY environment variable:
```bash
export VLLM_API_KEY=your-secret-key
vllm serve NousResearch/Meta-Llama-3-8B-Instruct
```

Supported APIs
vLLM implements the following OpenAI API endpoints:
- Completions API (/v1/completions): Text completion for base models
- Chat Completions API (/v1/chat/completions): Chat-based interactions for instruction-tuned models
- Embeddings API (/v1/embeddings): Generate embeddings with embedding models
- Models API (/v1/models): List available models
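All of these endpoints accept standard OpenAI-style HTTP requests. When --api-key is enabled, each request must also carry the key as a Bearer token; the OpenAI client does this automatically through its api_key argument, while a raw HTTP client builds the header itself. A minimal sketch of assembling such a request (nothing is sent to a live server here; the helper name is illustrative):

```python
import json

def build_request(api_key: str, endpoint: str, payload: dict) -> dict:
    """Assemble the pieces of an OpenAI-style HTTP request for a vLLM server.

    Returns the URL, headers, and JSON body; actually sending it (e.g. with
    urllib or requests) is omitted so the sketch stays self-contained.
    """
    return {
        "url": f"http://localhost:8000{endpoint}",
        "headers": {
            "Content-Type": "application/json",
            # How an --api-key value is presented to the server:
            "Authorization": f"Bearer {api_key}",
        },
        "body": json.dumps(payload),
    }

req = build_request(
    "token-abc123",
    "/v1/chat/completions",
    {
        "model": "NousResearch/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(req["headers"]["Authorization"])  # Bearer token-abc123
```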
Using the Completions API
The Completions API is designed for base language models that generate text continuations.
With curl
```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "NousResearch/Meta-Llama-3-8B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 50,
        "temperature": 0.7
    }'
```

With OpenAI Python Client
```python
from openai import OpenAI

# Point the client at the vLLM server
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="San Francisco is a",
    max_tokens=50,
    temperature=0.7,
)
print("Completion:", completion.choices[0].text)
```

Streaming Completions
Enable streaming to receive tokens as they are generated:
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

stream = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="Write a short story about a robot:",
    max_tokens=200,
    temperature=0.8,
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].text:
        print(chunk.choices[0].text, end="", flush=True)
```

Using the Chat Completions API
The Chat Completions API is designed for instruction-tuned and chat models that follow a conversational format.
With curl
```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "NousResearch/Meta-Llama-3-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "temperature": 0.7
    }'
```

With OpenAI Python Client
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

chat_response = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0.7,
)
print("Response:", chat_response.choices[0].message.content)
```

Streaming Chat Completions
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

stream = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke about programming."},
    ],
    temperature=0.8,
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Chat Templates
For the Chat Completions API to work, the model must include a chat template in its tokenizer configuration. This template defines how messages are formatted for the model.
If a model doesn’t provide a chat template, you can specify one manually:
```bash
vllm serve <model> --chat-template ./path-to-chat-template.jinja
```

The vLLM community provides chat templates for popular models in the examples/ directory of the repository.
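To make the template format concrete, here is an illustrative ChatML-style template rendered with the jinja2 library. This is not the official template for any particular model; it is just a sketch of the structure a .jinja chat-template file contains:

```python
from jinja2 import Template

# Illustrative ChatML-style template (an assumption for this sketch, not the
# template that ships with any specific model); a real --chat-template file
# holds Jinja source like this.
CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "<|im_start|>{{ message.role }}\n{{ message.content }}<|im_end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)

prompt = Template(CHAT_TEMPLATE).render(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    add_generation_prompt=True,
)
print(prompt)
```

The add_generation_prompt flag appends the opening tokens of an assistant turn, which is what cues the model to start generating its reply.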
Using the Embeddings API
For embedding models, use the Embeddings API:
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input=["Hello, world!", "How are you?"],
)
print("Embedding:", response.data[0].embedding)
```

Listing Available Models
Query the models endpoint to see what’s available:
```bash
curl http://localhost:8000/v1/models
```

Or with the Python client:
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

models = client.models.list()
for model in models.data:
    print(model.id)
```

Extra Parameters
vLLM supports parameters beyond the OpenAI API specification. Pass them using the extra_body parameter:
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"},
    ],
    extra_body={
        "top_k": 50,
        "repetition_penalty": 1.1,
    },
)
```

Common extra parameters include:
- top_k: Limit sampling to the top-k tokens
- repetition_penalty: Penalize repeated tokens
- min_p: Minimum probability threshold for sampling
- structured_outputs: Enable structured output generation
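These knobs are easiest to understand as filters on the next-token probability distribution. Below is a pure-Python sketch (not vLLM's actual implementation) of how top_k and min_p restrict the candidate set; the min_p variant follows the common convention of scaling the threshold by the most likely token's probability:

```python
def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep only the k most probable tokens (top_k)."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(kept)

def min_p_filter(probs: dict[str, float], min_p: float) -> dict[str, float]:
    """Keep tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs.values())
    return {tok: p for tok, p in probs.items() if p >= threshold}

probs = {"the": 0.5, "a": 0.3, "zebra": 0.05, "of": 0.15}
print(top_k_filter(probs, 2))    # {'the': 0.5, 'a': 0.3}
print(min_p_filter(probs, 0.5))  # keeps tokens with p >= 0.25: {'the': 0.5, 'a': 0.3}
```

Unlike top_k's fixed cutoff, min_p adapts to the shape of the distribution: when the model is confident, few tokens survive; when it is uncertain, more do.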
Default Sampling Parameters
By default, vLLM applies generation_config.json from the Hugging Face model repository if it exists. This means default sampling parameters may be overridden by model creator recommendations.
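For illustration, a model repository's generation_config.json might carry creator-recommended sampling defaults like the following (the values here are hypothetical, not taken from any specific model):

```json
{
  "do_sample": true,
  "temperature": 0.6,
  "top_p": 0.9
}
```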
To disable this behavior and use vLLM’s defaults:
```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --generation-config vllm
```

Docker Deployment
vLLM provides official Docker images for easy deployment:
```bash
docker pull vllm/vllm-openai:latest

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model NousResearch/Meta-Llama-3-8B-Instruct
```

What’s Next
- Offline Inference: Use the LLM class to run batch inference directly in Python without starting a server.
- Engine Configuration: Tune memory utilization, concurrency limits, and model loading options with EngineArgs.
- LoRA Adapters: Serve multiple fine-tuned LoRA adapters on top of a base model.