
Last Updated: 3/19/2026


OpenAI-Compatible Server

vLLM provides an HTTP server that implements the OpenAI API specification, making it a drop-in replacement for OpenAI’s API. This allows you to use existing OpenAI client libraries and tools with your self-hosted models.

Starting the Server

Start the server using the vllm serve command:

vllm serve NousResearch/Meta-Llama-3-8B-Instruct

By default, the server starts on http://localhost:8000. You can customize the host and port:

vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
    --host 0.0.0.0 \
    --port 8080

Common Server Options

Model Configuration:

  • --dtype: Data type for model weights (auto, float16, bfloat16, float32)
  • --max-model-len: Maximum sequence length the model can handle
  • --gpu-memory-utilization: Fraction of GPU memory to use (default: 0.9)

Server Settings:

  • --host: Server host address (default: localhost)
  • --port: Server port (default: 8000)
  • --api-key: API key(s) for authentication (can specify multiple times)

Example with options:

vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
    --dtype auto \
    --api-key token-abc123 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.85

API Authentication

Enable API key authentication by passing the --api-key flag:

vllm serve NousResearch/Meta-Llama-3-8B-Instruct --api-key your-secret-key

You can specify multiple keys for key rotation:

vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
    --api-key key1 \
    --api-key key2

Alternatively, use the VLLM_API_KEY environment variable:

export VLLM_API_KEY=your-secret-key
vllm serve NousResearch/Meta-Llama-3-8B-Instruct
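With authentication enabled, clients must send the key as a Bearer token in the Authorization header. For example, assuming the server above is running with your-secret-key:

```shell
# Requests without a valid key are rejected with 401 Unauthorized
curl http://localhost:8000/v1/models \
    -H "Authorization: Bearer your-secret-key"
```

The OpenAI client libraries handle this automatically when you pass the key as api_key.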

Supported APIs

vLLM implements the following OpenAI API endpoints:

  • Completions API (/v1/completions): Text completion for base models
  • Chat Completions API (/v1/chat/completions): Chat-based interactions for instruction-tuned models
  • Embeddings API (/v1/embeddings): Generate embeddings with embedding models
  • Models API (/v1/models): List available models

Using the Completions API

The Completions API is designed for base language models that generate text continuations.

With curl

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "NousResearch/Meta-Llama-3-8B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 50,
        "temperature": 0.7
    }'

With OpenAI Python Client

from openai import OpenAI

# Point to the vLLM server
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="San Francisco is a",
    max_tokens=50,
    temperature=0.7,
)

print("Completion:", completion.choices[0].text)

Streaming Completions

Enable streaming to receive tokens as they are generated:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

stream = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="Write a short story about a robot:",
    max_tokens=200,
    temperature=0.8,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].text:
        print(chunk.choices[0].text, end="", flush=True)

Using the Chat Completions API

The Chat Completions API is designed for instruction-tuned and chat models that follow a conversational format.

With curl

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "NousResearch/Meta-Llama-3-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "temperature": 0.7
    }'

With OpenAI Python Client

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

chat_response = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0.7,
)

print("Response:", chat_response.choices[0].message.content)

Streaming Chat Completions

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

stream = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke about programming."},
    ],
    temperature=0.8,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Chat Templates

For the Chat Completions API to work, the model must include a chat template in its tokenizer configuration. This template defines how messages are formatted for the model.

If a model doesn’t provide a chat template, you can specify one manually:

vllm serve <model> --chat-template ./path-to-chat-template.jinja

The vLLM community provides chat templates for popular models in the examples/ directory of the repository.
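Chat templates are written in Jinja and receive the messages list as input. As a rough illustration only (this is a made-up format, not the template for any particular model), a minimal template might look like:

```jinja
{%- for message in messages -%}
<|{{ message.role }}|>
{{ message.content }}
{%- endfor -%}
<|assistant|>
```

Real templates must match the exact control tokens the model was trained with, so prefer the template shipped with the model or one from the vLLM examples/ directory.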

Using the Embeddings API

For embedding models, use the Embeddings API:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input=["Hello, world!", "How are you?"],
)

print("Embedding:", response.data[0].embedding)
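The returned vectors are typically compared with cosine similarity. As a self-contained sketch (using short dummy vectors in place of real response.data[i].embedding values, which are much longer):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Stand-ins for two embedding vectors from the API
vec_a = [0.1, 0.3, 0.5]
vec_b = [0.2, 0.1, 0.4]
print(round(cosine_similarity(vec_a, vec_b), 4))  # prints 0.9221
```

Values close to 1.0 indicate semantically similar inputs; in practice you would apply this to the full-length vectors returned by the server.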

Listing Available Models

Query the models endpoint to see what’s available:

curl http://localhost:8000/v1/models

Or with the Python client:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

models = client.models.list()
for model in models.data:
    print(model.id)

Extra Parameters

vLLM supports parameters beyond the OpenAI API specification. Pass them using the extra_body parameter:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"},
    ],
    extra_body={
        "top_k": 50,
        "repetition_penalty": 1.1,
    },
)

Common extra parameters include:

  • top_k: Limit sampling to top-k tokens
  • repetition_penalty: Penalize repeated tokens
  • min_p: Minimum probability threshold for sampling
  • structured_outputs: Enable structured output generation
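When calling the server over raw HTTP rather than through a client library, these extra fields go directly in the JSON request body alongside the standard parameters (values here are arbitrary examples):

```shell
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "NousResearch/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "top_k": 50,
        "repetition_penalty": 1.1
    }'
```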

Default Sampling Parameters

By default, vLLM applies generation_config.json from the Hugging Face model repository if it exists. This means default sampling parameters may be overridden by model creator recommendations.

To disable this behavior and use vLLM’s defaults:

vllm serve NousResearch/Meta-Llama-3-8B-Instruct --generation-config vllm

Docker Deployment

vLLM provides official Docker images for easy deployment:

docker pull vllm/vllm-openai:latest

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model NousResearch/Meta-Llama-3-8B-Instruct
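Once the container is running, you can check that the server has finished loading the model by probing its health endpoint (this assumes the default port mapping above):

```shell
# Returns HTTP 200 once the server is ready to accept requests
curl -i http://localhost:8000/health
```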

What’s Next

  • Offline Inference: Use the LLM class to run batch inference directly in Python without starting a server.
  • Engine Configuration: Tune memory utilization, concurrency limits, and model loading options with EngineArgs.
  • LoRA Adapters: Serve multiple fine-tuned LoRA adapters on top of a base model.