
Last Updated: 3/19/2026


OpenAI-Compatible Server

vLLM provides an HTTP server that implements the OpenAI API specification, making it a drop-in replacement for OpenAI’s API. This allows you to use existing OpenAI client libraries and tools with your self-hosted models.

Starting the Server

Start the server using the vllm serve command:

vllm serve NousResearch/Meta-Llama-3-8B-Instruct

By default, the server starts on http://localhost:8000. You can customize the host and port:

vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
    --host 0.0.0.0 \
    --port 8080

Common Server Options

Model Configuration:

  • --dtype: Data type for model weights (auto, float16, bfloat16, float32)
  • --max-model-len: Maximum sequence length the model can handle
  • --gpu-memory-utilization: Fraction of GPU memory to use (default: 0.9)

Server Settings:

  • --host: Server host address (default: localhost)
  • --port: Server port (default: 8000)
  • --api-key: API key(s) for authentication (can specify multiple times)

Example with options:

vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
    --dtype auto \
    --api-key token-abc123 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.85

API Authentication

Enable API key authentication by passing the --api-key flag:

vllm serve NousResearch/Meta-Llama-3-8B-Instruct --api-key your-secret-key

You can specify multiple keys for key rotation:

vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
    --api-key key1 \
    --api-key key2

Alternatively, use the VLLM_API_KEY environment variable:

export VLLM_API_KEY=your-secret-key
vllm serve NousResearch/Meta-Llama-3-8B-Instruct
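With authentication enabled, clients must send the key as a Bearer token in the Authorization header. For example, assuming the server above is running with your-secret-key:

```shell
# Requests without a valid key are rejected with 401 Unauthorized
curl http://localhost:8000/v1/models \
    -H "Authorization: Bearer your-secret-key"
```

The OpenAI client libraries handle this automatically when you pass the key as api_key.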

Supported APIs

vLLM implements the following OpenAI API endpoints:

  • Completions API (/v1/completions): Text completion for base models
  • Chat Completions API (/v1/chat/completions): Chat-based interactions for instruction-tuned models
  • Embeddings API (/v1/embeddings): Generate embeddings with embedding models
  • Models API (/v1/models): List available models

Using the Completions API

The Completions API is designed for base language models that generate text continuations.

With curl

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "NousResearch/Meta-Llama-3-8B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 50,
        "temperature": 0.7
    }'

With OpenAI Python Client

from openai import OpenAI

# Point to the vLLM server
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

completion = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="San Francisco is a",
    max_tokens=50,
    temperature=0.7,
)

print("Completion:", completion.choices[0].text)

Streaming Completions

Enable streaming to receive tokens as they are generated:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

stream = client.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    prompt="Write a short story about a robot:",
    max_tokens=200,
    temperature=0.8,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].text:
        print(chunk.choices[0].text, end="", flush=True)

Using the Chat Completions API

The Chat Completions API is designed for instruction-tuned and chat models that follow a conversational format.

With curl

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "NousResearch/Meta-Llama-3-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "temperature": 0.7
    }'

With OpenAI Python Client

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

chat_response = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0.7,
)

print("Response:", chat_response.choices[0].message.content)

Streaming Chat Completions

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

stream = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke about programming."},
    ],
    temperature=0.8,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Chat Templates

For the Chat Completions API to work, the model must include a chat template in its tokenizer configuration. This template defines how messages are formatted for the model.

If a model doesn’t provide a chat template, you can specify one manually:

vllm serve <model> --chat-template ./path-to-chat-template.jinja

The vLLM community provides chat templates for popular models in the examples/ directory of the repository.
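Chat templates are written in Jinja and receive the messages list as input. As a rough illustration only (this is a made-up format, not the template for any particular model), a minimal template might look like:

```jinja
{%- for message in messages -%}
<|{{ message.role }}|>
{{ message.content }}
{%- endfor -%}
<|assistant|>
```

Real templates must match the exact control tokens the model was trained with, so prefer the template shipped with the model or one from the vLLM examples/ directory.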

Using the Embeddings API

For embedding models, use the Embeddings API:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input=["Hello, world!", "How are you?"],
)

print("Embedding:", response.data[0].embedding)
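The returned vectors are typically compared with cosine similarity. As a self-contained sketch (using short dummy vectors in place of real response.data[i].embedding values, which are much longer):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Stand-ins for two embedding vectors from the API
vec_a = [0.1, 0.3, 0.5]
vec_b = [0.2, 0.1, 0.4]
print(round(cosine_similarity(vec_a, vec_b), 4))  # prints 0.9221
```

Values close to 1.0 indicate semantically similar inputs; in practice you would apply this to the full-length vectors returned by the server.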

Listing Available Models

Query the models endpoint to see what’s available:

curl http://localhost:8000/v1/models

Or with the Python client:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

models = client.models.list()
for model in models.data:
    print(model.id)

Extra Parameters

vLLM supports parameters beyond the OpenAI API specification. Pass them using the extra_body parameter:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"},
    ],
    extra_body={
        "top_k": 50,
        "repetition_penalty": 1.1,
    },
)

Common extra parameters include:

  • top_k: Limit sampling to top-k tokens
  • repetition_penalty: Penalize repeated tokens
  • min_p: Minimum probability threshold for sampling
  • structured_outputs: Enable structured output generation
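When calling the server over raw HTTP rather than through a client library, these extra fields go directly in the JSON request body alongside the standard parameters (values here are arbitrary examples):

```shell
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "NousResearch/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "top_k": 50,
        "repetition_penalty": 1.1
    }'
```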

Default Sampling Parameters

By default, vLLM applies generation_config.json from the Hugging Face model repository if it exists. This means default sampling parameters may be overridden by model creator recommendations.

To disable this behavior and use vLLM’s defaults:

vllm serve NousResearch/Meta-Llama-3-8B-Instruct --generation-config vllm

Docker Deployment

vLLM provides official Docker images for easy deployment:

docker pull vllm/vllm-openai:latest

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model NousResearch/Meta-Llama-3-8B-Instruct
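Once the container is running, you can check that the server has finished loading the model by probing its health endpoint (this assumes the default port mapping above):

```shell
# Returns HTTP 200 once the server is ready to accept requests
curl -i http://localhost:8000/health
```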

What’s Next

  • Offline Inference: Use the LLM class to run batch inference directly in Python without starting a server.
  • Engine Configuration: Tune memory utilization, concurrency limits, and model loading options with EngineArgs.
  • LoRA Adapters: Serve multiple fine-tuned LoRA adapters on top of a base model.