
Last Updated: 3/19/2026


Quick Start

This guide will help you run your first inference with vLLM in under 5 minutes. We’ll cover both offline batch inference and the OpenAI-compatible server.

Prerequisites

Before starting, ensure you have:

  • vLLM installed (see Installation)
  • A Linux system with Python 3.10-3.13
  • Access to a GPU (NVIDIA, AMD, or Google TPU)

Offline Batch Inference

The simplest way to use vLLM is through the LLM class for offline batch inference. This is ideal for processing multiple prompts without setting up a server.

Basic Example

Here’s a complete example that generates text for multiple prompts:

from vllm import LLM, SamplingParams

# Define your prompts
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Configure sampling parameters
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Initialize the LLM
llm = LLM(model="facebook/opt-125m")

# Generate outputs
outputs = llm.generate(prompts, sampling_params)

# Print results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Understanding the Code

LLM Class: The LLM class is the main interface for running offline inference with vLLM. It initializes the engine and loads the model.

SamplingParams: This class specifies parameters for text generation:

  • temperature: Controls randomness (0.0 = deterministic, higher = more random)
  • top_p: Nucleus sampling threshold (0.0-1.0)
  • top_k: Limits sampling to top-k tokens
  • max_tokens: Maximum number of tokens to generate (default: 16)
  • stop: Stop sequences that end generation
  • presence_penalty: Penalizes tokens that have appeared
  • frequency_penalty: Penalizes tokens based on frequency

Model Loading: By default, vLLM downloads models from Hugging Face. The model is automatically cached locally after the first download.
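
If you want the cache somewhere other than the default (~/.cache/huggingface), one option is to point HF_HOME at another directory before launching vLLM. This is standard Hugging Face caching behavior rather than a vLLM-specific flag:

```shell
# Relocate the Hugging Face cache used for model downloads
export HF_HOME=/path/to/large/disk/huggingface
```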

Using Chat Models

For instruction-tuned or chat models, use the llm.chat() method with properly formatted messages:

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
sampling_params = SamplingParams(temperature=0.7, top_p=0.9)

messages_list = [
    [{"role": "user", "content": "What is the capital of France?"}],
    [{"role": "user", "content": "Explain quantum computing in simple terms."}],
]

outputs = llm.chat(messages_list, sampling_params)

for idx, output in enumerate(outputs):
    generated_text = output.outputs[0].text
    print(f"Response {idx + 1}: {generated_text}")

The chat() method automatically applies the model’s chat template, ensuring correct formatting for instruction-following models.
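
Conceptually, a chat template turns the structured message list into a single prompt string with role markers. The ChatML-style sketch below illustrates the idea; the exact markers are model-specific (defined in the model's tokenizer config and applied for you by chat()), so this helper is ours, not vLLM's:

```python
def render_chatml(messages):
    # Illustrative ChatML-style rendering of a message list.
    # Real templates live in the model's tokenizer config; llm.chat()
    # applies them automatically, so you never call this by hand.
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    parts.append("<|im_start|>assistant")
    return "\n".join(parts)

messages = [{"role": "user", "content": "What is the capital of France?"}]
print(render_chatml(messages))
```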

Default Sampling Parameters

By default, vLLM uses sampling parameters recommended by the model creator from generation_config.json in the Hugging Face repository. This typically provides the best results without manual tuning.

To use vLLM’s default sampling parameters instead, set generation_config="vllm" when creating the LLM instance:

llm = LLM(model="facebook/opt-125m", generation_config="vllm")

OpenAI-Compatible Server

vLLM provides an HTTP server that implements the OpenAI API, allowing you to use it as a drop-in replacement for OpenAI’s API.

Starting the Server

Start the server with a single command:

vllm serve Qwen/Qwen2.5-1.5B-Instruct

By default, the server listens on http://localhost:8000. You can customize the address:

vllm serve Qwen/Qwen2.5-1.5B-Instruct --host 0.0.0.0 --port 8080

Using the Completions API

Once the server is running, you can query it using curl:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 50,
    "temperature": 0.7
  }'

Or use the OpenAI Python client:

from openai import OpenAI

# Point to the vLLM server
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

completion = client.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    prompt="San Francisco is a",
    max_tokens=50,
    temperature=0.7,
)

print("Completion:", completion.choices[0].text)

Using the Chat Completions API

For chat-based interactions:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7
  }'

With the Python client:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0.7,
)

print("Response:", chat_response.choices[0].message.content)

API Authentication

To enable API key authentication, pass the --api-key flag when starting the server:

vllm serve Qwen/Qwen2.5-1.5B-Instruct --api-key your-secret-key
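
Clients then send the key as a Bearer token, following the OpenAI API convention. A minimal sketch of the headers a raw HTTP client would attach (the helper name is ours, not part of vLLM):

```python
def auth_headers(api_key: str) -> dict:
    # vLLM's OpenAI-compatible server expects the key in the
    # standard Authorization header, as a Bearer token.
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }

print(auth_headers("your-secret-key"))
```

With the OpenAI Python client, you get this for free: pass api_key="your-secret-key" instead of "EMPTY" when constructing the client.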

You can specify multiple keys for key rotation:

vllm serve Qwen/Qwen2.5-1.5B-Instruct --api-key key1 --api-key key2

Alternatively, set the VLLM_API_KEY environment variable:

export VLLM_API_KEY=your-secret-key
vllm serve Qwen/Qwen2.5-1.5B-Instruct

Listing Available Models

Query the models endpoint to see what’s available:

curl http://localhost:8000/v1/models
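
The endpoint returns an OpenAI-style list object. A sketch of extracting the model IDs from the response (the payload below is illustrative, not captured from a live server):

```python
import json

# Example /v1/models response shape (illustrative payload)
sample = '''{
  "object": "list",
  "data": [
    {"id": "Qwen/Qwen2.5-1.5B-Instruct", "object": "model"}
  ]
}'''

model_ids = [m["id"] for m in json.loads(sample)["data"]]
print(model_ids)  # ['Qwen/Qwen2.5-1.5B-Instruct']
```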

Loading Models from ModelScope

By default, vLLM downloads models from Hugging Face. To use ModelScope instead, set the VLLM_USE_MODELSCOPE environment variable:

export VLLM_USE_MODELSCOPE=True

Then initialize the LLM or start the server as usual.

Next Steps

Now that you’ve run your first inference, explore these topics:

  • OpenAI-Compatible Server: Learn about all supported APIs, chat templates, authentication, and extra parameters for the HTTP server.
  • Offline Inference: Deep dive into the LLM class — generate(), chat(), encode(), and all SamplingParams options.
  • Engine Configuration: Tune memory, concurrency, and model loading with EngineArgs.