AI & Machine Learning

Stream LLM Output in FastAPI: A 5-Step 2025 Tutorial

Learn how to stream LLM responses in real-time with FastAPI. This 5-step 2025 guide covers StreamingResponse, async generators, and frontend integration.


Adrian Novak

Senior Python & AI Engineer specializing in scalable, real-time machine learning applications.

7 min read

Introduction: Why Stream LLM Responses in 2025?

In the world of AI applications, user experience is king. Nobody enjoys staring at a loading spinner for 10-30 seconds while a Large Language Model (LLM) processes a complex prompt. The perceived latency can make even the most powerful AI feel slow and clunky. This is where streaming comes in.

By streaming the LLM's output token-by-token, you can create a dynamic, real-time "typing" effect, just like you see in leading platforms like ChatGPT. This dramatically improves the user experience, providing immediate feedback and making your application feel significantly more responsive and interactive.

This 2025 tutorial will guide you through the entire process of implementing LLM response streaming using FastAPI, the high-performance Python web framework. We'll cover everything from project setup to creating an asynchronous generator and consuming the stream on a simple frontend. By the end, you'll have a robust, production-ready pattern for building next-generation AI interfaces.

Step 1: Setting Up Your FastAPI Project

First, let's get our development environment ready. We'll need a clean project structure, a virtual environment, and a few essential Python libraries.

Prerequisites

  • Python 3.9+ installed.
  • An OpenAI API key (or access to another streaming-capable LLM API).

Environment and Dependencies

1. Create a project directory and a virtual environment:

mkdir fastapi_llm_stream
cd fastapi_llm_stream
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`

2. Install the necessary libraries: We need FastAPI for the server, Uvicorn to run it, and the OpenAI library to interact with the LLM.

pip install fastapi "uvicorn[standard]" openai python-dotenv

3. Set up your API Key: Create a file named .env in your project root and add your OpenAI API key.

# .env file
OPENAI_API_KEY="your-super-secret-key-here"

4. Create your main application file: Create a file named main.py. This is where our FastAPI application will live.

# main.py
from fastapi import FastAPI

app = FastAPI(
    title="LLM Streaming API",
    description="An API for streaming LLM responses.",
    version="1.0.0",
)

@app.get("/")
def read_root():
    return {"status": "API is running"}

Step 2: Understanding FastAPI's StreamingResponse

Before we write the core logic, it's crucial to understand how FastAPI handles streaming. The magic lies in the StreamingResponse class. Unlike a standard JSONResponse, which waits for the entire payload to be ready before sending anything, StreamingResponse sends data in chunks as each one becomes available.

This is achieved using an iterator or an asynchronous generator. FastAPI iterates over your generator, sending each yielded chunk to the client immediately. For text streaming to a browser, the most common and effective format is Server-Sent Events (SSE), which we'll use by setting the `media_type` to `text/event-stream`.

StreamingResponse vs. JSONResponse

| Feature | StreamingResponse | JSONResponse |
|---|---|---|
| Data Transfer | Sends data in chunks as it's generated. | Sends the entire payload in a single response. |
| Perceived Latency | Very low (Time to First Byte is fast). | High (waits for the full response to be computed). |
| Use Case | LLM outputs, live data feeds, large file downloads. | Standard API requests, structured data, configuration. |
| Client Implementation | Requires an event listener (e.g., `EventSource`). | Simple `fetch` and `.json()` call. |
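
To see these mechanics in isolation before any LLM is involved, here is a minimal, self-contained sketch of a StreamingResponse fed by a plain async generator. It is illustrative only and separate from the tutorial app; the endpoint path and half-second delay are arbitrary choices.

# minimal_stream.py -- illustrative sketch, separate from the tutorial's main.py
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def number_stream():
    """Yield a few SSE-formatted chunks with a short pause between them."""
    for i in range(5):
        yield f"data: chunk {i}\n\n"
        await asyncio.sleep(0.5)  # Simulate work between chunks

@app.get("/demo/stream")
async def demo_stream():
    # Each yielded chunk is flushed to the client as soon as it's produced
    return StreamingResponse(number_stream(), media_type="text/event-stream")

Run this with Uvicorn and request the endpoint with `curl -N` (which disables output buffering), and you'll see the chunks arrive half a second apart instead of all at once.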

Step 3: Creating the Asynchronous LLM Generator

This is the heart of our streaming logic. We'll create an `async` function that acts as a generator. This function will call the OpenAI API, receive a stream of response chunks, and `yield` each one as it arrives.

Let's add this to our main.py file.

# main.py (add these imports and function)
import os
from openai import AsyncOpenAI
from dotenv import load_dotenv
from typing import AsyncGenerator

load_dotenv()

# Initialize the AsyncOpenAI client
client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))

async def stream_llm_response(prompt: str) -> AsyncGenerator[str, None]:
    """
    Creates an async generator to stream responses from the OpenAI API.
    Formats each chunk for Server-Sent Events (SSE).
    """
    try:
        stream = await client.chat.completions.create(
            model="gpt-4-turbo", # Or your preferred model
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        async for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                # SSE format: data: {content}\n\n
                yield f"data: {content}\n\n"
    except Exception as e:
        print(f"An error occurred: {e}")
        # Yield a formatted error message for the client
        yield f"data: [ERROR] {str(e)}\n\n"

Code Breakdown

  • `async def` and `AsyncGenerator`: We define the function as asynchronous and use the `AsyncGenerator` type hint to signify it will yield values.
  • `AsyncOpenAI`: We use the asynchronous version of the OpenAI client, which is essential for non-blocking I/O in a FastAPI application.
  • `stream=True`: This is the critical parameter that tells the OpenAI API to send back a stream of chunks instead of a single response.
  • `async for chunk in stream`: We iterate over the stream asynchronously.
  • `yield f"data: {content}\n\n"`: For each piece of content we receive, we format it according to the Server-Sent Events (SSE) specification. The `data: ` prefix is required, and each message must end with two newlines (`\n\n`). This is how SSE clients, including the browser's `EventSource` API and the fetch-based reader we'll build in Step 5, recognize where one message ends; see the note just below for handling multi-line content.
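
One caveat: the SSE spec requires that a payload containing newlines be split into multiple `data:` lines within a single event. The tutorial's generator yields tokens as-is, which is fine for typical single-line tokens, but if you want to be strict about multi-line content you could route everything through a small helper like the hypothetical sse_format below (an optional refinement, not part of the tutorial's main.py).

# Optional helper (an assumption, not part of the tutorial's main.py):
# formats arbitrary text as one SSE event, prefixing every line with "data: ".
def sse_format(content: str) -> str:
    lines = content.split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"

# sse_format("Hello\nworld") produces "data: Hello\ndata: world\n\n"

Inside the generator, you would then yield sse_format(content) instead of building the string inline.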

Step 4: Building the Streaming API Endpoint

Now that we have our generator, let's create the FastAPI endpoint that uses it. We'll define a Pydantic model for our request body to ensure type safety and validation.

Add the following to your main.py:

# main.py (add these imports and endpoint)
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

# ... (existing code) ...

class ChatRequest(BaseModel):
    prompt: str

@app.post("/api/chat/stream")
async def chat_endpoint(request: ChatRequest):
    """
    The main endpoint to receive a prompt and stream back the LLM response.
    """
    generator = stream_llm_response(request.prompt)
    return StreamingResponse(generator, media_type="text/event-stream")

Here, we simply take the `prompt` from the request body, pass it to our `stream_llm_response` generator, and wrap the resulting generator in a `StreamingResponse`. Setting `media_type="text/event-stream"` is vital for the browser to correctly interpret the response as an SSE stream.
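
Before building a frontend, you can sanity-check the endpoint from Python. The sketch below assumes the server is already running locally on port 8000 and uses httpx, an extra dependency (`pip install httpx`), to print the streamed tokens as they arrive.

# test_stream.py -- quick manual check of the streaming endpoint
# (assumes the app is running at http://127.0.0.1:8000 and httpx is installed)
import httpx

payload = {"prompt": "Write a haiku about streaming."}
with httpx.stream("POST", "http://127.0.0.1:8000/api/chat/stream",
                  json=payload, timeout=60.0) as response:
    for line in response.iter_lines():
        if line.startswith("data: "):
            # Print each SSE payload as soon as it arrives
            print(line[len("data: "):], end="", flush=True)
print()

If everything is wired up correctly, the haiku prints token by token rather than appearing all at once.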

Step 5: Consuming the Stream on the Frontend

A backend stream is useless without a frontend to consume it. Let's create a basic HTML and JavaScript file to interact with our API. The browser's `EventSource` API is purpose-built for consuming SSE streams, but it only supports GET requests; since our endpoint accepts a POST body, we'll read the stream with `fetch` and a `ReadableStream` reader instead.

Create a file named index.html in your project root:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>FastAPI LLM Stream</title>
    <style>
        body { font-family: sans-serif; max-width: 800px; margin: auto; padding: 20px; }
        #response { border: 1px solid #ccc; padding: 10px; min-height: 100px; white-space: pre-wrap; }
    </style>
</head>
<body>
    <h1>LLM Streaming with FastAPI</h1>
    <form id="prompt-form">
        <textarea id="prompt-input" rows="4" cols="50" placeholder="Enter your prompt..."></textarea>
        <br>
        <button type="submit">Send</button>
    </form>
    <h3>LLM Response:</h3>
    <div id="response"></div>

    <script>
        const form = document.getElementById('prompt-form');
        const input = document.getElementById('prompt-input');
        const responseDiv = document.getElementById('response');

        form.addEventListener('submit', async (e) => {
            e.preventDefault();
            const prompt = input.value;
            responseDiv.innerHTML = ''; // Clear previous response

            // EventSource only supports GET requests, so we POST the prompt with
            // fetch and read the SSE stream manually via a ReadableStream reader.

            try {
                const response = await fetch('/api/chat/stream', {
                    method: 'POST',
                    headers: {
                        'Content-Type': 'application/json',
                    },
                    body: JSON.stringify({ prompt }),
                });

                const reader = response.body.getReader();
                const decoder = new TextDecoder();
                let buffer = ''; // Holds any partial SSE message split across chunks

                function processStream() {
                    reader.read().then(({ done, value }) => {
                        if (done) {
                            console.log('Stream complete');
                            return;
                        }

                        buffer += decoder.decode(value, { stream: true });
                        // SSE messages are in the format "data: {content}\n\n"
                        const messages = buffer.split('\n\n');
                        buffer = messages.pop(); // Keep the trailing partial message for the next read
                        messages.forEach(message => {
                            if (message.startsWith('data: ')) {
                                // Don't trim(): LLM tokens often begin with a meaningful space
                                const data = message.substring(6);
                                if (data.startsWith('[ERROR]')) {
                                    responseDiv.innerHTML += `<strong style="color:red;">${data}</strong>`;
                                } else {
                                    // Note: escape HTML here if the model output isn't trusted
                                    responseDiv.innerHTML += data;
                                }
                            }
                        });

                        processStream(); // Continue reading the stream
                    });
                }
                processStream();

            } catch (error) {
                console.error('Error fetching stream:', error);
                responseDiv.innerText = 'Error connecting to the server.';
            }
        });
    </script>
</body>
</html>

Note: The JavaScript uses `fetch` with `response.body.getReader()`, which provides more control than the basic `EventSource` API, especially for handling POST requests, which `EventSource` does not natively support. This is the more robust, modern approach.

Putting It All Together: Complete Code

To make it easy to run, you'll need a way to serve the `index.html` file. We can add a simple endpoint to our FastAPI app for this.

Here is the final, complete main.py:

# main.py
import os
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse, HTMLResponse
from pydantic import BaseModel
from openai import AsyncOpenAI
from dotenv import load_dotenv
from typing import AsyncGenerator

load_dotenv()

app = FastAPI(
    title="LLM Streaming API",
    description="An API for streaming LLM responses.",
    version="1.0.0",
)

client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))

class ChatRequest(BaseModel):
    prompt: str

async def stream_llm_response(prompt: str) -> AsyncGenerator[str, None]:
    try:
        stream = await client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        async for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                yield f"data: {content}\n\n"
    except Exception as e:
        print(f"An error occurred: {e}")
        yield f"data: [ERROR] {str(e)}\n\n"

@app.post("/api/chat/stream")
async def chat_endpoint(request: ChatRequest):
    generator = stream_llm_response(request.prompt)
    return StreamingResponse(generator, media_type="text/event-stream")

@app.get("/", response_class=HTMLResponse)
async def read_index():
    with open("index.html") as f:
        return HTMLResponse(content=f.read(), status_code=200)

Now, run your application from the terminal:

uvicorn main:app --reload

Open your browser and navigate to http://127.0.0.1:8000. Type a prompt into the text area, hit send, and watch the LLM response stream in real-time!

Conclusion: Your Next Steps in Real-Time AI

You've successfully built a fully functional, real-time LLM streaming application with FastAPI! This powerful technique is essential for creating modern, user-friendly AI products. By leveraging `StreamingResponse`, asynchronous generators, and Server-Sent Events, you can eliminate perceived latency and deliver a superior interactive experience.

From here, you can explore more advanced topics like streaming structured JSON objects, adding authentication to your endpoints, or handling client-side disconnects and backpressure more gracefully. The foundation you've built today is the perfect starting point for developing sophisticated, production-grade AI services.
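
As a pointer for that last item, FastAPI (via Starlette) exposes `request.is_disconnected()`, which lets you stop consuming the LLM stream once the client goes away. A minimal sketch of that pattern, adapted from the endpoint above (the -v2 path is just for illustration), might look like this:

# Sketch: stop streaming when the client disconnects
# (adapted from the tutorial's endpoint; error handling omitted for brevity)
from fastapi import Request
from fastapi.responses import StreamingResponse

@app.post("/api/chat/stream-v2")
async def chat_endpoint_v2(chat: ChatRequest, request: Request):
    async def guarded_stream():
        async for chunk in stream_llm_response(chat.prompt):
            if await request.is_disconnected():
                break  # Client went away; stop consuming the LLM stream
            yield chunk
    return StreamingResponse(guarded_stream(), media_type="text/event-stream")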