AI responses take time to generate. Waiting for a complete response creates a poor user experience—users stare at a spinner for seconds. Streaming shows output token by token as it's generated, making responses feel instant and engaging.
## Why stream?
| Approach | User experience |
|---|---|
| Wait for completion | Spinner for 2-5 seconds, then the full response |
| Stream | First word appears in ~200ms, text flows naturally |
Streaming reduces perceived latency dramatically. Users start reading immediately.
## How streaming works
AI APIs generate tokens sequentially. With streaming:
- Client sends a request with `stream: true`
- Server starts generating and sends tokens as they're ready
- Each token arrives as a chunk in the response
- Client appends chunks to the display text
```
Client Request → Server → AI API
                    ↓
              Token 1 → Client (display)
              Token 2 → Client (append)
              Token 3 → Client (append)
              ...
              [DONE] → Client (complete)
```
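The client-side append step in this flow is plain string concatenation. As a minimal sketch (`accumulate` is a made-up helper, not part of any API):

```typescript
// Sketch of the client-side append loop: given the tokens a stream
// delivers, produce the display text after each chunk arrives.
function accumulate(chunks: string[]): string[] {
  const states: string[] = [];
  let text = '';
  for (const chunk of chunks) {
    text += chunk;     // append the new token
    states.push(text); // this is what the UI would show at this point
  }
  return states;
}
```

Each intermediate state is what the user sees, which is why the first word can appear long before the response is complete.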
## Basic streaming with fetch

```typescript
async function streamCompletion(prompt: string) {
  const response = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
  });

  const reader = response.body?.getReader();
  const decoder = new TextDecoder();
  if (!reader) throw new Error('No reader');

  let result = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // { stream: true } keeps multi-byte characters intact
    // when they're split across chunk boundaries
    const chunk = decoder.decode(value, { stream: true });
    result += chunk;
    // Update UI with each chunk
    updateDisplay(result);
  }
  return result;
}
```
## Next.js Route Handler
Create an API route that proxies and streams AI responses:
```typescript
// app/api/chat/route.ts
import { OpenAI } from 'openai';

const openai = new OpenAI();

export async function POST(request: Request) {
  const { prompt } = await request.json();

  const response = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: prompt }],
    stream: true,
  });

  // Wrap the token iterator in a ReadableStream the browser can consume
  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    async start(controller) {
      for await (const chunk of response) {
        const content = chunk.choices[0]?.delta?.content || '';
        controller.enqueue(encoder.encode(content));
      }
      controller.close();
    },
  });

  return new Response(stream, {
    headers: {
      'Content-Type': 'text/plain; charset=utf-8',
      'Transfer-Encoding': 'chunked',
    },
  });
}
```
## React component for streaming
```tsx
'use client';
import { useState, useRef } from 'react';

export function Chat() {
  const [input, setInput] = useState('');
  const [response, setResponse] = useState('');
  const [isStreaming, setIsStreaming] = useState(false);
  const abortControllerRef = useRef<AbortController | null>(null);

  async function handleSubmit(e: React.FormEvent) {
    e.preventDefault();
    if (!input.trim() || isStreaming) return;

    setResponse('');
    setIsStreaming(true);
    abortControllerRef.current = new AbortController();

    try {
      const res = await fetch('/api/chat', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ prompt: input }),
        signal: abortControllerRef.current.signal,
      });

      const reader = res.body?.getReader();
      const decoder = new TextDecoder();
      if (!reader) throw new Error('No reader');

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        const chunk = decoder.decode(value, { stream: true });
        setResponse((prev) => prev + chunk);
      }
    } catch (error) {
      if (error instanceof Error && error.name === 'AbortError') {
        console.log('Request cancelled');
      } else {
        console.error('Stream error:', error);
      }
    } finally {
      setIsStreaming(false);
    }
  }

  function handleCancel() {
    abortControllerRef.current?.abort();
  }

  return (
    <div>
      <form onSubmit={handleSubmit}>
        <input
          value={input}
          onChange={(e) => setInput(e.target.value)}
          placeholder="Ask something..."
        />
        <button type="submit" disabled={isStreaming}>
          {isStreaming ? 'Generating...' : 'Send'}
        </button>
        {isStreaming && (
          <button type="button" onClick={handleCancel}>
            Cancel
          </button>
        )}
      </form>
      <div className="whitespace-pre-wrap">{response}</div>
    </div>
  );
}
```
## Vercel AI SDK
The Vercel AI SDK simplifies streaming with built-in hooks:
```bash
npm install ai @ai-sdk/openai
```
```typescript
// app/api/chat/route.ts
import { openai } from '@ai-sdk/openai';
import { streamText } from 'ai';

export async function POST(request: Request) {
  const { messages } = await request.json();

  const result = streamText({
    model: openai('gpt-4'),
    messages,
  });

  return result.toDataStreamResponse();
}
```
```tsx
// components/Chat.tsx
'use client';
import { useChat } from 'ai/react';

export function Chat() {
  const { messages, input, handleInputChange, handleSubmit, isLoading } =
    useChat();

  return (
    <div>
      {messages.map((m) => (
        <div key={m.id}>
          <strong>{m.role}:</strong> {m.content}
        </div>
      ))}
      <form onSubmit={handleSubmit}>
        <input
          value={input}
          onChange={handleInputChange}
          placeholder="Say something..."
        />
        <button type="submit" disabled={isLoading}>
          Send
        </button>
      </form>
    </div>
  );
}
```
The SDK handles:
- Stream parsing
- Message state management
- Abort handling
- Error recovery
## Server-Sent Events format
Many AI APIs use SSE format:
```
data: {"choices":[{"delta":{"content":"Hello"}}]}

data: {"choices":[{"delta":{"content":" world"}}]}

data: [DONE]
```
Parse the SSE stream with an async generator:

```typescript
async function* parseSSE(response: Response): AsyncGenerator<string> {
  const reader = response.body?.getReader();
  const decoder = new TextDecoder();
  if (!reader) throw new Error('No reader');
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // Keep any incomplete trailing line in the buffer for the next chunk
    const lines = buffer.split('\n');
    buffer = lines.pop() || '';

    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const data = line.slice(6);
        if (data === '[DONE]') return;
        try {
          const parsed = JSON.parse(data);
          const content = parsed.choices[0]?.delta?.content;
          if (content) yield content;
        } catch {
          // Skip invalid JSON
        }
      }
    }
  }
}
```
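The per-line parsing can also be pulled out into a pure helper, which is easier to unit test than the full reader loop (`parseSSELine` is a hypothetical name, assuming the OpenAI-style delta payload shown above):

```typescript
// Parse one SSE line into its delta text, or null if it carries none.
// Assumes the OpenAI-style payload shape shown above.
function parseSSELine(line: string): string | null {
  if (!line.startsWith('data: ')) return null; // comments, blank lines, etc.
  const data = line.slice('data: '.length);
  if (data === '[DONE]') return null;          // end-of-stream sentinel
  try {
    const parsed = JSON.parse(data);
    return parsed.choices?.[0]?.delta?.content ?? null;
  } catch {
    return null;                               // skip invalid JSON
  }
}
```

The generator then reduces to buffering lines and yielding whatever this helper returns.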
## Auto-scroll while streaming
Keep the latest content visible:
```tsx
import { useEffect, useRef } from 'react';

function StreamingContent({ content }: { content: string }) {
  const containerRef = useRef<HTMLDivElement>(null);

  useEffect(() => {
    if (containerRef.current) {
      containerRef.current.scrollTop = containerRef.current.scrollHeight;
    }
  }, [content]);

  return (
    <div ref={containerRef} className="overflow-auto max-h-96">
      {content}
    </div>
  );
}
```
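One refinement worth considering: unconditional auto-scroll fights the user if they scroll up to reread earlier output. A common fix is to stick to the bottom only when the viewport is already near it (`isNearBottom` and the 40px threshold are arbitrary choices here, not from any library):

```typescript
// Decide whether to auto-scroll: only when the viewport is already
// within `thresholdPx` of the bottom of the scrollable content.
function isNearBottom(
  el: { scrollTop: number; clientHeight: number; scrollHeight: number },
  thresholdPx = 40
): boolean {
  return el.scrollHeight - (el.scrollTop + el.clientHeight) <= thresholdPx;
}
```

In the effect above, you would check this before setting `scrollTop`, so a user who has scrolled up stays where they are while the stream continues.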
## Rendering markdown while streaming
For markdown content, wait for complete blocks:
```tsx
import ReactMarkdown from 'react-markdown';

function StreamingMarkdown({ content }: { content: string }) {
  // Simple approach: re-render the markdown on every chunk.
  // Better: buffer incomplete code blocks to avoid flickering.
  return <ReactMarkdown>{content}</ReactMarkdown>;
}
```
For code blocks, consider buffering until the closing fence:
```typescript
function bufferCodeBlocks(content: string): string {
  // Matches an opening fence at the end of the content
  // that has no closing fence yet
  const incompleteBlock = /```[^`]*$/;
  if (incompleteBlock.test(content)) {
    // Hide the incomplete code block until its closing fence arrives
    return content.replace(incompleteBlock, '');
  }
  return content;
}
```
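An alternative to hiding the incomplete block entirely is to split the content into a stable part and a pending tail, so the stable part renders as markdown while the tail is held back. A sketch, assuming all code fences use triple backticks (`splitAtOpenFence` is a made-up helper):

```typescript
// Split streamed markdown at an unclosed fence: everything before it
// is safe to render; the tail is buffered until the closing fence arrives.
function splitAtOpenFence(content: string): { stable: string; pending: string } {
  const fenceCount = (content.match(/```/g) ?? []).length;
  if (fenceCount % 2 === 1) {
    // Odd number of fences: the last one opened a block that hasn't closed
    const idx = content.lastIndexOf('```');
    return { stable: content.slice(0, idx), pending: content.slice(idx) };
  }
  return { stable: content, pending: '' };
}
```

The pending tail can be shown as plain text or a placeholder, then merged back once the block closes.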
## Error handling
Handle network errors and API failures:
```typescript
try {
  const response = await fetch('/api/chat', { ... });

  if (!response.ok) {
    const error = await response.json();
    throw new Error(error.message || 'Request failed');
  }

  // Stream processing...
} catch (error) {
  if (error instanceof Error) {
    if (error.name === 'AbortError') {
      // User cancelled - not an error
      return;
    }
    setError(error.message);
  }
}
```
## Rate limiting and retries
```typescript
async function streamWithRetry(prompt: string, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await streamCompletion(prompt);
    } catch (error) {
      if (i === retries - 1) throw error;
      // Exponential backoff: 1s, 2s, 4s, ...
      await new Promise((r) => setTimeout(r, 1000 * Math.pow(2, i)));
    }
  }
}
```
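The backoff schedule used above can be made explicit as a small helper, which also gives a single place to add jitter later (`backoffDelays` is a made-up name; the 1-second base matches the code above):

```typescript
// Compute exponential backoff delays, in ms, for a given retry count.
// With the defaults this yields 1000, 2000, 4000, ... like the loop above.
function backoffDelays(retries: number, baseMs = 1000): number[] {
  return Array.from({ length: retries }, (_, i) => baseMs * 2 ** i);
}
```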
## Summary
Streaming AI responses improves perceived performance by showing output as it's generated. Use fetch with ReadableStream for basic streaming. The Vercel AI SDK simplifies implementation with useChat and useCompletion hooks. Handle SSE format for most AI APIs. Implement auto-scroll, markdown rendering, and proper error handling for a polished experience.
