
Part 8 — Token-by-Token: Why AI Generates Text One Word at a Time (And Why It Costs 4x More)

Mohamed Hamed

**The hidden tax of AI**

| | GPT-4o price |
|---|---|
| Input | $2.50 per 1M tokens |
| Output | $10.00 per 1M tokens (4x more) |

The reason? The AI writes very slowly on the inside — one token at a time.

Last article we saw the Transformer architecture. Today we watch it in action during live generation — and discover why the output side is 4x more expensive.

Here's something that surprises most developers when they first hear it: ChatGPT doesn't think out its answer in advance and then display it. It predicts one token. Then another. Then another. Each prediction uses the previous ones as context. It's not writing — it's recursively predicting.

Remember how the Transformer reads everything in parallel (previous article)? Generation flips that on its head — now it's forced to be sequential, because each new token depends on the last. And understanding this one fact changes how you design prompts, control API costs, build streaming UIs, and debug unexpected AI behavior.

Strip away all the complexity and a large language model does exactly one thing: given all the tokens it has seen so far, predict the single most likely next token.

📜 CONTEXT → 🤖 LLM → ✨ ONE TOKEN ↩ (recursive loop: the output is fed back in as the next input)

Think of it like predictive text on your phone — except instead of suggesting 3 words, it's choosing from 100,000+ possible tokens, and it does this thousands of times to build a complete response.

Let's trace through a real example. Prompt: "What are the best smart glasses?"

| Step | Context so far | Next token | Probability |
|---|---|---|---|
| 1 | "What are the best smart glasses?" + [START] | "Ray" | 35% ⭐ |
| 2 | "...glasses?" + "Ray" | "-" | 85% ⭐ |
| 3 | "...glasses?" + "Ray-" | "Ban" | 95% ⭐ |
| 4+ | "...Ray-Ban" + everything previous | "Meta" → "Ultra" → "because" → ... → [END] ✅ | — |

Final response (assembled from sequential predictions): "Ray-Ban Meta Ultra — lightweight, 48MP camera, translates 40 languages, full-day battery." Generated token by token — never computed all at once.

The formal name for this process is autoregressive generation — each output token becomes part of the input for the next prediction. (This is the same "next-token prediction" that the training loop from Article 4 taught the model to do — except now it's happening live during inference.)

This creates a critical asymmetry in how the model works:

| Response length | Generation steps | Implication |
|---|---|---|
| 10 tokens (~8 words) | 10 sequential predictions | Fast, cheap |
| 100 tokens (~75 words) | 100 sequential predictions | Moderate |
| 1,000 tokens (~750 words) | 1,000 sequential predictions | Slow, expensive |
| 4,000 tokens (a blog post) | 4,000 sequential predictions | Very slow, very expensive |

This is why output tokens cost 4x more than input tokens. Reading your 10,000-token prompt can be largely parallelized. But generating each output token requires a sequential forward pass through the full model — there's no way to batch or parallelize this without changing the output.

At each generation step, the model doesn't just know the one "right" answer. It produces a probability distribution over its entire vocabulary — every possible next token, each with a likelihood score.

Probability distribution after "What are the best smart...":

| Token | Probability |
|---|---|
| "Ray" | 35% |
| "Apple" | 20% |
| "Meta" | 15% |
| "currently" | 8% |
| ...everything else (100K+ tokens) | ~22% |

⚠️ The model doesn't always pick the highest-probability token — that's controlled by Temperature (a topic for another article).

This is the same softmax activation we saw inside the neuron (Article 3) and the Transformer block (Article 6) — here it converts raw logits over the full vocabulary into a probability distribution over what to say next.
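To make the loop concrete, here is a minimal sketch of autoregressive generation in plain Python/NumPy. Everything in it is illustrative: `fake_model` is a stand-in that returns random logits over a tiny made-up vocabulary (a real LLM would run a full Transformer forward pass over 100K+ tokens), but the shape of the loop is the real thing: context in, softmax over the vocabulary, pick one token, append it, repeat until [END].

```python
import numpy as np

# Toy vocabulary: a real model chooses among 100,000+ tokens
VOCAB = ["Ray", "-", "Ban", "Meta", "Ultra", "because", "[END]"]

def fake_model(context):
    """Stand-in for a real LLM: returns raw logits over the vocabulary.
    A real model would run a full Transformer forward pass here."""
    rng = np.random.default_rng(seed=len(context))  # deterministic per step, purely for illustration
    return rng.normal(size=len(VOCAB))

def softmax(logits):
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()

def generate(prompt_tokens, max_new_tokens=10):
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):                 # one forward pass per new token
        probs = softmax(fake_model(context))        # distribution over the whole vocabulary
        next_token = VOCAB[int(np.argmax(probs))]   # greedy pick (temperature-0 behaviour)
        if next_token == "[END]":                   # stop token ends the loop
            break
        context.append(next_token)                  # the new token becomes input for the next step
    return context

print(generate(["What", "are", "the", "best", "smart", "glasses", "?"]))
```

Real decoders usually sample from the distribution (temperature, top-p) rather than always taking the argmax; greedy selection is used here only to keep the sketch deterministic.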
The model selects one token, appends it to the context, and runs the entire prediction process again. This continues until it generates an [END] token or hits a maximum length.

Here's the obvious problem with autoregressive generation: if each new token required the model to re-read the entire context (your prompt plus all previous outputs) from scratch, computation time would grow quadratically. A 1,000-token response to a 10,000-token prompt would be impossibly slow. The solution is the KV Cache (Key-Value Cache).

❌ **Without KV Cache** — every new token requires reprocessing the entire context from scratch:

- Token 1: read all 10K input tokens
- Token 2: read all 10K input tokens again
- Token 3: read all 10K input tokens again
- ... (10,000x overhead per token) → slow and expensive 💸

⚡ **With KV Cache** — the attention Keys and Values for already-processed tokens are stored and reused:

- Input: compute K/V once, cache it
- Token 1: compute K/V only for the new token
- Token 2: compute K/V only for the new token
- ... (reuse cached K/Vs) → fast and smart 🚀

How it works technically: during the Transformer's attention computation, every token produces a Key (K) and Value (V) vector. These don't change for tokens that have already been processed. The KV Cache stores them in GPU memory, so each new generation step only needs to compute the K and V for the one new token (a toy sketch of this caching appears at the end of this section). This reuse is only possible because of the self-attention mechanism from the previous article — without Q/K/V, there would be nothing to cache.

This is also why reading input is cheaper than generating output — the entire input can be processed in one forward pass with full parallelization, while output tokens must be generated one at a time even with the cache.

Every generation step is a partial forward pass through the full Transformer stack:

1. The new token passes through Positional Encoding (Article 6) — it gets a position vector so the model knows it's token #347, not #1.
2. Multi-Head Self-Attention runs — but with the KV Cache, only the new token's Q is computed fresh; all previous K/V pairs are retrieved from the cache.
3. The result flows through the Feed-Forward layers (where the neurons from Article 3 live) — all 96 layers, stacked.
4. The final layer outputs a probability distribution via softmax over the 100K+ vocabulary — one token is selected, appended, and the loop repeats.

When building AI applications, two performance metrics dominate:

⚡ **TTFT (Time to First Token)** — how long before the user sees the first word of the response. Dominated by: input processing time; bigger prompts mean longer TTFT. Why it matters: users perceive TTFT as "responsiveness." A 3-second TTFT feels laggy even if generation speed is fast.

📊 **Throughput (tokens/second)** — how fast the model generates tokens after the first one appears. Dominated by: model size, hardware, and batch efficiency. Why it matters: for long responses, throughput determines total completion time. GPT-4o: ~100-150 tok/s. Gemini Flash: ~300+ tok/s.

⚡ Developer tip: TTFT is your user experience problem. Keep prompts lean and always use streaming.

Since output generation is 4x more expensive than input processing, how you instruct the model affects your bill more than how much data you send.

Scenario: 1,000 API calls per day on GPT-4o

- ❌ "Write a detailed response" — 500 output tokens: 500 tok × 1,000 calls = 500K tokens; 500K ÷ 1M × $10.00 = $5.00/day → $150/month
- ✅ "Be concise, 1-2 sentences" — 100 output tokens: 100 tok × 1,000 calls = 100K tokens; 100K ÷ 1M × $10.00 = $1.00/day → $30/month

Same task. Same quality. 5x cost difference. Just by controlling output length in your prompt.
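Here is the toy sketch promised above: a single attention head in NumPy, with nothing taken from any real library (the weights and sizes are made up). It exists only to show the property the KV Cache exploits: the K and V vectors of tokens that are already processed never change, so each generation step computes K/V for the newest token once, appends them to the cache, and reuses everything else.

```python
import numpy as np

d = 8                                               # toy embedding size; real models use thousands
rng = np.random.default_rng(seed=0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # made-up projection weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

K_cache, V_cache = [], []                           # grows by one entry per processed token

def attend(new_token_embedding):
    """One attention step for ONE new token, reusing the cached K/V of all earlier tokens."""
    q = new_token_embedding @ Wq                    # only the new token's Query is computed
    K_cache.append(new_token_embedding @ Wk)        # its Key and Value are computed once...
    V_cache.append(new_token_embedding @ Wv)        # ...then reused at every later step
    K, V = np.stack(K_cache), np.stack(V_cache)
    weights = softmax(q @ K.T / np.sqrt(d))         # attention over everything seen so far
    return weights @ V                              # context vector passed on through the layer

# Simulate generating 5 tokens: each step processes only one new embedding
# and never recomputes K/V for the earlier ones.
for step in range(5):
    attend(rng.normal(size=d))
    print(f"step {step}: cache holds {len(K_cache)} K/V pairs")
```

In a real model this happens in every attention head of every layer, and the cache lives in GPU memory, which is why very long contexts become a memory problem as well as a compute problem.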
Since the model generates token-by-token anyway, streaming is free — you can show each token to the user as it's produced instead of waiting for the complete response.

**Without streaming:** the user stares at a blank screen for 5 seconds, then the entire 400-word response appears at once. Perceived as "slow AI."

**With streaming:** the user sees the first word appear in 0.5 seconds, then watches the response build. Perceived as "fast, responsive AI" — even if total time is the same.

```python
from openai import OpenAI

client = OpenAI()

# Streaming example — show tokens as they arrive
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What are the best smart glasses in 2026?"}],
    stream=True,    # ← This is all you need — stream=True is free and transforms UX
    max_tokens=150  # ← Control output length = control cost
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

# Output appears token by token, not all at once
```

This is literally how ChatGPT's web interface works — the streaming appearance is the natural behavior of the model, surfaced directly to the user.

Here's the most important practical implication of token-by-token generation: the model cannot go back and correct a previous token. Once "Ray" is generated and added to the context, the model is committed. Every subsequent token is conditioned on "Ray" appearing there. If the model had wanted to say "Apple" but statistical chance led it to generate "Ray" first, it now has to generate something coherent following "Ray" — it cannot reconsider.

**Practical implication 1: Prompt quality matters more than you think.** If your prompt is ambiguous, the model might generate an early token that commits it to the wrong interpretation. It will then generate a coherent-but-wrong response. Better prompts → better first tokens → better entire responses.

**Practical implication 2: This is why hallucination happens.** If the model generates a confident-sounding but wrong fact early in a response, it doesn't "realize" the mistake — it just continues generating tokens that are consistent with the wrong fact. This is why early hallucinations are so hard to fix — by the time the model "knows" it's on the wrong track, it has already committed 50 tokens to a false premise. The next article covers hallucination in depth.

**Practical implication 3: Output format instructions help.** If you instruct the model to output JSON or markdown at the start, it will generate the opening `{` or `#` token first, which statistically primes all subsequent tokens to follow that format. Prompts like "respond in JSON" work because they shape the first-token probability distribution (a short sketch appears at the end of this section).

| Concept | What it means | Action for you |
|---|---|---|
| Autoregressive | Each token depends on all previous tokens | Longer outputs = more time + more cost |
| Output costs 4x | Generating > reading (sequential vs parallel) | Use max_tokens; prompt for conciseness |
| KV Cache | Input attention Keys/Values are cached and reused | Enable prompt caching for repeated system prompts |
| TTFT | Time to first token — perceived as "speed" | Keep prompts lean; always use streaming |
| Streaming | Show tokens as they're generated | Always enable in user-facing apps |
| Irreversible | The model can't backtrack and fix errors | Use clear prompts; consider structured outputs |
| Cost formula | Total cost = (input_tokens × input_price) + (output_tokens × output_price), where output_price is ~4x input_price | Always estimate output tokens before building at scale |

ChatGPT doesn't think — it predicts.
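Here is that sketch, reusing the OpenAI client from above. The prompt wording and JSON keys are made up for illustration; the `response_format` parameter enables OpenAI's JSON mode, which expects the word "JSON" to appear somewhere in the messages and assumes a JSON-mode-capable model such as gpt-4o.

```python
from openai import OpenAI

client = OpenAI()

# The format instruction comes FIRST, so the first generated token is primed to be "{"
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Respond in JSON with the keys 'name', 'price', 'reason'."},
        {"role": "user", "content": "What are the best smart glasses in 2026?"},
    ],
    response_format={"type": "json_object"},  # JSON mode: constrains the output to valid JSON
    max_tokens=150,
)

print(response.choices[0].message.content)
# e.g. {"name": "...", "price": "...", "reason": "..."}
```

The instruction alone already shifts the first-token distribution toward `{`; JSON mode simply adds a hard guarantee on top of that statistical priming.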
💡 **What Knowing Token Generation Changes For You**

1. **Always enable streaming in user-facing apps.** It costs nothing extra and makes responses feel 3-5x faster to users. The perceived latency drop is the biggest free UX win in AI development.

2. **Output length is your biggest cost lever.** The difference between "explain in detail" and "explain in 2 sentences" can be a 5-10x cost reduction with no quality loss for many tasks.

3. **Put output format instructions first.** "Respond in JSON:" as the first line of your prompt statistically primes the first token to be `{`, which propagates through every subsequent token. The model doesn't plan ahead — it just follows the path its first token started.

4. **Enable prompt caching for repeated system prompts.** Anthropic and OpenAI both offer prompt caching — if your system prompt is 5,000 tokens and you send 10,000 requests/day, caching can cut your input costs by 80-90% (depending on the provider).

**Experiment 1: Count the cost before building**

```python
import tiktoken

def estimate_cost(prompt: str, expected_output_words: int) -> dict:
    enc = tiktoken.encoding_for_model("gpt-4o")
    input_tokens = len(enc.encode(prompt))
    output_tokens = int(expected_output_words * 1.33)  # ~0.75 words per token
    input_cost = (input_tokens / 1_000_000) * 2.50     # GPT-4o input: $2.50 per 1M tokens
    output_cost = (output_tokens / 1_000_000) * 10.00  # GPT-4o output: $10.00 per 1M tokens
    return {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "input_cost": f"${input_cost:.6f}",
        "output_cost": f"${output_cost:.6f}",
        "total_per_call": f"${input_cost + output_cost:.6f}",
        "daily_1000_calls": f"${(input_cost + output_cost) * 1000:.2f}",
    }

# Test it
result = estimate_cost(
    prompt="You are a helpful assistant. What are the best smart glasses?",
    expected_output_words=200,
)
print(result)
# → {'daily_1000_calls': '$2.67', ...}
```

**Experiment 2: Visualize generation timing**

```python
import time

start = time.time()
first_token = True

for chunk in client.chat.completions.create(model="gpt-4o", messages=[...], stream=True):
    if chunk.choices[0].delta.content:
        if first_token:
            print(f"\n⚡ TTFT: {time.time() - start:.2f}s")
            first_token = False
        print(chunk.choices[0].delta.content, end="", flush=True)
```

**Experiment 3: stream=True vs stream=False — feel the UX difference**

```python
import time
from openai import OpenAI

client = OpenAI()
prompt = [{"role": "user", "content": "Write a 3-paragraph summary of how LLMs work."}]

# Without streaming — user waits for the full response
start = time.time()
response = client.chat.completions.create(model="gpt-4o", messages=prompt, stream=False)
print(f"Non-streaming total wait: {time.time() - start:.2f}s")
print(response.choices[0].message.content)

# With streaming — user sees first token almost immediately
print("\n--- Streaming version ---")
start = time.time()
first_token_time = None
for chunk in client.chat.completions.create(model="gpt-4o", messages=prompt, stream=True):
    if chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.time() - start
            print(f"⚡ TTFT: {first_token_time:.2f}s")
        print(chunk.choices[0].delta.content, end="", flush=True)

# Same total time — but perceived as much faster because content starts immediately
```
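As a small extension of Experiments 2 and 3 (a sketch reusing the same client, and treating each streamed content chunk as roughly one token, which is only an approximation), you can measure throughput alongside TTFT. Together the two numbers tell you whether a slow response is an input-processing problem or a generation-speed problem.

```python
import time
from openai import OpenAI

client = OpenAI()
prompt = [{"role": "user", "content": "Write a 3-paragraph summary of how LLMs work."}]

start = time.time()
first_token_time = None
chunk_count = 0

for chunk in client.chat.completions.create(model="gpt-4o", messages=prompt, stream=True):
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.time() - start   # TTFT: dominated by input processing
        chunk_count += 1                             # each content chunk is roughly one token

total = time.time() - start
generation_time = max(total - (first_token_time or total), 1e-6)
print(f"\nTTFT:       {first_token_time:.2f}s")
print(f"Total time: {total:.2f}s")
print(f"Throughput: ~{chunk_count / generation_time:.0f} tokens/s (chunk count as a rough proxy)")
```

If TTFT dominates, trim or cache the prompt; if throughput dominates, shorten the requested output or pick a faster model.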