
Why does paying more make your LLM reply faster?

DEV Community
Ashwin Hariharan

Why does Claude respond faster when you pay more? And why does a longer conversation cost disproportionately more than a short one? For the longest time I simply accepted these as "that's just how it works". Like most engineers, I burn through Claude and GPT tokens all day and assumed "longer prompts cost more" was just a billing convention. As it turns out, memory is one of the factors that influences LLM pricing.

Now, memory in AI systems lives in a lot of places: vector stores for RAG, Redis for semantic caches and session state, in-process caches for short-lived data. Each layer has its own latency budget and its own access pattern. One layer that doesn't get talked about much, but quietly determines almost every LLM pricing decision from Claude, GPT, and Gemini, is HBM: the high-bandwidth memory inside the GPU itself.

During the token generation phase, the GPU does two reads from this high-bandwidth memory:

reading the model's weights
reading the KV cache

Let's unpack each briefly.

Every time the model generates a token, your input flows through the model's layers one by one, from the first layer all the way to the output. This is called a forward pass. Each forward pass reads the model weights just once, regardless of how many users are calling the API at the same moment. The weights are constant; they don't change between users.

This means the cost of that one weight read can be split. If the GPU packs 100 user requests into the same forward pass as a batch, those 100 users share the single weight read, and the cost is split amongst them. It also means the "fast tier" modes in tools like Cursor are smaller batches (fewer people splitting the bill), so you pay more per token.

KV Cache

The KV cache works differently. It is a variable cost that grows with your conversation. When the model generates a new token, it doesn't treat every earlier token equally: it uses attention to decide which earlier tokens matter most.
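Before digging into the KV cache, the weight-sharing arithmetic above can be made concrete with a toy cost model. All the numbers below (parameter count, bandwidth) are hypothetical, chosen only to show the shape of the math; a real serving stack overlaps memory reads with compute and won't match these figures exactly.

```python
# Toy model of how batching amortizes the per-token weight read.
# Numbers are illustrative, not measurements of any real deployment.

WEIGHT_BYTES = 140e9      # e.g. a ~70B-parameter model at 2 bytes per weight
HBM_BANDWIDTH = 3.35e12   # bytes/sec, roughly one H100's HBM3 bandwidth

def per_user_weight_read_time(batch_size: int) -> float:
    """Seconds of HBM weight-read time attributed to each user, per token.

    The full weight read happens once per forward pass regardless of batch
    size, so the whole batch shares that one read.
    """
    full_read = WEIGHT_BYTES / HBM_BANDWIDTH  # one complete pass over the weights
    return full_read / batch_size

for batch in (1, 10, 100):
    ms = per_user_weight_read_time(batch) * 1e3
    print(f"batch={batch:>3}: {ms:.2f} ms of weight-read time per user per token")
```

The per-user share falls linearly with batch size, which is exactly why a small-batch "fast tier" costs more per token: fewer users are splitting the same fixed read.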
The easiest way to picture it: imagine every token in your conversation is a sticky note with two parts. The key is a short tag describing what kind of information this token carries. The value is what's written inside the note: the actual information the model can pull in.

Take the sentence: "The cat sat on the mat. It was fluffy." When the model gets to "It was fluffy" and tries to predict the next word, it needs to know what "It" refers to. So it scans the tabs (keys) of every earlier token:

cat: the key indicates "I'm a noun, an animal, the subject". The value carries "small, furry, four legs, often a pet."
mat: the key indicates "I'm a noun, an object, a location". The value carries "flat thing on the floor."

Both are nouns, but the cat key better matches the question "what could 'It' refer to that could be fluffy?" So the model pulls in cat's value more strongly than mat's, and uses that to shape the next token.

Note: in reality, keys and values aren't English sentences; they're vectors of numbers the model learned during training. But functionally that's the job they do: the key is how a token gets found, the value is what gets pulled in once it's found.

For every token in your conversation, the model saves a key (a searchable label) and a value (the content). Without the cache, the attention mechanism would recompute these from scratch on every step. With it, it just reads them back. But that read grows linearly with your conversation:

1,000 tokens of context -> 1,000 key-value pairs read per generated token
100,000 tokens of context -> 100,000 key-value pairs read per generated token

And unlike weights, this cache is unique to your session: the GPU can't read user A's KV cache and reuse it for user B, because the data is different. Every user pays the full cost of reading their own KV cache, with no sharing.

💡 So under the hood, it comes down to how fast a chip can read memory. The weight bill gets split across the batch, whereas the KV bill is just yours.
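The linear growth described above can be sketched numerically. The model shape below (layer count, KV head count, head dimension) is hypothetical, loosely in the range of a large open-weight model; the point is only that KV bytes read per generated token scale linearly with context length.

```python
# Toy estimate of KV-cache bytes read from HBM per generated token.
# Model dimensions are illustrative assumptions, not a specific model's specs.

N_LAYERS = 80       # transformer layers, each with its own KV cache
N_KV_HEADS = 8      # grouped-query attention keeps the KV head count small
HEAD_DIM = 128      # dimension of each key / value vector
BYTES_PER_VALUE = 2 # fp16 / bf16 storage

def kv_bytes_per_generated_token(context_len: int) -> int:
    """Bytes of KV cache read to generate ONE new token.

    Each cached token contributes a key AND a value (factor of 2)
    in every layer, and the whole cache is read every step.
    """
    per_cached_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return context_len * per_cached_token

for ctx in (1_000, 100_000):
    gb = kv_bytes_per_generated_token(ctx) / 1e9
    print(f"{ctx:>7} tokens of context -> {gb:.2f} GB of KV cache read per new token")
```

A 100x longer conversation means a 100x larger read on every single generated token, and none of it can be shared with other users in the batch, which is why long contexts are priced the way they are.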
The math behind how LLMs are trained and served