
I Wrote a Python Interpreter in Python. What I Learned Has Nothing to Do With Python

DEV Community
Juan Torchia

150 points on Hacker News. That number stopped my scroll at 11pm on a Tuesday. The title was simple: "Writing a Python interpreter in Python". I opened it expecting another toy tutorial. I closed my laptop at 2am having written a lexer, a parser, and a working evaluator — and with a question I didn't see coming: why do I now have a better feel for when Claude is lying to me? That's what this post is about.

The original post is solid. The core idea is that Python is expressive enough to model its own evaluation structures. You can represent an AST with dataclasses, walk it with pattern matching, and have a functional REPL loop in under 500 lines. It's not CPython. It doesn't handle every edge case. But it works, and that's exactly the point.

I started following the post line by line. Then I started diverging. And the divergence is where things get interesting.

```python
# Step 1: The Lexer — converts text into tokens

from dataclasses import dataclass
from enum import Enum, auto
from typing import Iterator


class TokenType(Enum):
    NUMBER = auto()
    STRING = auto()
    NAME = auto()          # variables, functions
    PLUS = auto()
    MINUS = auto()
    MULT = auto()
    DIV = auto()
    EQUAL = auto()         # =
    EQUAL_EQUAL = auto()   # ==
    LPAREN = auto()
    RPAREN = auto()
    DEF = auto()
    RETURN = auto()
    IF = auto()
    ELSE = auto()
    NEWLINE = auto()
    EOF = auto()


@dataclass
class Token:
    type: TokenType
    value: str | int | float | None
    line: int


def lexer(source: str) -> Iterator[Token]:
    """Converts source code string into a stream of tokens"""
    keywords = {
        'def': TokenType.DEF,
        'return': TokenType.RETURN,
        'if': TokenType.IF,
        'else': TokenType.ELSE,
    }
    i = 0
    line = 1
    while i < len(source):
        c = source[i]

        # Skip spaces
        if c == ' ':
            i += 1
            continue

        # Track newlines
        if c == '\n':
            yield Token(TokenType.NEWLINE, None, line)
            line += 1
            i += 1
            continue

        # Numbers
        if c.isdigit():
            start = i
            while i < len(source) and (source[i].isdigit() or source[i] == '.'):
                i += 1
            value = source[start:i]
            yield Token(
                TokenType.NUMBER,
                float(value) if '.' in value else int(value),
                line
            )
            continue

        # Identifiers and keywords
        if c.isalpha() or c == '_':
            start = i
            while i < len(source) and (source[i].isalnum() or source[i] == '_'):
                i += 1
            text = source[start:i]
            type_ = keywords.get(text, TokenType.NAME)
            yield Token(type_, text, line)
            continue

        # Operators
        if c == '=' and i + 1 < len(source) and source[i+1] == '=':
            yield Token(TokenType.EQUAL_EQUAL, '==', line)
            i += 2
            continue

        operators = {
            '+': TokenType.PLUS,
            '-': TokenType.MINUS,
            '*': TokenType.MULT,
            '/': TokenType.DIV,
            '=': TokenType.EQUAL,
            '(': TokenType.LPAREN,
            ')': TokenType.RPAREN,
        }
        if c in operators:
            yield Token(operators[c], c, line)
            i += 1
            continue

        i += 1  # ignore unknown characters for now

    yield Token(TokenType.EOF, None, line)
```

This is the lexer. The part most tutorials skip or abstract away behind `re`. I wrote it by hand because I wanted to feel every decision.

```python
# Step 2: The AST — the structure that represents the program

@dataclass
class Node:
    pass

@dataclass
class NumberNode(Node):
    value: int | float

@dataclass
class NameNode(Node):
    name: str

@dataclass
class BinOpNode(Node):
    left: Node
    op: str
    right: Node

@dataclass
class AssignNode(Node):
    name: str
    value: Node

@dataclass
class FuncDefNode(Node):
    name: str
    params: list[str]
    body: list[Node]

@dataclass
class CallNode(Node):
    func: str
    args: list[Node]

@dataclass
class ReturnNode(Node):
    value: Node
```

```python
# Step 3: The Evaluator — where things actually happen

class Environment:
    """Scope: local variables + access to parent scope"""
    def __init__(self, parent=None):
        self.variables = {}
        self.parent = parent

    def get(self, name: str):
        if name in self.variables:
            return self.variables[name]
        if self.parent:
            return self.parent.get(name)
        raise NameError(f"Name '{name}' is not defined")

    def set(self, name: str, value):
        self.variables[name] = value


def evaluate(node: Node, env: Environment):
    """Walks the AST and executes each node"""
    match node:
        case NumberNode(value):
            return value
        case NameNode(name):
            return env.get(name)
        case BinOpNode(left, op, right):
            # Evaluate operands first
            v_left = evaluate(left, env)
            v_right = evaluate(right, env)
            match op:
                case '+': return v_left + v_right
                case '-': return v_left - v_right
                case '*': return v_left * v_right
                case '/': return v_left / v_right
                case '==': return v_left == v_right
        case AssignNode(name, value):
            result = evaluate(value, env)
            env.set(name, result)
            return result
        case FuncDefNode(name, params, body):
            # Store the function as data — simple closures
            env.set(name, (params, body, env))
            return None
        case CallNode(func, args):
            params, body, def_env = env.get(func)
            # Create a new scope for the function call
            local_env = Environment(parent=def_env)
            for param, arg in zip(params, args):
                local_env.set(param, evaluate(arg, env))
            result = None
            for stmt in body:
                result = evaluate(stmt, local_env)
            return result
        case ReturnNode(value):
            return evaluate(value, env)
```

Three files. ~300 lines. Functional enough to evaluate simple functions with recursion.

Here's the weird part. And it's the real reason I'm writing this.

When I finished the basic evaluator, I asked Claude to help me add proper closure support — the case where an inner function captures variables from an outer scope. The response was instant, confident, and partially wrong. Not obviously wrong. Subtly wrong: it was conflating the definition environment with the execution environment. In a language with dynamic scoping (like Bash, or pre-lexical Emacs Lisp), that answer would've been correct. For Python, which has lexical scoping, it was wrong.

And I spotted it immediately. Because I had just written `def_env` by hand. I knew exactly what it meant to capture that environment in `FuncDefNode`.

Before this exercise, would I have caught it? Probably not. I would've pasted the code, run whatever tests I had lying around, and moved on.

This connects to something I got into in my post about how many tokens I actually burn per real task: the cost isn't just financial. It's cognitive.
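The scoping distinction behind that bug is easy to see in plain Python itself: a closure resolves names in the environment where the function was *defined*, not where it is *called* — which is exactly why `FuncDefNode` has to capture `def_env` at definition time. A minimal sketch (the names here are mine, not from the original post):

```python
def make_counter():
    count = 0                # lives in make_counter's scope

    def increment():
        nonlocal count       # resolved lexically: the defining scope,
        count += 1           # not the caller's scope
        return count

    return increment

counter = make_counter()
count = 999                  # a caller-side variable with the same name

print(counter())  # 1 — the closure uses its definition environment
print(counter())  # 2 — under dynamic scoping it would have seen count = 999
```

Under dynamic scoping, `increment` would look `count` up in whatever scope happened to be active at call time and find `999`; lexical scoping is what makes the captured `0` the one that matters.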
Every time you delegate without understanding the layer underneath, you're paying with your ability to detect errors. The LLM isn't lying to you out of malice. It's giving you the most statistically likely answer given the context. If dynamic scoping was the dominant pattern in its training data, that's what you're going to get. The abstraction layer makes sure you don't notice.

I also saw this with CodeBurn from a different angle: when you know exactly what the code needs to do, your prompts are sharper, your corrections are faster, and the iteration loop shrinks. Not because the model got better — because you became a better collaborator.

**Mistake 1: Confusing the parser with the evaluator.**

I started putting evaluation logic inside the parser. "It's easier to just do the addition here while I'm parsing the `+`." Technically functional, architecturally a disaster. Two hours later I understood why the separation exists. Not as dogma — because when you want static analysis, optimizations, or even just decent debugging, you need a clean AST. The LLM would never have made that mistake for me. It would've handed me separated code from the start. And I never would've learned why they're separated.

**Mistake 2: Wanting error handling before I had working functionality.**

Halfway through the lexer I decided I wanted beautiful error messages with line numbers, column numbers, and context snippets. Three hours later I had a gorgeous error system for a lexer that still didn't work. Classic.

**Mistake 3: Not having a minimal test case from day one.**

I started writing without knowing what I wanted to work first. The fix was simple:

```python
# The minimal test that should work from day 1
TEST_CODE = """
def add(a, b):
    return a + b

result = add(3, 4)
"""
# If this works, the interpreter exists.
# Everything else is a feature.
```

Having that contract clear from the beginning organizes everything else.
It's what in the agent world we call "verification" — the same principle I explored with SPICE and Claude Code: it's not enough for the agent to generate something; there has to be an external verification layer.

**What's the difference between an interpreter and a compiler?**

A compiler translates source code into another representation (bytecode, machine code) before executing it. An interpreter executes it directly, usually by walking the AST or evaluating some intermediate representation. CPython is technically a bytecode interpreter: it first compiles to `.pyc`, then executes that bytecode in a virtual machine. What I built here is simpler: it evaluates the AST directly, with no intermediate step.

**Does this have any practical application, or is it just an exercise?**

More practical than it looks. The DSLs (domain-specific languages) that show up in infrastructure configuration, business rules, and template systems use exactly this architecture. If you've ever worked with expressions in Jinja2, rules in Drools, or filters in Elasticsearch — you were using something built on these exact ideas.

**Why write the lexer by hand instead of using `re`?**

Same reason you learn long multiplication before using a calculator. The goal wasn't to have a production lexer. It was to understand what decisions a lexer actually makes. Libraries like PLY or lark do this better and faster — but if you don't understand what they're doing, you also won't understand the error messages when they blow up.

**What does any of this have to do with using LLMs to write code?**

Everything. The LLM generates code that's statistically plausible given the context. If you don't understand the semantics of what you asked for, you can't verify whether what you received is correct. It's not that LLMs are bad — it's that verification requires comprehension. The deeper you've gone into the abstractions, the easier it is to tell when the output makes sense and when it doesn't.
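Incidentally, you can watch CPython's own two-step pipeline — compile first, interpret the bytecode second — from any REPL. A quick sketch (exact instruction names in the disassembly vary by Python version):

```python
import dis

# Step 1: compile — source text becomes a bytecode code object; nothing runs yet
code = compile("add(3, 4)", "<demo>", "eval")
print(type(code).__name__)   # code
dis.dis(code)                # shows LOAD/CALL-style instructions for the call

# Step 2: execute — the virtual machine interprets that bytecode
def add(a, b):
    return a + b

print(eval(code, {"add": add}))  # 7
```

The mini-interpreter in this post skips Step 1's output format entirely and walks the AST instead — same division of labor, one fewer representation.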
I get into this more in the post about the real cost per task in Claude Code.

**Do you need compiler theory to do this?**

No. The original HN post doesn't assume any prior knowledge of compilers, and I didn't have any when I started either. I took Automata and Compilers courses later in my CS degree — and when I did, I retroactively understood why things worked the way they did. You can start with the code, and the theory follows naturally if the curiosity is there.

**How long does it take to build something like this from scratch?**

The minimal working interpreter (lexer + recursive descent parser + evaluator with functions) took me a weekend — with interruptions, coffee breaks, and a couple of dead ends. If you follow the reference post in order, probably less. The real learning time isn't the writing: it's the moments where something doesn't work and you have to figure out why.

Building abstraction layers from the bottom up doesn't make you faster. It makes you more precise. In a context where frontier models keep getting more expensive and agents run on distributed infrastructure, the ability to verify AI output isn't a nice-to-have. It's the difference between using a tool and being used by one.

I passed Analysis II on my fourth attempt. Not because I'm bad at math — because in the first three attempts I was following steps without understanding structures. I passed on the fourth when I stopped memorizing and started building intuition from definitions. The interpreter taught me the same thing, but about code and about AI.

If you want the full repo with the code from this post, send me a message. And if you've built something like this and arrived at different conclusions — especially if you think I'm wrong about the LLM part — I'm genuinely interested in that conversation. That's the conversation worth having.