AI News Hub Logo

AI News Hub

Cloudflare Workers HTML to Markdown on the Free Plan

DEV Community
Rick Cogley

This is a condensed version. Full article on cogley.jp has the complete code walkthrough, known limits of the starter emitter, and the full reasoning for each alternative. AI crawlers — Gemini, GPT, Claude, Perplexity — read your site constantly, and they'd rather parse markdown than HTML. Markdown means cleaner context, fewer tokens, cheaper inference. If your content is already markdown (CMS, Git, database), you just negotiate the format with Accept: text/markdown. But if your content is HTML — you're proxying a third-party page, mirroring docs, building a reader-mode endpoint, feeding an LLM summarizer, or simply serving a static website — you'd have to convert it to markdown inside a Cloudflare Worker. On the free plan that means 10 ms CPU and 1 MB compressed bundle — which strategy survives those constraints? Worth clarifying upfront because Cloudflare uses "paid" for two separate products: Workers Paid ($5/month + usage) — the Worker runtime upgrade. CPU jumps from 10 ms to 30 s, bundle from 1 MB to 10 MB. This plan is what changes the HTML-to-markdown calculus. Cloudflare Pro ($20/month per domain) — a domain plan. Adds WAF, image optimization, page rules. Does not change any Worker limit. Throughout this article, "paid" means Workers Paid. Limit Free Paid CPU time per request 10 ms 30 s (Standard) Compressed Worker bundle 1 MB 10 MB Ten ms is plenty for routing or JSON work. HTML-to-markdown is different — you're parsing a DOM, walking every node, emitting a transformed string. It's CPU-dense, and any strategy that ships its own DOM implementation tends to blow the bundle budget too. HTMLRewriter is built into workerd — the open-source JavaScript/Wasm runtime that executes your Worker at the edge (and what wrangler dev runs locally). Zero npm dependencies. Used by Cloudflare themselves for response transformation. Architectural distinction. HTMLRewriter is streaming and SAX-style: it fires / text / events as bytes arrive and never builds an in-memory tree. The turndown / Readability / cheerio family do the opposite — buffer the whole document, construct a DOM with every node and parent pointer allocated, then walk it. That construction pass is both a CPU tax (before you emit a single markdown character) and the reason those libraries ship their own DOM implementation (hundreds of KB of bundle). On a sample 34 KB HTML article: Bundle: 10.52 KiB uncompressed / 3.74 KiB gzipped (0.4% of the 1 MB budget) CPU: 2 ms median over 50 runs (min 2, max 8) — 20% of the 10 ms budget Output: 24.9 KB markdown That's 5× CPU headroom and 250× bundle headroom on free. Nothing else I measured came close. Numbers are from wrangler dev local workerd — the edge runtime is typically 1.5–2× slower, so plan on 3–4 ms realistic median. Still well inside 10 ms. Strategy Bundle CPU estimate Fits free budget? HTMLRewriter ~11 KB 2 ms ✅ Huge headroom node-html-parser ~40 KB Fast ✅ Good fallback cheerio + custom 100–150 KB Moderate △ Tight, no upside turndown + domino shim ~320 KB 15–30 ms ❌ Busts CPU budget Readability + turndown ~400 KB 20–40 ms ❌ Busts both jsdom ~2 MB — ❌ Twice the bundle budget The turndown / Readability stack is the canonical "HTML to markdown" setup, and it produces clean output. It just doesn't fit on the free plan — the 240 KB @mixmark-io/domino shim alone eats a third of your bundle before you write a single line. Use it on paid. Insert markdown punctuation as text adjacent to each matched element, then strip everything else with a catch-all * selector: const rewriter = new HTMLRewriter() .on('head, nav, aside, footer, script, style, figure', { element(el) { el.remove(); }, }) .on('h1', { element(el) { el.before('\n\n# ', { html: false }); el.after('\n\n', { html: false }); }, }) .on('h2', { element(el) { el.before('\n\n## ', { html: false }); el.after('\n\n', { html: false }); }, }) .on('li', { element(el) { el.before('\n- ', { html: false }); }, }) .on('a', { element(el) { const href = (el.getAttribute('href') || '').replace(/\s+/g, ''); el.before('[', { html: false }); el.after(`](${href})`, { html: false }); }, }) .on('*', { element(el) { el.removeAndKeepContent(); }, }); const raw = await rewriter .transform(new Response(html, { headers: { 'content-type': 'text/html' } })) .text(); Post-process for HTML entities and whitespace normalization (full code in the cogley.jp article). HTMLRewriter selectors fire independently, so cross-element state is awkward: Ordered lists come out unnumbered (no parent lookup for list index) Inline inside drops its fences (no parent-type check) Link text spanning formatting tags loses emphasis when * strips These matter for round-tripping. For "give AI agents clean markdown" they don't. The harness that produced these numbers is a separate public repo: cf-workers-html-to-markdown-harness. It's a measurement rig, not a library. git clone https://github.com/RickCogley/cf-workers-html-to-markdown-harness cd cf-workers-html-to-markdown-harness npm install --ignore-scripts npm run dev curl 'http://127.0.0.1:8791/bench?strategy=htmlrewriter&runs=50' Adding a new strategy (cheerio, node-html-parser, turndown+domino) is one handler file under src/handlers/ plus one line in src/index.ts — see ADD_A_STRATEGY.md for the full pattern. If you measure something I didn't and get different numbers, open an issue on the harness repo and I'll update the article. Need any of these? Stop trying to stay on free: Round-trippable markdown → turndown Article extraction (skip nav/sidebar/comments) → Readability HTML tables → markdown tables → turndown or cheerio Larger or more varied inputs → paid's 30 s budget removes the ceiling Workers Paid is $5/month + usage. It's cheaper than an afternoon of engineering around the free-plan budgets if your use case needs a fuller converter. Full version with complete code, reasoning, and the harness deep-dive: cogley.jp. Rick Cogley is founder/CEO of eSolia Inc. in Tokyo. Originally published at cogley.jp Rick Cogley is CEO of eSolia Inc., providing bilingual IT outsourcing and infrastructure services in Tokyo, Japan.