PDF Is Still the Hardest File Format to Work With. Here's Why.
All tests run on an 8-year-old MacBook Air.

I've spent months building a PDF tool. PDF is the most infuriating file format I've ever worked with. Not because it's poorly designed. Because it's been accumulating complexity since 1993 and the spec is roughly 750 pages long.

PDF files contain operators, operands, a stack-based execution model, and a coordinate system. A "page" is a program that draws itself.

    BT                   % Begin text
    /F1 12 Tf            % Set font F1 at 12pt
    100 700 Td           % Move to position
    (Hello, World.) Tj   % Draw text string
    ET                   % End text

That's not markup. That's code. Parsing it correctly requires implementing an interpreter.

Fonts in PDF come in around nine variants, and each has different encoding rules: Type1, MMType1, TrueType, Type3, Type0, CIDFontType0, CIDFontType2, plus embedded font programs tagged CIDFontType0C or OpenType. A scanned PDF from a 2003 copier might use a custom encoding table that maps byte values to glyph IDs in a way that made sense to that specific device in 2003. Text extraction from these files produces garbage unless you implement the full encoding resolution chain. Most libraries don't.

PDF uses an incremental update model. When you edit a PDF, the original objects stay in the file. New objects are appended, and a new cross-reference (xref) table is written that points to the new versions. The old versions are still in the bytes. They're just not indexed.

    [Original PDF bytes]
    [Updated object - new version of page 1]
    [New XREF table pointing to updated objects]
    %%EOF

A forensic tool (or a hex editor) can read the original content. This is why "save as" doesn't always erase metadata: unless the tool rewrites the file from scratch, the earlier revision is still in there.

PDF viewers are so permissive that malformed PDFs became common. Files that violate the spec in multiple ways open fine in Adobe Reader, so nobody fixed them. Now every PDF parser has to handle malformed files gracefully, because real-world PDFs are full of spec violations. lopdf will fail on files that Adobe Reader opens without complaint. You end up writing recovery code for files that technically shouldn't exist.
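To make the "a page is a program" point concrete, here is a minimal Python sketch of the operand-stack loop behind the BT ... ET example. It assumes a tiny subset of content-stream syntax (numbers, /names, literal strings without escapes or nesting, % comments) and only the Tf, Td, and Tj operators. The names TOKEN, extract_text, and SAMPLE are hypothetical, and Td is treated as a simple offset, which only holds right after BT. A real interpreter also has to deal with hex strings, arrays, dictionaries, inline images, the graphics state, and font encodings.

```python
import re

# Hypothetical, deliberately tiny tokenizer for a subset of the syntax.
TOKEN = re.compile(
    r"""
    %[^\r\n]*                          # comment: matched but ignored
  | \((?P<string>[^()\\]*)\)           # literal string (no escapes, no nesting)
  | (?P<name>/[^\s()<>\[\]{}/%]+)      # name object such as /F1
  | (?P<number>[+-]?(?:\d+\.?\d*|\.\d+))
  | (?P<op>[A-Za-z*]+)                 # operator keyword such as Tf or Tj
    """,
    re.VERBOSE,
)


def extract_text(content):
    """Run the operand-stack loop; return (font, size, x, y, text) per Tj."""
    stack = []                 # operands wait here until an operator arrives
    font, size = None, None
    x, y = 0.0, 0.0
    shown = []
    for m in TOKEN.finditer(content):
        kind = m.lastgroup
        if kind is None:       # a comment matched; nothing to do
            continue
        if kind == "number":
            stack.append(float(m.group("number")))
        elif kind in ("string", "name"):
            stack.append(m.group(kind))
        else:                  # operator: consume the operands collected so far
            op = m.group("op")
            if op == "BT":
                x, y = 0.0, 0.0
            elif op == "Tf":
                font, size = stack[-2], stack[-1]
            elif op == "Td":   # simplification: plain offset, no text matrix
                x, y = x + stack[-2], y + stack[-1]
            elif op == "Tj":
                shown.append((font, size, x, y, stack[-1]))
            stack.clear()
    return shown


SAMPLE = """BT % Begin text
/F1 12 Tf % Set font F1 at 12pt
100 700 Td % Move to position
(Hello, World.) Tj % Draw text string
ET % End text"""

print(extract_text(SAMPLE))
```

Even this toy version has to remember state between operators (current font, current position), which is exactly what separates it from parsing markup.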
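The incremental update layout described above can be probed with a few lines of Python. This is a rough sketch, not a forensic tool: it assumes each revision ends with a startxref offset followed by %%EOF, and it will miscount on linearized files or when that byte pattern happens to appear inside stream data. The names REVISION_END, list_revisions, and bytes_of_revision are hypothetical.

```python
import re

# Each saved revision is assumed to end with: startxref <offset> %%EOF
REVISION_END = re.compile(rb"startxref\s+(\d+)\s+%%EOF")


def list_revisions(data):
    """Return the xref offset recorded at the end of each revision, oldest first."""
    return [int(m.group(1)) for m in REVISION_END.finditer(data)]


def bytes_of_revision(data, index):
    """Return the file as it looked after revision `index` (0 = original)."""
    ends = [m.end() for m in REVISION_END.finditer(data)]
    return data[: ends[index]]
```

If list_revisions returns more than one offset, the file probably carries older versions of some objects, and bytes_of_revision(data, 0) gives you the bytes a viewer would have seen before the first edit.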
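As an illustration of what that recovery code tends to look like, here is a minimal Python sketch of one common fallback: when the xref table is missing or wrong, scan the raw bytes for "N G obj" headers and rebuild the index by hand. The names OBJ_HEADER and rebuild_xref are hypothetical, and "the last definition wins" is an assumption that mirrors incremental updates. The sketch ignores objects packed into compressed object streams and can be fooled by stream data that happens to look like an object header.

```python
import re

# Matches headers like b"12 0 obj"; the lookbehind avoids starting mid-number.
OBJ_HEADER = re.compile(rb"(?<![0-9])(\d+)\s+(\d+)\s+obj\b")


def rebuild_xref(data):
    """Map (object number, generation) -> byte offset of its last definition."""
    table = {}
    for m in OBJ_HEADER.finditer(data):
        key = (int(m.group(1)), int(m.group(2)))
        table[key] = m.start()   # later definitions overwrite earlier ones
    return table
```

The result is only a starting point: each recovered offset still has to be parsed and validated before you trust it.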
None of this is a reason to avoid PDF. It's a reason to understand what you're getting into before you start. The complexity is why good PDF tools are worth building, and why nobody has fully solved it yet.

Hiyoko PDF Vault → https://hiyokoko.gumroad.com/l/HiyokoPDFVault

@hiyoyok
