Most PDF Redaction Is Broken. Here's What "Real" Redaction Actually Requires.

DEV Community
hiyoyo

All tests run on an 8-year-old MacBook Air.

Drawing a black rectangle over text is not redaction. The text is still there. Select all, copy, paste into Notepad, and it appears. This has leaked classified documents from actual government agencies. Multiple times.

Real redaction destroys the underlying data. Here's how I implemented it.

PDF structure (fake redaction):

    Page content stream: "Salary: $120,000"   ← still here
    Annotation layer:    [black rectangle]    ← just covering it

The content stream is untouched. Any PDF parser can read it.

Real redaction takes four steps:

1. Find the target text in the content stream
2. Remove it from the stream entirely
3. Replace it with a filled black rectangle drawn directly into the content
4. Re-serialize the page so that no original data survives

The core of it, using lopdf:

    use lopdf::{Document, ObjectId};

    pub fn redact_text(
        doc: &mut Document,
        page_id: ObjectId,
        target: &str,
    ) -> lopdf::Result<()> {
        let page = doc.get_object_mut(page_id)?;
        if let Ok(stream) = page.as_stream_mut() {
            let content = stream.decode_content()?;

            // Remove text operators containing target.
            // Assumed here: the helper returns the cleaned stream bytes plus
            // the bounding box (x, y, width, height) of the text it removed.
            let (mut cleaned, (x, y, width, height)) =
                remove_text_from_content(content, target);

            // Replace with black filled rectangle at same position
            let redact_op = format!(
                "q 0 0 0 rg {} {} {} {} re f Q\n",
                x, y, width, height
            );
            cleaned.extend_from_slice(redact_op.as_bytes());
            stream.set_content(cleaned);
        }
        Ok(())
    }

PDF content streams don't store text with coordinates in a simple format. Text position is determined by the current transformation matrix, the text matrix, and font metrics, all of it stateful. Parsing this correctly requires a proper content stream interpreter, not a regex over the raw bytes. lopdf gives you the raw stream. Interpreting it is your job.

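To make "stateful" concrete, here is a minimal sketch of the bookkeeping an interpreter has to do just to know where a string starts. It is not the code from the tool: the `Op` enum and `text_origins` function are hypothetical names invented for this illustration, and the sketch assumes the stream has already been parsed into operations. It handles only `BT`, `ET`, `Td`, `Tm` and `Tj`, and it ignores the current transformation matrix (`cm`), font size, and glyph widths, all of which a real interpreter also needs.

```rust
/// Simplified content-stream operations (hypothetical type for this sketch;
/// a real parser such as lopdf exposes its own operation type).
#[derive(Debug, Clone, PartialEq)]
pub enum Op {
    BeginText,               // BT
    EndText,                 // ET
    MoveText(f64, f64),      // tx ty Td
    SetTextMatrix([f64; 6]), // a b c d e f Tm
    ShowText(String),        // (string) Tj
}

/// PDF matrices are written [a b c d e f]; this is the identity.
const IDENTITY: [f64; 6] = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0];

/// Walks the operations and returns each shown string together with the
/// text-space origin (x, y) in effect when it was shown.
///
/// Simplification: the horizontal advance after `Tj` is not tracked, because
/// that requires font metrics. Origins are therefore only right for strings
/// that directly follow a positioning operator (`Td` or `Tm`).
pub fn text_origins(ops: &[Op]) -> Vec<(String, f64, f64)> {
    let mut line_matrix = IDENTITY; // start of the current line
    let mut text_matrix = IDENTITY; // where the next glyph goes
    let mut found = Vec::new();

    for op in ops {
        match op {
            // Both matrices reset at text object boundaries.
            Op::BeginText | Op::EndText => {
                line_matrix = IDENTITY;
                text_matrix = IDENTITY;
            }
            // Td offsets the line matrix by (tx, ty), expressed in the
            // current text space, then restarts the text matrix from it.
            Op::MoveText(tx, ty) => {
                let [a, b, c, d, e, f] = line_matrix;
                line_matrix = [a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f];
                text_matrix = line_matrix;
            }
            // Tm replaces both matrices outright.
            Op::SetTextMatrix(m) => {
                line_matrix = *m;
                text_matrix = *m;
            }
            // Tj draws at the translation part of the text matrix.
            Op::ShowText(s) => {
                found.push((s.clone(), text_matrix[4], text_matrix[5]));
            }
        }
    }
    found
}
```

Given `BT`, a `Tm` of `[1 0 0 1 72 700]`, `Tj "Name:"`, a `Td` of `(0, -14)` and `Tj "Salary: $120,000"`, this reports the first string at (72, 700) and the second at (72, 686). Neither `Tj` carries a coordinate of its own, so a regex that finds the salary string in the raw bytes still has no idea where to draw the rectangle.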
For auto-detection of PII (names, phone numbers, ID numbers), I run a pattern-matching pass before redaction:

    use regex::Regex;

    #[derive(Debug, Clone, Copy, PartialEq)]
    pub enum PiiType {
        Phone,
        MyNumber,
    }

    /// Returns (start, end, kind) byte ranges for each detection.
    pub fn detect_pii(text: &str) -> Vec<(usize, usize, PiiType)> {
        let mut findings = Vec::new();

        // Phone numbers
        let phone_re = Regex::new(r"\d{2,4}-\d{2,4}-\d{4}").unwrap();
        for m in phone_re.find_iter(text) {
            findings.push((m.start(), m.end(), PiiType::Phone));
        }

        // Japanese My Number (12 digits)
        let mynumber_re = Regex::new(r"\b\d{12}\b").unwrap();
        for m in mynumber_re.find_iter(text) {
            findings.push((m.start(), m.end(), PiiType::MyNumber));
        }

        findings
    }

The user reviews detections before committing. Auto-redaction without review is its own kind of risk.

Hiyoko PDF Vault → https://hiyokoko.gumroad.com/l/HiyokoPDFVault
@hiyoyok