# How I Built a Masking Tool Without Showing AI Any Real Data: Column-wise Shuffling as the Scaffold
**TL;DR** — I never write code or send real data to LLMs, yet I built a complete data-masking tool through AI collaboration. The technique: column-wise independent shuffling (Japan PPC's official anonymization method) plus Faker replacement. Four phases: send column names → run the shuffling batch → manually craft a sample CSV → send the sample for a Faker batch + structural review. Key discipline: survey naive ideas in industry terminology before having AI implement them — that alone compresses code 10x. The output is a tool I trigger by double-click. I never read the Python.

## The Problem: How Do I Sanitize the Data?

Across my field notes, I've kept asking the same question: but how exactly do I sanitize the data? I wanted to build a new masking tool, and I wanted to discuss it with an LLM. But the rule is firm: no business data goes to LLMs. And just describing the logic verbally doesn't land — the model needs something concrete to work with.

What I needed: data that looks real but can't identify anyone.

## The Naive Idea: Shuffle Each Column Independently

My first idea was simple: "What if I shuffle each column independently?" If you shuffle each column on its own:

- Each value remains real (format perfectly preserved)
- Row-level combinations are destroyed (records can't be reconstructed)
- Per-column statistical properties are preserved (distributions intact)

For 100 customer rows, shuffle the name column, the address column, and so on — each with its own permutation. That should make individual identification impossible.

## Survey Before Implementing

But before implementing, I surveyed first. "Naive idea → immediate implementation" is forbidden (a basic ops discipline in AI-assisted coding). Searching "column-wise shuffle + anonymization + technical term" turned up the formal name: Column-wise Independent Shuffling.

And surprisingly, Japan codifies it too: Japan's Personal Information Protection Commission (PPC) lists it among its official anonymization methods. So my naive idea was literally PPC's official method.

A premise I should make explicit — I don't write a single line of code.
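The bullet points above can be sketched in a few lines of Python. This is my reconstruction of the idea, not the tool's actual code (the post's author never reads the Python) — the column names are illustrative, and I'm assuming a pandas DataFrame as input:

```python
# Sketch of column-wise independent shuffling (illustrative, not the
# actual tool). Assumes pandas; column names are placeholders.
import numpy as np
import pandas as pd


def shuffle_columns_independently(df, seed=None):
    """Shuffle every column with its OWN permutation, so per-column
    distributions survive but row-level combinations are destroyed."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        # A fresh permutation per column. Reusing one permutation for
        # every column would leave each row's values together, which
        # defeats the whole point of the masking.
        out[col] = out[col].to_numpy()[rng.permutation(len(out))]
    return out


if __name__ == "__main__":
    df = pd.DataFrame({
        "name": ["Sato", "Suzuki", "Tanaka"],
        "address": ["Tokyo", "Osaka", "Nagoya"],
        "amount": [1200, 3400, 560],
    })
    print(shuffle_columns_independently(df, seed=42))
```

Each column keeps exactly the same multiset of values (formats and distributions intact), while the row-wise pairings are scrambled.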
## Phase 1: Send Column Names → Get the Shuffling Batch Built

But the rule "no business data to LLMs" applies, so I can't just send the file. I can't send the real data, but I *can* send the column names.

Prompt to LLM:

> Schema: customerID / name / address / building name / company / amount

The LLM produced a batch file + internal script + input/output folders — a tool that runs on double-click. I drop real data into the input folder, double-click the batch file, and a masked CSV appears in the output folder.

## Phase 2: Verify Operation → Fix the Bug

Something's off. The shuffle is supposedly happening, but row-level combinations are still intact. I report to the LLM:

> The double-click ran fine, but the output CSV doesn't look shuffled.

The LLM's instant reply:

> The internal seed is shared across all columns. We need a different permutation per column.

I receive the fixed batch file, double-click → combinations are now destroyed. What looked correct on paper failed in practice. Practical verification is the human's role.

## Phase 3: Build the Sample CSV by Hand

From the shuffled output, I pull just 10 rows and manually replace the surnames and building names. The sample CSV now has only the column structure and the *shape* of the data — nothing that can identify anyone. Only at this point does it become material I can send to an LLM.

## Phase 4: Send the Sample CSV → Get the Faker Batch Built

I send the sample CSV to the LLM with a follow-up request:

> Based on this sample, add a Faker-based replacement step for the name-like columns.

The LLM integrated Faker (ja_JP locale, but the same applies in any locale). While reading the sample, the LLM also noticed:

> Your "product" keyword rule for Faker replacement is over-matching: it catches columns it shouldn't.

This wasn't a perspective I would have spotted alone. I use AI twice — once for the shuffling batch (built from column names alone), and once for the Faker batch (built from the sample CSV). The LLM rewrote the matching logic from "keyword-match → apply" into a stricter rule that no longer over-matched.

| Phase | What I do | What AI does |
|---|---|---|
| Phase 1: Build shuffling batch | Send column names as prompt | Build the complete batch tool |
| Phase 2: Verify operation → fix bug | Click / verify in Excel / report | Identify bug cause and fix |
| Phase 3: Build sample CSV | Pull 10 rows / manually edit surnames and building names | (not involved) |
| Phase 4: Build Faker batch | Send sample CSV to LLM / click / verify | Build Faker batch + structural review (resolve over-match) |

I never read the Python.
I never send real data to LLMs. Double-click → open in Excel → report to AI. Four phases through that loop. This is what AI collaboration looks like.

## A Brief Touch on the Legal Positioning

Internal use (LLM discussion / internal analysis) is generally fine. Handling client data as a contractor is where it gets tricky:

- **Japan** classifies this as "entrusted processing" under Article 27(5)(i) of the Personal Information Protection Act, an exception to third-party transfer rules.
- **EU/UK** treats it as a Data Processor / Data Controller relationship under the GDPR, with a Data Processing Agreement (DPA) under Article 28 specifying the processing scope.
- **US** uses HIPAA's Business Associate Agreement (BAA) for healthcare data, or contractual data-handling clauses for general PII.

The common pattern: contract language determines compliance. In short: contracts get complicated, so legal review is recommended.

## The Takeaway

What the AI collaboration era needs is a scaffold tool that converts "data you can't share" into "samples you can share."

And the post's thesis: don't have AI implement your naive idea immediately — survey it in industry terminology first. That discipline compresses code volume by 10x. If I hadn't learned that "column-wise shuffle" is PPC's official method, I wouldn't have had the confidence to keep the implementation that simple. Survey-driven discipline. In the AI collaboration era, what matters is the ability to cross-reference industry knowledge.
