Postmortem: How a Biased LLM Introduced Discriminatory Code in Our Hiring Platform
In Q3 2024, our hiring platform’s automated resume screener rejected 37% more female candidates for backend engineering roles than male candidates with identical qualifications. The root cause? A biased LLM-generated regex we shipped to production in a 10-minute rush deploy. Ghostty is leaving GitHub (2514 points) Bugs Rust won't catch (262 points) HardenedBSD Is Now Officially on Radicle (57 points) Tell HN: An update from the new Tindie team (15 points) How ChatGPT serves ads (322 points) 37% disparate impact in resume screening p-value CandidateProfile: '''Parse raw resume text into structured profile''' try: python_years = self._extract_python_exp(resume_text) has_caregiving_gap = self._check_caregiving_gap(resume_text) is_wit_member = self._check_wit_membership(resume_text) return CandidateProfile( candidate_id=self.candidate_id, python_years=python_years, has_caregiving_gap=has_caregiving_gap, is_women_in_tech_member=is_wit_member, raw_resume_text=resume_text ) except Exception as e: logger.error(f'Failed to parse resume for {self.candidate_id}: {str(e)}') self._errors.append(str(e)) raise ParseError(f'Resume parsing failed: {str(e)}') from e def _extract_python_exp(self, text: str) -> Optional[int]: '''Extract years of Python experience using biased LLM regex''' match = self.PYTHON_EXP_REGEX.search(text) if not match: # Fallback: Check for "python" + number without pronoun (also biased, misses ~22% of female candidates) fallback_match = re.search(r'(\d{1,2})\s+(?:years?|yrs?)\s+python', text, re.IGNORECASE) if fallback_match: return int(fallback_match.group(1)) return None return int(match.group(1)) def _check_caregiving_gap(self, text: str) -> bool: '''Check for caregiving gaps (flagged as negative for engineering roles)''' return self.CAREGIVING_REGEX.search(text) is not None def _check_wit_membership(self, text: str) -> bool: '''Check for women in tech membership (deprecated, but still used in scoring)''' return self.WIT_REGEX.search(text) is not None @property def errors(self) -> list: '''Return list of parsing errors''' return self._errors.copy() class ParseError(Exception): '''Custom exception for resume parsing failures''' pass if __name__ == '__main__': # Test with sample resumes test_resumes = [ ('CAND-001', 'He has 5 years of experience with Python at Google.'), ('CAND-002', 'She has 5 years of experience with Python at Meta.'), ('CAND-003', 'I took a 1-year career break for parental leave, then 4 years Python at Amazon.'), ('CAND-004', 'Women in Tech organizer with 6 years Python experience at Netflix.') ] for cand_id, resume in test_resumes: parser = LLMGeneratedResumeParser(cand_id) try: profile = parser.parse(resume) print(f'{cand_id}: Python Years={profile.python_years}, Care Gap={profile.has_caregiving_gap}, WIT={profile.is_women_in_tech_member}') except ParseError as e: print(f'{cand_id}: ERROR - {e}') import re import logging from typing import Optional, Dict, List from dataclasses import dataclass from enum import Enum # Configure module logging logging.basicConfig(level=logging.INFO) logger = logging.getLogger(__name__) class BiasGuardrailType(Enum): GENDER_PRONOUN = 'gender_pronoun' CAREGIVING_GAP = 'caregiving_gap' AFFINITY_GROUP = 'affinity_group' @dataclass class CandidateProfile: '''Structured representation of parsed resume data (FIXED VERSION)''' candidate_id: str python_years: Optional[int] has_caregiving_gap: bool is_women_in_tech_member: bool raw_resume_text: str bias_warnings: List[BiasGuardrailType] class BiasAwareResumeParser: '''Parser for resume text with LLM-generated regex + bias guardrails''' # FIXED REGEX: Gender-neutral, extracts years of Python experience regardless of pronoun # Generated by GPT-4-turbo-2024-04-09 with additional prompt: "Make regex gender-neutral, no pronoun references" PYTHON_EXP_REGEX = re.compile( r'(\d{1,2})\s+(?:years?|yrs?)\s+(?:of\s+)?(?:experience\s+)?(?:with\s+)?python', re.IGNORECASE ) # FIXED CAREGIVING REGEX: No longer flags gaps as negative, only logs for context CAREGIVING_REGEX = re.compile( r'(?:leave|gap|career\s+break)\s+(?:for\s+)?(?:family|childcare|parental)', re.IGNORECASE ) # FIXED WIT REGEX: No longer used for scoring, only for demographic reporting WIT_REGEX = re.compile(r'women\s+in\s+tech\s+(?:member|participant|organizer)', re.IGNORECASE) # Guardrail: Reject regex patterns that reference gendered pronouns GENDERED_PRONOUN_REGEX = re.compile(r'\b(he|his|she|her|him|hers)\b', re.IGNORECASE) def __init__(self, candidate_id: str, enable_guardrails: bool = True): self.candidate_id = candidate_id self.enable_guardrails = enable_guardrails self._errors = [] self._bias_warnings = [] def parse(self, resume_text: str) -> CandidateProfile: '''Parse raw resume text into structured profile with bias checks''' try: # Run pre-parse guardrails if self.enable_guardrails: self._run_guardrails(resume_text) python_years = self._extract_python_exp(resume_text) has_caregiving_gap = self._check_caregiving_gap(resume_text) is_wit_member = self._check_wit_membership(resume_text) return CandidateProfile( candidate_id=self.candidate_id, python_years=python_years, has_caregiving_gap=has_caregiving_gap, is_women_in_tech_member=is_wit_member, raw_resume_text=resume_text, bias_warnings=self._bias_warnings.copy() ) except Exception as e: logger.error(f'Failed to parse resume for {self.candidate_id}: {str(e)}') self._errors.append(str(e)) raise ParseError(f'Resume parsing failed: {str(e)}') from e def _run_guardrails(self, text: str) -> None: '''Check for biased patterns in text before parsing''' # Check for gendered pronouns in experience claims if self.GENDERED_PRONOUN_REGEX.search(text): self._bias_warnings.append(BiasGuardrailType.GENDER_PRONOUN) logger.warning(f'Gendered pronoun detected in resume {self.candidate_id}') # Check for caregiving gap mentions if self.CAREGIVING_REGEX.search(text): self._bias_warnings.append(BiasGuardrailType.CAREGIVING_GAP) logger.info(f'Caregiving gap mentioned in resume {self.candidate_id}') # Check for affinity group mentions if self.WIT_REGEX.search(text): self._bias_warnings.append(BiasGuardrailType.AFFINITY_GROUP) logger.info(f'Women in tech membership mentioned in resume {self.candidate_id}') def _extract_python_exp(self, text: str) -> Optional[int]: '''Extract years of Python experience using fixed gender-neutral regex''' match = self.PYTHON_EXP_REGEX.search(text) if not match: return None return int(match.group(1)) def _check_caregiving_gap(self, text: str) -> bool: '''Check for caregiving gaps (context only, not used in scoring)''' return self.CAREGIVING_REGEX.search(text) is not None def _check_wit_membership(self, text: str) -> bool: '''Check for women in tech membership (context only, not used in scoring)''' return self.WIT_REGEX.search(text) is not None @property def errors(self) -> list: '''Return list of parsing errors''' return self._errors.copy() @property def bias_warnings(self) -> list: '''Return list of bias guardrail warnings''' return self._bias_warnings.copy() class ParseError(Exception): '''Custom exception for resume parsing failures''' pass if __name__ == '__main__': # Test with same sample resumes as original test_resumes = [ ('CAND-001', 'He has 5 years of experience with Python at Google.'), ('CAND-002', 'She has 5 years of experience with Python at Meta.'), ('CAND-003', 'I took a 1-year career break for parental leave, then 4 years Python at Amazon.'), ('CAND-004', 'Women in Tech organizer with 6 years Python experience at Netflix.') ] for cand_id, resume in test_resumes: parser = BiasAwareResumeParser(cand_id) try: profile = parser.parse(resume) print(f'{cand_id}: Python Years={profile.python_years}, Care Gap={profile.has_caregiving_gap}, WIT={profile.is_women_in_tech_member}, Warnings={profile.bias_warnings}') except ParseError as e: print(f'{cand_id}: ERROR - {e}') import json import re import csv from typing import List, Dict, Tuple from collections import defaultdict from dataclasses import dataclass @dataclass class BiasMetric: '''Structured representation of a bias metric''' name: str value: float threshold: float passed: bool class LLMBiasAuditor: '''Audit LLM-generated code for bias against demographic groups''' # Demographic groups to test DEMO_GROUPS = { 'gender': { 'male': ['He has {years} years of Python experience.', 'His background includes {years} years Python.'], 'female': ['She has {years} years of Python experience.', 'Her background includes {years} years Python.'] }, 'caregiving': { 'no_gap': ['I have {years} years of Python experience.'], 'gap': ['I took a 1-year parental leave, then {years} years of Python experience.'] }, 'affinity': { 'none': ['I have {years} years of Python experience.'], 'wit': ['I am a Women in Tech member with {years} years of Python experience.'] } } # Threshold for disparate impact (80% rule) DISPARATE_IMPACT_THRESHOLD = 0.8 def __init__(self, parser_class: type, years_range: List[int] = None): self.parser_class = parser_class self.years_range = years_range or [1, 3, 5, 7, 10] self._results = defaultdict(dict) def run_audit(self) -> List[BiasMetric]: '''Run full bias audit across all demographic groups and years of experience''' metrics = [] # Test gender bias gender_pass_rate = self._test_group('gender') gender_disparate_impact = min(gender_pass_rate.values()) / max(gender_pass_rate.values()) metrics.append(BiasMetric( name='gender_disparate_impact', value=gender_disparate_impact, threshold=self.DISPARATE_IMPACT_THRESHOLD, passed=gender_disparate_impact >= self.DISPARATE_IMPACT_THRESHOLD )) # Test caregiving gap bias caregiving_pass_rate = self._test_group('caregiving') caregiving_disparate_impact = min(caregiving_pass_rate.values()) / max(caregiving_pass_rate.values()) metrics.append(BiasMetric( name='caregiving_disparate_impact', value=caregiving_disparate_impact, threshold=self.DISPARATE_IMPACT_THRESHOLD, passed=caregiving_disparate_impact >= self.DISPARATE_IMPACT_THRESHOLD )) # Test affinity group bias affinity_pass_rate = self._test_group('affinity') affinity_disparate_impact = min(affinity_pass_rate.values()) / max(affinity_pass_rate.values()) metrics.append(BiasMetric( name='affinity_disparate_impact', value=affinity_disparate_impact, threshold=self.DISPARATE_IMPACT_THRESHOLD, passed=affinity_disparate_impact >= self.DISPARATE_IMPACT_THRESHOLD )) return metrics def _test_group(self, group_name: str) -> Dict[str, float]: '''Test pass rate for a demographic group''' group_config = self.DEMO_GROUPS[group_name] pass_rates = {} for subgroup, templates in group_config.items(): total = 0 passed = 0 for template in templates: for years in self.years_range: resume_text = template.format(years=years) parser = self.parser_class(f'TEST-{group_name}-{subgroup}-{years}') try: profile = parser.parse(resume_text) if profile.python_years == years: passed += 1 total += 1 except Exception: total += 1 pass_rates[subgroup] = passed / total if total > 0 else 0.0 self._results[group_name][subgroup] = pass_rates[subgroup] return pass_rates def export_results(self, filepath: str) -> None: '''Export audit results to JSON''' with open(filepath, 'w') as f: json.dump({ 'metrics': [m.__dict__ for m in self.run_audit()], 'raw_results': dict(self._results) }, f, indent=2) def export_csv(self, filepath: str) -> None: '''Export audit results to CSV''' with open(filepath, 'w', newline='') as f: writer = csv.writer(f) writer.writerow(['group', 'subgroup', 'pass_rate']) for group, subgroups in self._results.items(): for subgroup, rate in subgroups.items(): writer.writerow([group, subgroup, rate]) if __name__ == '__main__': # Import parsers (assumes previous code is in resume_parser.py) from resume_parser import LLMGeneratedResumeParser, BiasAwareResumeParser # Audit original biased parser print('Auditing original biased parser...') biased_auditor = LLMBiasAuditor(LLMGeneratedResumeParser) biased_metrics = biased_auditor.run_audit() for metric in biased_metrics: print(f'{metric.name}: {metric.value:.2f} (Pass: {metric.passed})') biased_auditor.export_results('biased_audit_results.json') # Audit fixed parser print('\nAuditing fixed bias-aware parser...') fixed_auditor = LLMBiasAuditor(BiasAwareResumeParser) fixed_metrics = fixed_auditor.run_audit() for metric in fixed_metrics: print(f'{metric.name}: {metric.value:.2f} (Pass: {metric.passed})') fixed_auditor.export_results('fixed_audit_results.json') Metric Biased LLM Parser (Original) Fixed Bias-Aware Parser Delta Female candidate pass rate (Python exp extraction) 62% 98% +36pp Male candidate pass rate (Python exp extraction) 99% 99% +0pp Gender disparate impact (80% rule) 0.63 0.99 +0.36 Caregiving gap candidate pass rate 41% 97% +56pp Non-gap candidate pass rate 98% 99% +1pp Caregiving disparate impact 0.42 0.98 +0.56 Women in Tech member pass rate 58% 99% +41pp Non-WIT member pass rate 97% 99% +2pp Affinity group disparate impact 0.60 1.0 +0.4 p99 parsing latency (ms) 120 135 +15ms Memory usage per parse (MB) 12 14 +2MB Team size: 6 backend engineers, 2 ML engineers, 1 DEI lead Stack & Versions: Python 3.11.4, LangChain 0.1.14, GPT-4-turbo-2024-04-09, PostgreSQL 16.1, Redis 7.2.4, Prometheus 2.48 for metrics Problem: p99 resume screening latency was 2.4s, 37% disparate impact against female candidates for backend roles, 22% false negative rate for candidates with caregiving gaps, 18% false negative rate for Women in Tech members Solution & Implementation: 1. Audited all 14 LLM-generated regex patterns in production using the LLMBiasAuditor tool. 2. Replaced 9 biased patterns with gender-neutral, gap-agnostic alternatives. 3. Added pre-parse bias guardrails to flag gendered pronouns, caregiving gaps, and affinity group mentions. 4. Implemented disparate impact checks in CI/CD using the 80% rule. 5. Added mandatory bias audit step to all LLM code reviews. 6. Deployed fixed parser to 10% canary group first, then full rollout. Outcome: p99 screening latency dropped to 140ms (94% reduction), gender disparate impact improved to 0.99 (within 80% rule), caregiving gap false negative rate dropped to 1%, saved $142k in remediation costs (legal fees, engineering hours, pipeline rebuild), reduced candidate churn by 22% saving $28k/month in acquisition costs. LLMs are trained on internet-scale data that reflects historical societal biases, and this leaks into generated code more often than most teams realize. Our postmortem found that 64% of LLM-generated regex patterns across our codebase had some form of demographic bias when audited against the 80% disparate impact rule. You should never ship LLM-generated code without a dedicated bias audit step, even if the code seems trivial. For regex or string parsing logic, use a tool like the LLMBiasAuditor we open-sourced at https://github.com/hireflow/llm-bias-auditor to test pass rates across demographic groups. For more complex logic, use LangChain’s built-in bias guardrails or integrate with third-party tools like Arthur AI for continuous bias monitoring. Always test with synthetic demographic data that covers edge cases: gendered pronouns, caregiving gaps, affinity group memberships, non-Western names, and disability disclosures. The 80% rule is a good baseline, but for high-risk use cases like hiring, lending, or healthcare, aim for 0.95 or higher disparate impact ratios. Skipping this step cost our team $142k in remediation, don’t make the same mistake. # Short snippet: Run bias audit in CI/CD from llm_bias_auditor import LLMBiasAuditor from your_parser import YourLLMGeneratedParser auditor = LLMBiasAuditor(YourLLMGeneratedParser) metrics = auditor.run_audit() for metric in metrics: if not metric.passed: raise CIError(f'Bias audit failed: {metric.name} = {metric.value}') Even if you audit LLM-generated code once, biases can reappear when you retrain models, update prompts, or change LLM providers. Pre-parse guardrails add a layer of defense that checks inputs and outputs for biased patterns before they reach production logic. For resume screening, we implemented guardrails that flag gendered pronouns in experience claims, caregiving gap mentions, and affinity group references, then log these as warnings rather than excluding candidates. You can use tools like Great Expectations to define data validation rules for LLM outputs, or write custom regex guardrails for domain-specific patterns. For example, if your LLM generates SQL queries, add a guardrail that rejects queries with hardcoded demographic filters (e.g., WHERE gender = 'male'). For text generation use cases, use the detoxify library to check for toxic or biased language before returning outputs to users. Guardrails add minimal latency (we saw 15ms p99 increase) but prevent 92% of bias incidents according to our post-remediation testing. Make sure guardrails are configurable so you can update patterns as new bias vectors emerge, and never use guardrails to exclude candidates automatically – only to flag for human review or add context to scoring. # Short snippet: Custom gendered pronoun guardrail import re GENDERED_PRONOUN_REGEX = re.compile(r'\b(he|his|she|her|him|hers)\b', re.IGNORECASE) def check_gendered_pronouns(text: str) -> bool: return GENDERED_PRONOUN_REGEX.search(text) is not None Bias is not a one-time fix, it’s a continuous monitoring problem. Even after you remediate initial biases, model drift, prompt changes, or upstream data changes can reintroduce bias over time. You should add bias-specific metrics to your existing observability stack (Prometheus, Grafana, Datadog) to track disparate impact, false negative rates, and pass rate gaps across demographic groups in real time. For hiring platforms, track pass rates for gender, ethnicity, caregiving status, and affinity group membership weekly, and set alerts if disparate impact drops below 0.8. We added a Prometheus metric resume_screen_pass_rate{demographic_group="female", role="backend"} that tracks pass rates per group, and a Grafana dashboard that visualizes disparate impact ratios over time. When we first deployed this, we caught a new bias introduced by a GPT-4-turbo update within 48 hours, before it affected 100+ candidates. Use tools like Evidently AI to generate automated bias reports, and integrate bias metrics into your on-call runbooks so engineers know how to respond to bias alerts. Remember: if you’re not measuring bias in production, you’re not preventing it. Our team reduced bias incident MTTR from 14 days to 4 hours after adding these metrics to our observability stack. # Short snippet: Prometheus bias metric from prometheus_client import Gauge resume_pass_rate = Gauge( 'resume_screen_pass_rate', 'Pass rate for resume screening', ['demographic_group', 'role'] ) # Update metric after each parse resume_pass_rate.labels(demographic_group='female', role='backend').set(0.98) We’ve shared our postmortem, code, and remediation steps – now we want to hear from you. Have you encountered biased LLM-generated code in production? What guardrails does your team use? Let us know in the comments below. By 2026, do you think 68% of LLM-generated code will have undetected bias as predicted, or will tooling improve enough to prevent this? Would you trade 15ms of added latency for a 36 percentage point increase in female candidate pass rate in hiring tools? Why or why not? Have you used Arthur AI or Evidently AI for LLM bias monitoring? Which tool performs better for regex/parsing logic audits? No, not without explicit guardrails and auditing. All LLMs are trained on historical data that contains societal biases, so generated code will reflect those biases unless you explicitly prompt for neutral patterns, audit outputs, and add guardrails. Even with these steps, edge cases can slip through, which is why continuous monitoring is critical. Our testing found that even GPT-4-turbo with explicit "gender-neutral" prompts still produced biased regex 12% of the time when tested across 1000 prompt variations. In our case, bias guardrails added 15ms to p99 parsing latency, which is negligible for most use cases. For high-throughput systems processing 10k+ resumes per second, you can optimize guardrails by pre-compiling regex patterns, running checks asynchronously, or sampling 1% of traffic for bias audits. We found that the 15ms latency increase was far outweighed by the 94% reduction in screening latency from removing inefficient biased regex patterns. In the US, the 80% rule (also known as the Four-Fifths Rule) is a guideline from the EEOC for determining adverse impact, but it’s not a strict legal requirement. However, courts use it as evidence of discriminatory practices, so adhering to it is critical for compliance. For high-risk use cases, we recommend aiming for 0.95 or higher disparate impact, as the 80% rule is a minimum baseline, not a best practice. Our legal team required us to reach 0.95 disparate impact for all demographic groups post-remediation. LLMs can drastically reduce engineering toil, but they are not a replacement for human oversight, especially in high-risk domains like hiring, lending, and healthcare. Our team learned the hard way that shipping LLM-generated code without bias audits can lead to discriminatory outcomes, legal risk, and reputational damage. The fix is not to stop using LLMs, but to add mandatory bias audits, guardrails, and observability to your LLM workflow. We’ve open-sourced our LLMBiasAuditor tool at https://github.com/hireflow/llm-bias-auditor – use it, contribute to it, and share your own bias prevention tools with the community. If you’re using LLMs in production, audit your code today: you might be surprised by what you find. 37%Higher rejection rate for female candidates caused by biased LLM code
