AI News Hub Logo

AI News Hub

Building a Document Management Pipeline for Multilingual Safety Data Sheets

DEV Community
Diogo Heleno

Building a Document Management Pipeline for Multilingual Safety Data Sheets If you're developing systems for chemical manufacturers or exporters, you've probably encountered Safety Data Sheets (SDS). These aren't just documents—they're regulatory compliance requirements that can make or break international trade deals. A recent article on SDS translation requirements highlights the complexity of managing these across multiple markets, but what does this mean from a technical implementation perspective? Let's build a document management pipeline that handles the unique challenges of multilingual SDS workflows. SDS documents follow a rigid 16-section structure mandated by regulations like REACH (EU), OSHA HazCom (US), and GB CLP (UK). Unlike marketing content, these documents contain: Standardized hazard codes (H-statements) and precautionary statements (P-statements) Chemical identifiers (CAS numbers, EINECS numbers) Regulatory data that varies by jurisdiction Structured data that feeds into logistics systems The pipeline needs to handle versioning, ensure data consistency across languages, and maintain an audit trail for regulatory compliance. Here's a system design that addresses the core requirements: # Core data models from dataclasses import dataclass from enum import Enum from typing import Dict, List, Optional import uuid class SDSSection(Enum): PRODUCT_ID = 1 HAZARD_ID = 2 COMPOSITION = 3 FIRST_AID = 4 FIREFIGHTING = 5 ACCIDENTAL_RELEASE = 6 HANDLING_STORAGE = 7 EXPOSURE_CONTROLS = 8 PHYSICAL_CHEMICAL = 9 STABILITY_REACTIVITY = 10 TOXICOLOGICAL = 11 ECOLOGICAL = 12 DISPOSAL = 13 TRANSPORT = 14 REGULATORY = 15 OTHER = 16 @dataclass class SDSDocument: id: str product_id: str language: str version: str sections: Dict[SDSSection, str] regulatory_codes: List[str] last_updated: str compliance_status: str The trickiest part is managing regulatory codes that change meaning across jurisdictions. Here's how to structure this: class RegulatoryCodeManager: def __init__(self): self.code_mappings = { 'EU': { 'H315': 'Causes skin irritation', 'P302+P352': 'IF ON SKIN: Wash with plenty of water' }, 'US': { 'H315': 'Causes skin irritation', 'P302+P352': 'IF ON SKIN: Wash with plenty of soap and water' } } def get_localized_code(self, code: str, jurisdiction: str, language: str) -> str: # This would typically hit a database or API base_text = self.code_mappings.get(jurisdiction, {}).get(code) if base_text: return self.translate_regulatory_text(base_text, language) return code def translate_regulatory_text(self, text: str, target_lang: str) -> str: # Integration point with translation memory systems # Regulatory codes have standardized translations pass SDS documents need rigorous version control. Regulators may ask to see the exact version that was current on a specific date: import json from datetime import datetime class SDSVersionManager: def __init__(self, storage_backend): self.storage = storage_backend def create_version(self, document: SDSDocument, changes: Dict) -> str: version_id = f"v{datetime.now().strftime('%Y%m%d_%H%M%S')}" # Store the complete document state version_data = { 'document_id': document.id, 'version_id': version_id, 'timestamp': datetime.now().isoformat(), 'changes': changes, 'complete_document': document.__dict__ } self.storage.save(f"sds_versions/{document.id}/{version_id}.json", json.dumps(version_data)) return version_id def get_version_at_date(self, document_id: str, target_date: datetime) -> Optional[SDSDocument]: # Retrieve the version that was current at a specific date versions = self.storage.list(f"sds_versions/{document_id}/") valid_versions = [ v for v in versions if datetime.fromisoformat(v['timestamp']) str: job_id = str(uuid.uuid4()) # Extract translatable content while preserving structure translatable_segments = self.extract_segments(source_doc) # Submit to translation management system self.api.create_job( job_id=job_id, source_language=source_doc.language, target_languages=target_languages, segments=translatable_segments, domain='chemical_safety', workflow_stages=self.workflow_stages ) return job_id def extract_segments(self, doc: SDSDocument) -> List[Dict]: segments = [] for section, content in doc.sections.items(): # Split content while preserving regulatory codes segments.append({ 'id': f"{doc.id}_{section.value}", 'source_text': content, 'context': f"SDS Section {section.value}", 'locked_terms': self.extract_regulatory_codes(content) }) return segments def extract_regulatory_codes(self, content: str) -> List[str]: # Identify H-codes, P-codes, CAS numbers, etc. that shouldn't be translated import re codes = [] # H-statements (e.g., H315, H302+H332) h_codes = re.findall(r'H\d{3}(?:\+H\d{3})*', content) codes.extend(h_codes) # P-statements p_codes = re.findall(r'P\d{3}(?:\+P\d{3})*', content) codes.extend(p_codes) # CAS numbers cas_numbers = re.findall(r'\d{2,7}-\d{2}-\d', content) codes.extend(cas_numbers) return codes Automated checks can catch common errors before human review: class SDSQualityChecker: def __init__(self): self.required_sections = list(SDSSection) self.regulatory_patterns = { 'h_codes': r'H\d{3}', 'p_codes': r'P\d{3}', 'cas_numbers': r'\d{2,7}-\d{2}-\d' } def validate_document(self, doc: SDSDocument) -> List[str]: errors = [] # Check all 16 sections are present missing_sections = [ section for section in self.required_sections if section not in doc.sections or not doc.sections[section].strip() ] if missing_sections: errors.append(f"Missing sections: {missing_sections}") # Validate regulatory codes are properly formatted for section, content in doc.sections.items(): errors.extend(self.check_regulatory_codes(content, section)) # Cross-reference with source document if available errors.extend(self.check_consistency(doc)) return errors def check_regulatory_codes(self, content: str, section: SDSSection) -> List[str]: errors = [] # Ensure H-codes follow proper format import re h_codes = re.findall(r'H\d+', content) for code in h_codes: if not re.match(r'H\d{3}$', code): errors.append(f"Invalid H-code format in section {section.value}: {code}") return errors def check_consistency(self, doc: SDSDocument) -> List[str]: # Check that hazard codes in section 2 match those referenced in other sections errors = [] section_2_codes = set(re.findall(r'H\d{3}', doc.sections.get(SDSSection.HAZARD_ID, ''))) for section, content in doc.sections.items(): if section != SDSSection.HAZARD_ID: referenced_codes = set(re.findall(r'H\d{3}', content)) invalid_refs = referenced_codes - section_2_codes if invalid_refs: errors.append(f"Section {section.value} references codes not in hazard identification: {invalid_refs}") return errors This system needs to integrate with: PLM/ERP systems for product data synchronization Translation management platforms (Phrase, Lokalise, etc.) Regulatory databases for up-to-date code definitions Document generation tools for final PDF output Audit logging systems for compliance tracking For production deployment: Use event-driven architecture to handle document updates across languages Implement proper backup and disaster recovery for regulatory compliance Set up monitoring for translation workflow bottlenecks Consider geographic distribution based on your target markets Managing multilingual SDS documents is complex, but with the right technical foundation, you can build a system that ensures compliance while streamlining operations for your chemical industry clients. The key is treating these documents as structured data with regulatory constraints, not just text to be translated.