
Audit Logging Looked Like a Weekend Project. It Took Us 3 Months.

DEV Community
GrimLabs

When our PM dropped the ticket into the sprint, it said "Add audit logging for user actions. Estimated: 3 story points." Three story points. That's maybe a day and a half of work. Create an audit_logs table, write some middleware to capture events, add a UI to display them. Ship it Friday, move on to real features.

That was in January. We shipped something usable in late March, and honestly it still wasn't complete.

There's a comment on Hacker News that I think about a lot. Someone said something like "audit logging is one of those problems that looks trivially simple from the outside and then destroys your sprint velocity for a quarter." That's exactly right.

The core concept is simple. Something happens, you record it. User logs in, you write a row. User deletes a record, you write a row. How hard can it be?

Here's where it falls apart. The first question, and you'll spend two weeks debating it: what do you actually log? Every API call? Every database write? Every button click? Just "important" actions? Who decides what's important? For SOC 2, you need to log access to sensitive data, changes to permissions, authentication events, and changes to security configurations at minimum. But your auditor might want more. And different customers want different things.

We started with a simple approach:

```typescript
// Version 1: the naive approach
// This is what "3 story points" looks like
interface AuditLog {
  id: string;
  userId: string;
  action: string;
  timestamp: Date;
}

async function logAction(userId: string, action: string) {
  await db.auditLogs.create({
    id: generateId(),
    userId,
    action,
    timestamp: new Date(),
  });
}

// Usage
await logAction(user.id, 'deleted_invoice_123');
```

Looks fine, right? But now your auditor asks: "What was the state of the invoice before deletion? Who had access to it? What IP address did the request come from? Can you prove this log entry wasn't modified after the fact?"
Suddenly you need:

```typescript
// Version 5 (yes, version 5): what you actually need
interface AuditEvent {
  id: string;
  eventType: string;
  actor: {
    id: string;
    email: string;
    role: string;
    ipAddress: string;
    userAgent: string;
    sessionId: string;
  };
  target: {
    type: string;
    id: string;
    name: string;
  };
  changes: {
    field: string;
    oldValue: unknown;
    newValue: unknown;
  }[];
  context: {
    tenantId: string;
    requestId: string;
    source: string; // 'web' | 'api' | 'system' | 'migration'
  };
  metadata: Record<string, unknown>;
  timestamp: Date;
  integrity: {
    hash: string;
    previousHash: string;
  };
}
```

That's not 3 story points.

Recording WHAT happened is easy. Recording the state before and after is surprisingly hard. You basically need to capture a diff of the entity for every change. For simple updates, you can compare the old and new values:

```typescript
// Capturing before/after state
function captureChanges<T extends Record<string, unknown>>(
  before: T,
  after: T
): Array<{ field: string; oldValue: unknown; newValue: unknown }> {
  const changes: Array<{ field: string; oldValue: unknown; newValue: unknown }> = [];
  for (const key of Object.keys(after)) {
    if (JSON.stringify(before[key]) !== JSON.stringify(after[key])) {
      changes.push({
        field: key,
        oldValue: before[key],
        newValue: after[key],
      });
    }
  }
  return changes;
}
```

But what about nested objects? Arrays? Related records? If someone changes a user's role, do you also log the change to every permission that role grants? What about cascading deletes? We went through five iterations of our change capture logic before it handled all the edge cases.

This was the one that really surprised us. Adding an audit log write to every API endpoint sounds harmless until you realize that some endpoints handle thousands of requests per minute. Writing to the audit log synchronously adds latency to every request. Writing asynchronously means you might lose events if the queue backs up or the process crashes.
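The nested-object question above is where a flat field-by-field comparison first breaks down. One way to handle it, as a sketch rather than our production code (`deepCaptureChanges`, `isPlainObject`, and the dot-separated field paths are illustrative names, not from our codebase), is to recurse into plain objects and fall back to whole-value comparison for arrays and everything else:

```typescript
type FieldChange = { field: string; oldValue: unknown; newValue: unknown };

// True only for plain objects (not arrays, Dates, null, class instances).
function isPlainObject(v: unknown): v is Record<string, unknown> {
  return (
    typeof v === 'object' &&
    v !== null &&
    !Array.isArray(v) &&
    Object.getPrototypeOf(v) === Object.prototype
  );
}

// Recursively diff two entities, emitting dot-separated field paths
// like "address.city". Arrays and scalars are compared as whole values.
function deepCaptureChanges(
  before: Record<string, unknown>,
  after: Record<string, unknown>,
  prefix = ''
): FieldChange[] {
  const changes: FieldChange[] = [];
  const keys = new Set([...Object.keys(before), ...Object.keys(after)]);
  for (const key of keys) {
    const path = prefix ? `${prefix}.${key}` : key;
    const oldValue = before[key];
    const newValue = after[key];
    if (isPlainObject(oldValue) && isPlainObject(newValue)) {
      // Recurse so the audit entry names the exact field that changed.
      changes.push(...deepCaptureChanges(oldValue, newValue, path));
    } else if (JSON.stringify(oldValue) !== JSON.stringify(newValue)) {
      changes.push({ field: path, oldValue, newValue });
    }
  }
  return changes;
}
```

This still punts on the genuinely hard cases the paragraph above raises (related records, cascading deletes), which no generic diff can answer for you; those are policy decisions, not code.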
We ended up with a buffered async approach:

```typescript
// Batch audit events to avoid per-request write overhead
class AuditBuffer {
  private buffer: AuditEvent[] = [];
  private flushInterval: NodeJS.Timeout;

  constructor(private batchSize = 100, private flushMs = 1000) {
    this.flushInterval = setInterval(() => this.flush(), flushMs);
  }

  async record(event: AuditEvent) {
    this.buffer.push(event);
    if (this.buffer.length >= this.batchSize) {
      await this.flush();
    }
  }

  private async flush() {
    if (this.buffer.length === 0) return;
    const batch = this.buffer.splice(0);
    await db.auditEvents.createMany({ data: batch });
  }
}
```

Even this has problems. What happens to the buffer if the server restarts? What about multi-instance deployments where each instance has its own buffer? We ended up needing a message queue (Redis Streams) to handle this reliably.

Writing logs is only half the problem. Reading them is the other half. Your compliance team needs to search logs by user, by time range, by action type, by affected resource. Your customer admin panel needs to show a filtered activity feed.

Put this in a regular SQL table and queries get slow fast. A moderately active SaaS app can generate millions of audit events per month. You need proper indexing, partitioning, and probably archival to cold storage. NIST SP 800-92 says audit log retention should be driven by your compliance requirements: SOC 2 typically wants 12 months, and some regulated industries require 7 years. That's a LOT of data.

The whole point of an audit log is that it can't be changed. If someone can modify the logs, they're worthless for compliance. But most databases let anyone with write access UPDATE or DELETE rows. You need to think about:

- Database-level protections (no UPDATE/DELETE on audit tables)
- Application-level protections (no code path that modifies audit records)
- Cryptographic integrity (hash chains that detect tampering)
- Access controls (who can even read the audit logs?)
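The hash chain mentioned in that last list is less exotic than it sounds: each event's hash covers its own payload plus the previous event's hash, so altering or deleting any row invalidates every hash after it. A minimal sketch, using Node's built-in `crypto` module (the `ChainedEvent` shape, `GENESIS` constant, and function names here are illustrative, not a real library's API):

```typescript
import { createHash } from 'node:crypto';

interface ChainedEvent {
  payload: string;      // canonical JSON of the audit event
  hash: string;
  previousHash: string;
}

// Sentinel "previous hash" for the first event in the chain.
const GENESIS = '0'.repeat(64);

// Append an event, linking it to the current chain head.
function appendEvent(chain: ChainedEvent[], payload: string): ChainedEvent {
  const previousHash = chain.length ? chain[chain.length - 1].hash : GENESIS;
  const hash = createHash('sha256')
    .update(previousHash)
    .update(payload)
    .digest('hex');
  const event: ChainedEvent = { payload, hash, previousHash };
  chain.push(event);
  return event;
}

// Recompute every link; returns false if any event was altered,
// reordered, or removed from the middle of the chain.
function verifyChain(chain: ChainedEvent[]): boolean {
  let previousHash = GENESIS;
  for (const event of chain) {
    const expected = createHash('sha256')
      .update(previousHash)
      .update(event.payload)
      .digest('hex');
    if (event.previousHash !== previousHash || event.hash !== expected) {
      return false;
    }
    previousHash = event.hash;
  }
  return true;
}
```

Note this only detects tampering; it doesn't prevent someone with full database access from rewriting the entire chain. Real deployments typically anchor the head hash somewhere outside the attacker's reach (a separate system, or a periodic external timestamp).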
If I could start over, here's what I would do differently.

Start with a clear schema from day one. Don't iterate your audit event structure five times. Look at standards like CloudEvents and base your schema on something proven.

Use a separate datastore. Don't put audit logs in your main application database. Use a dedicated store that's optimized for append-only writes and time-range queries.

Build the query UI early. Your compliance team will need to search logs before your admin panel is ready. Build a basic search interface from the start, not as an afterthought.

Consider using a library or service. I got tired of building audit logging from scratch for every project, so I built AuditKit. It's open source, self-hostable, and handles the hard parts (hash chains, retention policies, tenant isolation) so you don't have to. The build vs. buy analysis should happen before you start building, not after you've spent three months on it.

If you're building audit logging from scratch, here's a more honest estimate:

- Basic event capture: 1-2 weeks
- Before/after change tracking: 1-2 weeks
- Performance optimization: 1-2 weeks
- Query and search UI: 2-3 weeks
- Immutability and integrity: 1-2 weeks
- Multi-tenant isolation: 1-2 weeks
- Testing and edge cases: 2-3 weeks

Total: 9-16 weeks of engineering time. Not 3 story points. Plan accordingly.