A senior engineer spotted a bug in my pipeline. I fixed it the same day. Here is what I learned.
A few weeks ago I published an article about an event-driven order pipeline I built in .NET with Kafka and Azure Service Bus. Someone left a comment that stopped me in my tracks. Andrew Tan wrote:

> One thing I'd watch: you now have two sources of truth in flight, PostgreSQL and Kafka. If the API crashes after writing to Postgres but before publishing, you've got an order that never gets processed. Have you considered using an outbox pattern or transactional writes to close that gap?

He was right. I had not thought about it properly.

Here is what the original code did when a new order came in:

1. Save the order to PostgreSQL
2. Publish an event to Kafka

Two separate operations. No transaction between them. If the API crashed, ran out of memory, got killed by Kubernetes, or just had a bad moment between step 1 and step 2, the order would exist in the database with a Pending status and never move forward. Nobody would know. No error. No alert. The order would just sit there.

At low volume this probably never causes a visible problem. At scale, or in production with real money on the line, it is a serious reliability issue.

The fix is called the outbox pattern. The idea is simple. Instead of writing to the database and then publishing to Kafka as two separate operations, you write the order and an outbox record in the same database transaction. The outbox record is just a row in a table that says "this event needs to be published." A separate background service then reads unprocessed outbox records, publishes them to Kafka, and marks them as processed.

If publishing fails, the record stays unprocessed and gets retried. If the background service crashes mid-publish, it picks up the same record on restart. The database transaction is the source of truth. Either both the order and the outbox record are committed together, or neither is. There is no window where one exists without the other.

First I added an OutboxMessage model:

```csharp
public class OutboxMessage
{
    public Guid Id { get; set; } = Guid.NewGuid();
    public Guid OrderId { get; set; }
    public string EventType { get; set; } = string.Empty;
    public string Payload { get; set; } = string.Empty;
    public DateTime CreatedAt { get; set; } = DateTime.UtcNow;
    public DateTime? ProcessedAt { get; set; }
    public bool Processed { get; set; } = false;
    public int RetryCount { get; set; } = 0;
    public string? Error { get; set; }
}
```

Then I updated the controller to write both in the same transaction:

```csharp
public async Task<IActionResult> CreateOrder([FromBody] CreateOrderRequest request)
{
    var order = new Order { ... };

    var outboxMessage = new OutboxMessage
    {
        OrderId = order.Id,
        EventType = nameof(OrderEventType.OrderCreated),
        Payload = JsonSerializer.Serialize(order)
    };

    db.Orders.Add(order);
    db.OutboxMessages.Add(outboxMessage);
    await db.SaveChangesAsync(); // one transaction, both or neither

    return CreatedAtAction(nameof(GetOrder), new { id = order.Id }, order);
}
```

No Kafka publish in the controller anymore. The controller just writes to the database and returns.
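The detail that makes this atomic is that both tables hang off the same EF Core context, so a single SaveChangesAsync commits them together. A minimal sketch of that wiring, where the context name OrderDbContext is my assumption rather than something from the repo:

```csharp
// Sketch only: the context name is assumed, not taken from the repo.
public class OrderDbContext : DbContext
{
    public OrderDbContext(DbContextOptions<OrderDbContext> options) : base(options) { }

    // Because both sets live on one context, a single SaveChangesAsync call
    // wraps the order insert and the outbox insert in one database transaction.
    public DbSet<Order> Orders => Set<Order>();
    public DbSet<OutboxMessage> OutboxMessages => Set<OutboxMessage>();
}
```

EF Core gives you this for free: SaveChangesAsync wraps all pending changes in one transaction by default, which is exactly the guarantee the pattern leans on.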
Then I built the OutboxProcessorService as a BackgroundService that polls every 5 seconds:

```csharp
protected override async Task ExecuteAsync(CancellationToken stoppingToken)
{
    while (!stoppingToken.IsCancellationRequested)
    {
        await ProcessOutboxMessagesAsync();
        await Task.Delay(TimeSpan.FromSeconds(5), stoppingToken);
    }
}

private async Task ProcessOutboxMessagesAsync()
{
    var messages = await db.OutboxMessages
        .Where(m => !m.Processed && m.RetryCount < MaxRetries)
        .OrderBy(m => m.CreatedAt)
        .ToListAsync();

    foreach (var message in messages)
    {
        try
        {
            await PublishToKafkaAsync(message);
            message.Processed = true;
            message.ProcessedAt = DateTime.UtcNow;
        }
        catch (Exception ex)
        {
            message.RetryCount++;
            message.Error = ex.Message;
        }

        await db.SaveChangesAsync();
    }
}
```

When an order comes in:

```
INSERT INTO "Orders" ...
INSERT INTO "OutboxMessages" ...
Order f21613da created and outbox message queued
```

Five seconds later:

```
Processing 1 unprocessed outbox messages
Order f21613da published to Kafka topic orders at offset 5
Outbox message published successfully
UPDATE "OutboxMessages" SET "Processed" = true ...
```

The gap is closed. The order and the outbox record live or die together in the same transaction. Kafka gets the event eventually, guaranteed at least once.

After I posted the fix, Andrew came back with more good points. He mentioned that with a 5-second polling interval, you are trading latency for database load. Fine at low volume. At scale you want FOR UPDATE SKIP LOCKED so multiple poller instances do not step on each other. He also asked what happens if Kafka is down. Currently unprocessed records pile up and the poller keeps retrying every 5 seconds with no backoff. That is worth fixing. A dead letter path and an alert on outbox message age would make this production-ready. Both are on the backlog; rough sketches of both ideas follow at the end of this post. The current implementation is correct for a single poller. Horizontal scaling and dead letters are the next iteration.

I almost shipped this without the outbox pattern. The original code worked perfectly in testing. Kafka and PostgreSQL both got their data. No errors. No warnings. The failure mode only shows up when something crashes between two operations that look like one. That is exactly the kind of bug that stays invisible until it costs someone something real.

Public code review from people who know what they are looking at is genuinely valuable. Andrew's comment was worth more than any linter or test suite would have caught here.

Source code: github.com/aftabkh4n/order-pipeline

If you are building event-driven systems and not using the outbox pattern, it is worth understanding. The implementation is not complicated. The reliability guarantee it gives you is significant.
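For the curious, here is roughly what the FOR UPDATE SKIP LOCKED version could look like. This is a sketch under stated assumptions, not code from the repo: it assumes PostgreSQL with the Npgsql EF Core provider, an arbitrary batch size of 50, and a retry cap of 3.

```csharp
// Sketch only. Row locks exist only inside a transaction, so the claim,
// the publish, and the mark-processed update all share one transaction.
await using var tx = await db.Database.BeginTransactionAsync(stoppingToken);

var messages = await db.OutboxMessages
    .FromSqlRaw("""
        SELECT *
        FROM "OutboxMessages"
        WHERE NOT "Processed" AND "RetryCount" < 3
        ORDER BY "CreatedAt"
        LIMIT 50
        FOR UPDATE SKIP LOCKED
        """)
    .ToListAsync(stoppingToken);

foreach (var message in messages)
{
    // same publish / mark-processed logic as the single-poller version
}

await db.SaveChangesAsync(stoppingToken);
await tx.CommitAsync(stoppingToken);
```

Two pollers running this concurrently claim disjoint batches: a row locked by one instance is skipped by the other rather than blocked on, which is the whole point of SKIP LOCKED.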
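And a sketch of the backoff and dead letter idea. NextAttemptAt and DeadLettered are hypothetical new columns on OutboxMessage, not fields from the repo; the shape of the loop is the same as the existing processor.

```csharp
// Sketch only: NextAttemptAt and DeadLettered are assumed new columns.
private async Task ProcessDueMessagesAsync()
{
    // Only pick up messages whose backoff window has elapsed.
    var due = await db.OutboxMessages
        .Where(m => !m.Processed
                    && !m.DeadLettered
                    && m.NextAttemptAt <= DateTime.UtcNow)
        .OrderBy(m => m.CreatedAt)
        .ToListAsync();

    foreach (var message in due)
    {
        try
        {
            await PublishToKafkaAsync(message);
            message.Processed = true;
            message.ProcessedAt = DateTime.UtcNow;
        }
        catch (Exception ex)
        {
            message.RetryCount++;
            message.Error = ex.Message;

            // Exponential backoff: 10s, 20s, 40s... between attempts,
            // so a Kafka outage does not get hammered every 5 seconds.
            message.NextAttemptAt = DateTime.UtcNow
                + TimeSpan.FromSeconds(5 * Math.Pow(2, message.RetryCount));

            // Dead letter: once the cap is hit, stop retrying and leave
            // the row for a human or a separate repair job.
            if (message.RetryCount >= MaxRetries)
                message.DeadLettered = true;
        }

        await db.SaveChangesAsync();
    }
}
```

The age alert Andrew suggested then falls out of a single query: count unprocessed rows whose CreatedAt is older than some threshold, and page if the count is not zero.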
