# 12 practices that make on-call sustainable for small teams
Running reliable infrastructure with a small team? You're probably familiar with this nightmare: the same three engineers getting paged at 2 AM, spending hours on issues that could be automated, and slowly burning out from an unsustainable on-call rotation. I've seen teams of 5-15 engineers maintain 99.9% uptime without killing themselves. Here's how they do it.

## Why traditional on-call fails small teams

Unlike companies with dedicated SRE teams, small engineering teams wear multiple hats. Your backend developer is also your infrastructure engineer, database admin, and on-call responder. Traditional on-call practices designed for large teams don't work here.

## 1. Define clear escalation boundaries

Junior engineers shouldn't be debugging production database issues at 3 AM. Define exactly when to escalate:

- Customer-facing services down for more than 15 minutes
- Any data corruption detected
- Security incidents
- After 30 minutes of unsuccessful troubleshooting

This protects junior engineers from impossible situations and senior engineers from unnecessary wake-ups.

## 2. Write runbooks for sleep-deprived engineers

Your runbooks should work for a sleep-deprived engineer who didn't write them. Include exact commands, expected outputs, and clear escalation points:

```
# Database connection fix - max time: 5 minutes
# If this doesn't work, escalate immediately
1. Check connection pool: docker exec app-container pg_pool_status
2. Expected output: "pool_size: 20, active: <20"
3. If pool exhausted, restart: docker restart app-container
4. Verify in 60 seconds: curl -f https://app.com/health
```

## 3. Route alerts by severity

Too many alerts train engineers to ignore their phones. Route alerts by severity:

- Critical: phone call + SMS
- Warning: Slack ping
- Info: email only

One alert storm shouldn't destroy your team's trust in the monitoring system.

## 4. Suppress downstream alerts

When your database crashes, you don't need 30 alerts about every dependent service. Configure your monitoring to suppress downstream alerts when upstream services fail.
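As a minimal illustration of the dependency idea, independent of any particular monitoring tool, a paging helper can consult a flag that the upstream database check sets while its alert is active. The `alert` function and the flag-file path here are hypothetical, just a sketch of the suppression logic:

```shell
#!/bin/bash
# Sketch of upstream-dependency suppression (hypothetical helper, not a
# real monitoring-tool API). The database health check is assumed to
# create this flag file while the database alert is active.
UPSTREAM_DOWN_FLAG="/var/run/alerts/db_down"

alert() {
  local service="$1" message="$2"
  # Suppress pages for dependent services while the upstream alert is active
  if [ "$service" != "database" ] && [ -f "$UPSTREAM_DOWN_FLAG" ]; then
    echo "suppressed: $service ($message) while database alert is active"
    return 0
  fi
  echo "PAGE: $service - $message"   # stand-in for a real pager call
}
```

In practice, use your monitoring tool's built-in mechanism instead of rolling your own; Prometheus Alertmanager, for example, calls these "inhibit rules."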
Most monitoring tools support this; look for "alert dependencies" or "suppression rules."

## 5. Automate recurring fixes

If your team manually fixes the same issue twice per month, automate it. A common candidate is disk-space cleanup:

```bash
#!/bin/bash
# Auto-cleanup script for disk space alerts
USAGE=$(df /var/log | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$USAGE" -gt 85 ]; then
  # Delete logs older than a week, then reload nginx so it releases
  # handles on the deleted files
  find /var/log -name "*.log" -mtime +7 -delete
  systemctl reload nginx
  echo "Cleaned logs, disk usage now: $(df -h /var/log | tail -1 | awk '{print $5}')"
fi
```

## 6. Make handoffs explicit

Schedule handoffs at specific times, not "whenever." The outgoing engineer should brief their replacement on:

- Current system health
- Ongoing issues
- Scheduled maintenance
- Anything weird they noticed

## 7. Use dedicated incident channels

Create separate Slack channels for incidents. Keep urgent technical discussion away from general team chat, and include stakeholders like customer success when incidents affect users.

## 8. Track early warning signals

Watch for:

- Response times increasing
- Queue depths growing
- Error rates climbing

This gives on-call engineers time to act before complete failure.

## 9. Time-box your investigations

Set investigation limits before switching to restoration mode:

- Performance issues: 30 minutes max
- Service outages: 15 minutes max
- Unknown errors: 45 minutes max

After the time limit, restore from backup, switch to a standby, or escalate. Debug later.

## 10. Build redundant communication channels

Don't rely on Slack alone. Use:

- SMS for critical alerts
- Phone calls for extended outages
- Push notifications via PagerDuty/Opsgenie
- Email as backup

Test these monthly.

## 11. Run blameless post-incident reviews

After incidents, or at least monthly, review what happened:

- What tools would have helped?
- Which runbooks need updates?
- What monitoring gaps exist?

Focus on systemic improvements, not individual blame.

## 12. Set expectations and compensate fairly

Set clear expectations: acknowledge alerts within 15 minutes, begin investigation within 30 minutes. Compensate on-call work with additional pay, time off, or flexibility.

## Where to start

Don't implement everything at once. Start with your biggest pain points. Too many false alerts? Begin with alert routing and grouping. Chaotic incident response? Focus on communication and runbooks. Engineers burning out?
Start with escalation boundaries and compensation.

Implement three or four practices over two to three months, and measure the impact with metrics like mean time to resolution and engineer satisfaction.

Sustainable on-call isn't about eliminating incidents; it's about handling them efficiently without destroying your team. Small teams can maintain reliable systems, but only with practices designed for their constraints. These approaches scale with your team and evolve as your systems grow more complex.

*Originally published on binadit.com*
