My CI Runner Was Killed by My Own Script: The Dark Side of Cleanup

Mustafa ERBAY

Towards the end of last month, I started a build job on my self-hosted GitHub Actions runner. It was a job that normally takes 10-15 minutes, but this time it just wouldn't finish. The job seemed stuck, and I wasn't getting any response from the runner. When I tried to connect to the server via SSH, the connection was refused. It felt similar to the OOM scenarios I'd experienced on my VPS where sshd couldn't accept connections, but this time my RAM usage was normal.

After some digging, I realized my runner's heart had stopped beating. The GitHub Actions panel showed the runner as "Offline". The interesting part: the server itself was up, and my other Docker containers were running without issues. The only casualty was my CI runner.

To understand why the runner had died, I connected to the server via the console. My first stop was the `dmesg` output. Nothing surprising there; no kernel-level error, no OOM-killer trigger. When I checked the service with `systemctl status github-runner`, things got even more interesting: the service was `active (exited)`, and there were no error messages in the logs. It was as if someone had gracefully shut the service down. That was the exact moment the "innocent" cleanup script I'd added last week came to mind.

I manage over 13 Docker containers on my own VPS, and disk space can get critical. Docker's build cache and unused images are the worst offenders: with 33 GB of build cache and 23 GB of unused images, my disk can fill up to 100%. Because of this, I had written a script to clean up old build outputs and unnecessary files in the `_work` directory.

⚠️ Chaos of My Own Making

This kind of automation can be a lifesaver, yes. But if it isn't tested enough and the failure scenarios aren't thought through, shooting yourself in the foot becomes inevitable. My scenario was exactly that.

The script I wrote simply deleted files older than a certain age inside the `_work` directory. What I had overlooked was a small detail: the runner itself also operates inside `_work`, so temporary directories like `_temp`, and in some cases even the runner's own binaries or configuration files, could fall within the script's scope. I had already felt the pain of deleting directories inside `_work/_temp` on a GitHub Actions runner once before, but this time I went even further: I hadn't been careful enough with options like `-maxdepth` or `-prune` in the script's `find` command. My target was only the build artifacts, yet the script deleted files that were vital for the runner itself.

The result: the runner service quietly shut down once it could no longer reach the files it needed to keep operating. It was a resource-management disaster, much like my Astro build consuming 2.5 GB of RAM and hitting OOM, except this time it was the disk and the file system.

```bash
# A snippet from the faulty cleanup script (simplified version)
# This command was deleting all files older than 7 days under the _work directory.
# However, the runner's own working files were also included in this scope.
find /home/runner/_work/ -type f -mtime +7 -delete
find /home/runner/_work/ -type d -empty -delete
```

This command swept `/home/runner/_work/` indiscriminately: old build outputs and the files the runner needed to stay alive were treated exactly the same.
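If I were writing that cleanup again today, I'd make `find` incapable of even seeing the runner's internal directories. Below is a minimal sketch of that idea, not my actual production script: the `_tool` and `_actions` directory names and the 7-day threshold are assumptions based on a typical runner layout, so adapt them to your own setup.

```bash
#!/usr/bin/env bash
# Sketch of a safer cleanup: only descend into per-repository checkout
# directories under _work, and never walk the runner's internal folders.
# _tool and _actions are assumed names from a typical runner layout.
set -euo pipefail

RUNNER_WORK=/home/runner/_work
MAX_AGE_DAYS=7   # assumption: same 7-day threshold as the original script

# Top level only: pick repository directories, skip the runner's internals.
find "$RUNNER_WORK" -mindepth 1 -maxdepth 1 -type d \
     ! -name '_temp' ! -name '_tool' ! -name '_actions' -print0 |
while IFS= read -r -d '' repo_dir; do
    # Delete only stale regular files inside each repository checkout,
    # then remove directories that became empty as a result.
    find "$repo_dir" -type f -mtime +"$MAX_AGE_DAYS" -delete
    find "$repo_dir" -type d -empty -delete
done
```

One detail worth knowing: with GNU find, `-delete` implies `-depth`, and `-prune` cannot usefully be combined with it, which is why the sketch filters the top-level directories first instead of trying to prune inside a single `find` invocation.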
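As for the disk pressure that pushed me to write the script in the first place, the build cache and unused images can be reclaimed with Docker's own prune commands instead of a hand-rolled `find`. A rough sketch, where the 168-hour (7-day) filter is just an example threshold:

```bash
# Where is the disk space going? (images, containers, local volumes, build cache)
docker system df

# Drop build-cache entries that haven't been used for 7 days.
docker builder prune --filter 'until=168h' --force

# Remove images not referenced by any container.
docker image prune --all --force
```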
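And if your own runner ever goes quiet like mine did, this is roughly the triage I'd run before blaming the kernel. The unit name `github-runner` matches my setup and will likely differ on yours:

```bash
# Quick triage for a self-hosted runner that shows up as "Offline"
dmesg -T | tail -n 50                           # kernel errors or OOM-killer hits?
systemctl status github-runner                  # active, active (exited), or failed?
journalctl -u github-runner -n 100 --no-pager   # last service log lines
df -h; docker system df                         # is the disk the real problem?
```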