ICDSoft has been operating hundreds of hosting servers for nearly twenty years now, and over the years we have significantly standardized and optimized our procedures. This enables us to quickly and effortlessly manage this fleet of servers.
On October 7, the system administrator on shift was working on a problem with the backup system of one of the servers. This problem affected the Backup and Restore features of the hosting Control Panel. All hosting services were fully functional, but the affected users could neither back up their data via the Control Panel, nor restore any data from the automatic daily backups.
Many hosting providers would treat this as a non-urgent problem and leave it for the morning shift to fix. Here at ICDSoft, even problems with backup servers are escalated with the highest level of urgency, as we cannot leave any aspect of our service non-operational, even for a short period of time.
The system administrator working on the server identified the problem – one of the backup servers in the US data center was having problems writing data, so the system administrator had to unmount it.
Our automated procedures and tests raise an alert when a server's configuration differs from the rest of the fleet, so the system administrator decided to apply the same change to the servers in all our data centers.
Unfortunately, the command he ran and tested in the US data center wasn’t applicable to the Hong Kong and European data centers, as the internal IP addresses of the backup servers are different there.
In effect, when the command ran on those servers, it unmounted an important filesystem instead of the $backup filesystem.
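A guard of the following kind could catch this class of mistake before anything is unmounted. It is only a minimal sketch, not the procedure ICDSoft uses: it assumes the backup filesystem is a network mount whose source has the form host:/path and that the mount table is available in the /proc/mounts text format. The function names, the sample addresses, and the mount points are all hypothetical.

```python
from typing import Iterable, Optional

# Hypothetical mount table in /proc/mounts format:
# source, mount point, filesystem type, options, dump, pass.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
10.0.0.5:/export/backup /backup nfs rw,relatime 0 0
/dev/sdb1 /home ext4 rw,relatime 0 0
"""


def find_mount_source(mounts_text: str, mount_point: str) -> Optional[str]:
    """Return the source of the filesystem mounted at mount_point, if any."""
    source = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == mount_point:
            source = fields[0]  # a later entry shadows an earlier one
    return source


def safe_to_unmount(mounts_text: str, mount_point: str,
                    expected_backup_hosts: Iterable[str]) -> bool:
    """Allow an unmount only if mount_point is served by a known backup host.

    Assumption: backup mounts use a "host:/path" source with an IPv4
    address or a hostname. Local devices such as /dev/sdb1 never qualify.
    """
    source = find_mount_source(mounts_text, mount_point)
    if source is None or ":" not in source:
        return False
    host = source.split(":", 1)[0]
    return host in set(expected_backup_hosts)
```

With the sample table, safe_to_unmount(SAMPLE_MOUNTS, "/backup", ["10.0.0.5"]) allows the unmount, while the same call with the backup address of a different data center, or with "/home" as the target, refuses it. The list of expected hosts would have to be defined per data center, since the internal addresses differ between them.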
The change was a “hotfix-style” update. It didn’t enter the deploy procedures, and it applied the fix to the current configuration of the server only. The standardization and internalization of this fix was left for the first shift of system administrators.
Because the change affected only the running configuration, a reboot of each affected server was enough to undo it. We performed those reboots, and after the few minutes a reboot takes, almost all of the US and European servers were back up.
The problems weren’t all solved, unfortunately.
While in the past year we completed a full upgrade of our hosting machines in the USA and Europe, the Hong Kong data center still has some of our previous-generation configurations. Due to this, the servers there needed a “hard reboot” via the KVM management console. This process took some additional time, but in less than 15 minutes, the servers were up and running, and client sites were operating properly again.
Lastly, a couple of the new servers in our US data center had problems booting up. We were aware of these problems and had planned to take care of them in our next scheduled maintenance window, but the reboot forced us to deal with them right away. An additional 20 to 30 minutes were needed to apply the important updates on these servers, after which they were running correctly.
What are we doing to prevent similar issues?
This kind of situation, where a command run on one machine has different effects on different live deployments, is not unexpected, and so far we have avoided it by carefully planning our maintenance in advance. That is not enough, however, and we are building new procedures that will require deployment to test environments before any live change is applied to the entire server fleet. Individual server changes will remain outside this deployment model because, in an emergency, our top priority is always to bring the server back up as quickly as possible.
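The idea of gating fleet-wide changes on a test environment can be sketched as follows. This is an illustration under assumptions, not a description of the procedures being built: staged_rollout, apply_change, and health_check are hypothetical names, and the two callables stand in for whatever tooling actually applies a change and verifies a server.

```python
from typing import Callable, Iterable, List


def staged_rollout(test_hosts: Iterable[str],
                   live_hosts: Iterable[str],
                   apply_change: Callable[[str], None],
                   health_check: Callable[[str], bool]) -> List[str]:
    """Apply a change to test hosts first, then to live hosts.

    The live fleet is touched only if every test host passes its health
    check. The rollout also stops at the first live host that fails, so
    a bad change cannot spread further. Returns the hosts that were
    changed and verified, in order.
    """
    changed: List[str] = []

    # Stage 1: test environment. Any failure aborts before live changes.
    for host in test_hosts:
        apply_change(host)
        if not health_check(host):
            raise RuntimeError(
                f"health check failed on test host {host}; live rollout aborted")
        changed.append(host)

    # Stage 2: live fleet, one host at a time.
    for host in live_hosts:
        apply_change(host)
        if not health_check(host):
            raise RuntimeError(
                f"health check failed on live host {host}; rollout halted")
        changed.append(host)

    return changed
```

A test environment would need at least one host per data center configuration for this gate to catch differences such as the internal IP addresses involved in this incident.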
This was an upsetting incident for our team, as we have not experienced an incident like this in over 15 years. We still have a lot to analyze, and the responsible teams are still preparing their final reports, which we will then need to evaluate.