On July 10, 2019, between 10:40 and 11:28 EDT, we experienced a loss of connectivity to our US-based data center (Cyxtera BO3), causing a service outage for customer sites hosted there.
Sequence of Events
All times are in EDT.
10:40 - our monitoring systems reported packet/connectivity loss between the US data center and the rest of the Internet.
10:41 - 10:45 - multiple tests were conducted to determine the cause of the issue. Having lost access to the equipment in the data center, ICDSoft's system administrators opened a ticket with CenturyLink (our main network provider) to check for an outage on their side.
10:45 - 10:50 - our team contacted the technicians at the Cyxtera BO3 data center to check for any problems on their side.
10:50 - 10:55 - it was determined that there was no issue at CenturyLink or Cyxtera, pointing to a possible problem with core network equipment on ICDSoft's side. Cyxtera technicians were dispatched to ICDSoft's cage.
11:00 - Cyxtera's technicians reported that all equipment was powered and seemingly working. ICDSoft's system administrators advised Cyxtera's technicians to reboot all core switches and routers.
11:10 - after a reboot of all core equipment, network connectivity was not restored. At this time, replacement of core switches and routers with spare ones was considered.
11:15 - after reviewing audit/event logs, ICDSoft's system administrators noticed a recent planned and completed update of the complementary management servers, which are directly connected to the core switches through a USB management console.
11:20 - a request to disconnect these management servers was sent to Cyxtera's technicians.
11:25 - Cyxtera's technicians disconnected the management servers and rebooted the switches again; network connectivity was restored at 11:28.
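For readers curious how the 10:40 alert was raised: reachability monitoring of the kind mentioned above can be approximated with a simple TCP probe from an external vantage point. This is a minimal illustrative sketch, not ICDSoft's actual monitoring tooling, and the host/port values are assumptions:

```python
import socket

def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the full TCP handshake, so a True result
        # means the path to the host is up and the service is answering.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, connection refusals, and routing failures alike.
        return False
```

A production monitoring system would run such probes from multiple geographic locations and alert only when several vantage points report loss simultaneously, to avoid false alarms from a single probe's local network issues.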
Root Cause
ICDSoft's network consists of redundant network devices, including the latest data-center-grade Cisco switches. Each of these switches has a complementary management server connected to it through a USB cable and a management console. The sole purpose of the complementary server is to provide direct access to the switch when network connectivity is unavailable for some other reason. The operation of these complementary servers is not expected to affect the operation of the network in any way. The setup was tested before being put into production (including multiple complementary server updates and reboots), and it had been in operation for more than six months.
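For illustration, out-of-band console access of this kind is commonly provided by a tool such as ser2net, which exposes a serial (or USB-to-serial) console over the network. The device path, TCP port, and baud rate below are hypothetical examples, not ICDSoft's actual configuration:

```
# /etc/ser2net.conf (illustrative fragment)
# Expose the switch console attached at /dev/ttyUSB0 on TCP port 2001,
# so administrators can reach the switch even when the network is down.
2001:telnet:600:/dev/ttyUSB0:9600 8DATABITS NONE 1STOPBIT
```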
On the morning of July 10, 2019, a minor OS update caused the complementary management servers (which are non-production machines) to be rebooted. The reboots coincided with crashes of the switches to which these servers were connected. The cause appears to be a bug in the USB management console, which likely sent garbled data to the switches and crashed them. The issue and the debug information will be reported to the switch vendor.
Corrective Actions
The USB management connections have already been replaced with classic RS-232 serial management cables. In addition, maintenance procedures have been updated so that non-urgent maintenance on the complementary servers is performed only within a scheduled maintenance window.