Earlier this morning, a backup process interrupted a database node that is part of the CloudPBX system. In the resulting state, the node continued to accept connections but was unable to service them, leading to stuck queries, updates and transactions. Because the node appeared to be alive, the mechanisms that usually handle node failures did not trigger.
The affected node was taken offline and recovered, and the underlying issues with the backup process that caused the incident have been addressed. In addition, we’ve identified and implemented an improved way of detecting node availability that covers all existing scenarios, including this one.
The incident affected approximately 30% of requests on two CloudPBX home servers, impacting approximately 4% of users.
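
For readers curious about the failure mode described above, a node that accepts connections but cannot service them, the general idea of a stricter availability check can be sketched as follows. This is an illustrative sketch only, not the check we deployed: the function name `node_is_healthy` and the three probe functions are hypothetical, and a real probe would open a connection and run a trivial query against the node. The point is that the check passes only if real work completes within a deadline, so a stuck node is treated the same as a dead one.

```python
import threading
import time
from typing import Callable


def node_is_healthy(run_probe: Callable[[], None], timeout_s: float = 2.0) -> bool:
    """Return True only if run_probe finishes without error within timeout_s.

    run_probe is a hypothetical callable that should connect to the node
    AND execute a trivial query, not merely open a connection.
    """
    outcome = {"ok": False}

    def worker() -> None:
        try:
            run_probe()
            outcome["ok"] = True
        except Exception:
            # Connection refused, query error, etc. all count as unhealthy.
            outcome["ok"] = False

    # A daemon thread lets the caller move on even if the probe never returns.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)

    # A probe that is still running after the deadline is stuck: unhealthy.
    return (not thread.is_alive()) and outcome["ok"]


# Stand-ins for real database probes, for illustration only.
def responsive_probe() -> None:
    """Simulates a node that answers the query promptly."""


def stuck_probe() -> None:
    """Simulates a node that accepted the connection but never answers."""
    time.sleep(1.0)


def refused_probe() -> None:
    """Simulates a node that is fully down."""
    raise ConnectionRefusedError("node down")
```

In this sketch, `node_is_healthy(stuck_probe, timeout_s=0.1)` reports the node as unhealthy even though the simulated connection was accepted, which is the case a connection-only check would miss.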