connect.backdrop.cloud not loading
Incident Report for Impero Connect
Postmortem

A core component of the Connect system, which ordinarily runs in a paired-node configuration (two server nodes performing the same function), became unstable. While troubleshooting, we found that one of the nodes was low on memory and needed to be resized to allocate more. The resize requires the target node to be restarted, which takes it out of service for the duration of the restart, usually around 2 minutes. When the target node was restarted, the whole Connect platform failed, which should not have happened.
At this point we found that the remaining node, which should have taken the load temporarily, did not have the service for this part of the system running. Focus then shifted to getting the service running on the primary node, because the secondary node (the targeted node) could not start the service until it was able to communicate with the primary node. The underlying cause of the service not starting on the primary node was corruption in one of the many message queues it is designed to process.

We have learnt from this incident that we need to add monitoring to our systems to better identify failures of this type. We have also noted that, for this type of operation, we need to add steps to our processes to check the health of all nodes running in pairs or sets: that they have the expected service running, and that they are able to cope with temporary increases in load, so that a standard operation of this kind does not result in a catastrophic failure in the future.
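
As an illustration only, the pre-operation check described above could be sketched as follows. This is a minimal sketch under stated assumptions, not our actual tooling: the `NodeStatus` fields, the `preflight_restart_check` function, and the 0.9 headroom threshold are all hypothetical, and load is assumed to be expressed as a fraction of each node's capacity.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    # Hypothetical snapshot of one node in a pair or set.
    name: str
    service_running: bool  # is the expected service up on this node?
    load: float            # current load as a fraction of capacity (0.0 to 1.0)


def preflight_restart_check(target, peers, headroom=0.9):
    """Return the reasons why restarting `target` is unsafe (empty list = safe)."""
    problems = []

    # Every remaining node should have the expected service running.
    for peer in peers:
        if not peer.service_running:
            problems.append(f"{peer.name}: service not running")

    healthy = [peer for peer in peers if peer.service_running]
    if not healthy:
        problems.append("no remaining node can take over the service")
        return problems

    # Assumption: the target's load is spread evenly across healthy peers.
    extra = target.load / len(healthy)
    for peer in healthy:
        if peer.load + extra > headroom:
            problems.append(f"{peer.name}: insufficient capacity for failover load")

    return problems


if __name__ == "__main__":
    # Mirrors the incident: the remaining node did not have the service running.
    primary = NodeStatus("primary", service_running=False, load=0.0)
    secondary = NodeStatus("secondary", service_running=True, load=0.4)
    for reason in preflight_restart_check(secondary, [primary]):
        print("BLOCKED:", reason)
```

Returning a list of reasons rather than a single pass/fail value means an operator sees every failed check at once before deciding whether to proceed with the restart.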

Whilst this incident is regrettable, we believe that the lessons learnt from it will help to strengthen our operations and contribute to the future stability of Impero Connect.

Posted Mar 04, 2024 - 09:18 UTC

Resolved
This incident has been resolved.
Posted Feb 29, 2024 - 17:15 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 29, 2024 - 14:23 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Feb 29, 2024 - 12:55 UTC
Investigating
We are currently investigating this issue.
Posted Feb 29, 2024 - 10:55 UTC
This incident affected: Impero Connect Portal (Remote access services, Login, Impero Connect Portal frontend, OnDemand).