Various PeopleFluent environments down

Minor incident PeopleFluent Hosted LMS
2022-01-18 09:28 CEST · 1 day, 46 minutes, 28 seconds

Updates

Resolved

After a thorough investigation, our hosting partner has established the cause of the incident: an incorrect configuration in the SQL Cluster. During the server migration of autumn 2021, the database servers were migrated first. In this process, the configuration was copied from the old environments to the extent possible, including the availability monitoring. In the months that followed, the application environments were migrated one by one. That process was concluded in November. At the end of December, the entire old infrastructure was taken down. When certain environments were rebooted in the course of regular maintenance, it became apparent that the configuration of the database servers (as copied from the old ones) did not meet the requirements of our renewed network. Where certain settings in Azure should have been checked, the checks instead tested the availability of a server within the old (and no longer existing) infrastructure. The monitoring that should have flagged this had also not been adapted correctly.

In the meantime, the following measures have been taken to prevent this from happening in the future:

  • The bug in the monitoring script has been corrected, so that it now issues a warning, rather than an OK, when a resource is offline;
  • The old, copied configuration setting has been removed and replaced with the new Azure-based check.
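
To illustrate the first measure, the following is a minimal sketch of a check that reports a warning whenever a resource cannot be confirmed online. It is not the hosting partner's actual script: the names `Status`, `check_resource`, and the example resources are hypothetical, and treating a failed probe as a warning is an assumption on our part.

```python
from enum import Enum
from typing import Callable


class Status(Enum):
    OK = "OK"
    WARNING = "WARNING"


def check_resource(name: str, is_online: Callable[[str], bool]) -> Status:
    """Return OK only when the probe positively confirms the resource is online."""
    try:
        online = is_online(name)
    except Exception:
        # Assumption: a probe that fails outright is reported as a warning,
        # never as healthy.
        return Status.WARNING
    return Status.OK if online else Status.WARNING


# Hypothetical resource states standing in for a real availability probe.
states = {"sql-node-1": True, "sql-node-2": False}
for resource in states:
    print(resource, check_resource(resource, lambda n: states[n]).value)
```

In this sketch, OK is the one result that requires positive confirmation, so an offline or unreachable resource cannot be reported as healthy.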

January 19, 2022 · 10:14 CEST
De-escalate

All affected environments are back online. Later today we hope to be able to inform you about the causes of the incident.

January 18, 2022 · 09:44 CEST
Issue

Various PeopleFluent sites appeared to be offline this morning. Our hosting provider is currently restarting the affected environments. Causes are yet to be identified.

January 18, 2022 · 09:30 CEST
