No time is a good time for an outage. Given the busy period preparing for the move, we put everything on hold and worked a problem which cropped up early Wednesday morning. Before we go any further, we want to apologise for the service interruption. We do all we can to plan work out of hours; sometimes other things fail, and we work to identify and remedy them as quickly as possible. We also review a number of things any time something like this happens: How did we become aware of the event? Can our monitoring trap new events or indications? What actions were taken to recover? What steps will prevent a repeat? These are just a few of the questions we work through.
So, what happened?
Around 2:00am we became aware of poor and patchy network performance in the core network. We reached out to upstream partners and established there was an issue. Isolating the cause was complicated because some brief (sub-3-second) rearrangements were also going on in the core network, so time was spent backing out those changes in case they were having unintended effects.
As troubleshooting proceeded, it initially looked as if there was a hardware issue. The upstream core router is under full vendor maintenance, so replacement parts could be sourced if needed. Further investigation revealed that major National network changes had also taken place overnight, and these were affecting the core router. Once this was identified as the root cause, work was undertaken to work around the issue and restore services.
We spent time bringing routers back online, in some cases visiting sites to do so. Monitoring continued to ensure everything was back to normal and all clients were online.
Thursday update: checks of the core and connections indicate everything is running as expected. We are returning to normal levels of monitoring. Please make contact if you are having any issues.
We have already mentioned some of the behind-the-scenes activity, and you will hear more about this in the future. One thing to note for this event is that to overcome the issue (which was caused outside of our environment and those of our upstream providers) a whole new handling process had to be developed. This would normally be the kind of thing that would take a couple of weeks to design, plan, prototype, test and then implement; three hours of 'sudden surgery' got things back together. Post-op, we are keeping a close eye on things to see that this is solid. While we always work with the goal of avoiding issues in the first place, the extended teamwork meant we could overcome the issue and create a fresh service.
Again, our apologies for the interruption. We are continuing to prepare for the move as well, and as you will see, it places us well for the future.