Yesterday afternoon, Wednesday 6th March at 14:34 GMT, we experienced
a partial network outage which affected approximately 30% of our
customers on our IPv4 network. Our IPv6 network was unaffected.
The cause was an error in the automated configuration of an internal
router, which resulted in complex internal routing issues and some
network unavailability. Our monitoring systems alerted us to this
immediately, and we began working to resolve the problem.
In accordance with our procedures, we attempted a manual failover of
the affected routers. Unfortunately, due to the nature of the problem,
this did not resolve the configuration error. The next step was to
roll back the configuration on the affected routers. After this was
completed at 14:55 GMT, we estimate that normal service was restored
for approximately 80% of affected customers. The rollback also enabled
us to trace the source of the configuration error, so we were able to
manually remove all of the erroneous additional configuration and
return all service to normal.
Normal service for all customers was resumed at 15:35 GMT.
Having determined exactly what went wrong with the automated
configuration, we are already testing a patch to our systems to ensure
that this never happens again. We are also updating our rollback
procedure to include some further tests.
We should also add that, as part of our planned core network upgrades
over the coming months, we will be able to remove a lot of the
complexity in our internal network. By removing this complexity, we
aim to greatly reduce the chances of this type of configuration
error.
I would like to apologise profusely for the impact that this outage
may have had on your business and your ability to use Memset
services. Although our rollback procedures worked well in resolving
this issue, we know there is always room for improvement. As such, we
will be reviewing this outage with two objectives. The first is to
reduce the time it takes to recover from this type of complex routing
issue. The second is to investigate ways in which we can reduce the
chances of errors like this in our automated internal router
configuration.
Again, please accept our sincere apologies.
Alex.
----
Alex Coke-Smyth
Operations Manager