We are back up!

csl

Administrator
Site Staff
Apologies for the outage just now; our hosts had a large network issue! We should be getting a full report in due course.
 
From our hosts...

Yesterday afternoon, Wednesday 6th March at 14:34 GMT, we experienced
a partial network outage which affected approximately 30% of our
customers on our IPv4 network. Our IPv6 network was unaffected.

What went wrong was an automated configuration error on an internal
router, which resulted in complex internal routing issues and some
network unavailability. Our monitoring systems alerted us to this
immediately, and we then started work on resolving the problem.

In accordance with our procedures, we attempted a manual failover of
the affected routers; unfortunately, due to the nature of the problem,
this did not resolve the configuration error. The next step was to
roll back the configuration on the affected routers. After this was
completed at 14:55 GMT, we estimate that normal service was restored
for approximately 80% of affected customers. The rollback also enabled
us to trace the source of the configuration error, so we were able to
manually remove all the erroneous additional configuration and return
all service to normal.

Normal service for all customers was resumed at 15:35 GMT.

Having determined exactly what went wrong with the automated
configuration, we are already testing a patch to our systems to ensure
that this never happens again. We are also updating our rollback
procedure to include some further tests.

We should also add that, as part of our planned core network upgrades
over the coming months, we will be able to remove much of the
complexity in our internal network. By removing this complexity, we
aim to greatly reduce the chances of this type of configuration
error.

I would like to apologise profusely for the impact that this outage
may have had on your business and your ability to use Memset
services. Although our rollback procedures worked well in resolving
this issue, we know there is always room for improvement. As such, we
will be reviewing this outage with two objectives: firstly, to speed
up the time it takes to recover from this type of complex routing
issue; and secondly, to investigate ways in which we can reduce the
chances of errors like this in our automated internal router
configuration.

Again, please accept our sincere apologies.

Alex.

----
Alex Coke-Smyth
Operations Manager
 
Translated: they had a problem, it affected some people, they eventually fixed it, they are trying to make sure it doesn't happen again, oh and they are sorry it happened in the first place....... Simples
 
A hypothetical explanation:-

Over-brained network configuration "expert" wants a salary increase - leaves a bug in the system - lets it fall over - bosses panic - "expert" rides in on his white charger and fixes it in fairly short time - JOB DONE. He has a lever come next review time. Simples!!

Seen it done when I worked in the wonderful world of I.T. Not seriously suggesting it has happened here - but it could have!!

Ian
 
Sorry about the downtime this morning. Our hosts had another network outage at about 0740 just as I was getting ready to leave! :mad:
 
I sat down with my breakfast bowl of porridge and mug of tea, clicked the SD button, and... nothing :eek:
I had to look at pigeon watch and agbbs instead - totally ruined my start to the day.

Neil. :)
 
Here's his latest excuse :lol:

Dear Alex,

This morning, Friday 22nd March at 07:39 GMT, we experienced a
catastrophic hardware failure of one of our core switches. The failure
caused a power surge, which resulted in a section of our core network
going offline. We believe this affected the following servers:

* matthae1

To bring this section of the core network back online, we had to
replace the failed core switch with standby equipment. It was also
necessary to manually bring other core switches back online after the
effects of the power surge. By approximately 09:00 GMT we had restored
service to 50% of affected systems, with 100% of core network issues
resolved by 09:36 GMT.

Unfortunately, the device that failed was due to be replaced in the
coming weeks as part of our core network upgrades. The primary goals
of this upgrade are to provide additional performance in our core
network and to provide the resilience required to avoid outages
related to this type of failure.

I would like to offer our sincere apologies for the impact that this
outage may have had on your business and your ability to use Memset
services. As a result of this outage, we will be looking to complete
the core network upgrades as soon as possible.

Again, please accept our heartfelt apologies.

Alex

--
Alex Coke-Smyth
Operations Manager
 