Affected System
-
Routing problems, France datacenter
-
27/03/2012 22:05
- 27/03/2012 23:45
-
Last Updated 28/03/2012 13:56
A problem occurred on the Cisco routers because of a software bug. The patch has now been applied and the problem was resolved by the datacenter staff. This incident was classified as disastrous, and it is the first time it has happened, since network uptime had always been 100% over these years. The official report was:
Wednesday, 28 March 2012, 20:39PM
Hello,
We had a routing problem tonight, caused by a software bug that affected the 2 main routers in Roubaix. These Cisco ASR 9010s aggregate the bandwidth of the Roubaix datacenters (RBX1, RBX2, RBX3, RBX4, RBX5) and the links to Paris, Brussels, Amsterdam, London and Frankfurt. In short, they are the routing core in Roubaix.
This bug is known and is tied to the new cards that we installed at the end of January (24x10G per slot). For no predictable reason, the card detects RAM ECC errors and stops routing packets. Despite this, the card is not declared as failed and stays in the router as if it were healthy.
The other routers keep sending it packets, but nothing on the other side forwards them. This causes a major problem and the network no longer works properly.
The worst kind of failure: one that is not clear-cut.
Tonight, three 24x10G cards on 2 ASR 9010 routers hit this bug at almost the same time. This broke the network into 3 pieces: USA/London/Amsterdam/Warsaw; Roubaix; and Paris/Frankfurt/Madrid/Milan, with packets being drawn into Roubaix. Normally the traffic would have been rerouted, but here it was pulled into Roubaix and blocked there.
As a result, we could not use the network itself to manage the routers and retrieve their logs in order to find the origin of the problem.
We fell back to the old-fashioned way, using rescue/external connections to log in to each backbone router and check whether it was the origin of the issue.
This operation took time, since 2 routers were broken and it took us a while to understand that the problem was due not only to router rbx-g2-a9 but also to rbx-g1-a9.
Once we had restarted the 3 cards, everything went back to normal within 5 minutes.
3 weeks ago, we had already opened a ticket with Cisco regarding the RAM ECC issue. Cisco worked on it and provided, this morning, the software patch to apply to these routers in order to fix the problem. We are going to start the operation tonight. No outage is expected.
We will also focus on how to improve the management of our routers in case the whole backbone goes down for a reason that should never happen.
We know how to deal with this case, but it takes a long time. A very long time.