Optus Outage Analysed By SEN’s Network & Communications Engineer, Chris Olsen.
Matt Tett, the managing director of technology testing company Enex TestLab, has reported that there have been only 3 or 4 telecommunications outages of the magnitude of the recent Optus failure in the past 30 years.
And the question on the lips of all and sundry is: “What was the real cause of the problem?”
When the CEO of Optus, Kelly Bayer Rosmarin, was recently interviewed on local radio, she was asked this very question. Her response: “The problem is too technical to explain”.
Bayer Rosmarin now faces the realistic possibility of an A$4 billion compensation bill, a review by the Australian Communications and Media Authority, and an official Senate inquiry; such are the woes of the most elevated.
SEN wanted to pinpoint in greater detail what really caused upward of 10 million homes and over 400,000 Australian businesses to lose both internet and telecommunications connectivity for up to 16 hours last Wednesday. We dug deeper, pushing past the general non-answer provided by Optus.
It turns out that the cause of the network shutdown was that its core routers received incorrect settings from one of the company’s overseas partners – the latest reports suggest this was Singtel – as part of a firmware upgrade, causing a cascading failure, also described as flooding.
Those of you who remember IRC hacking and phreaking back in the 80s and 90s will have some idea how this works. It’s thought this incident was not born of a malevolent cyber security actor, though perhaps it’s too early to rule that out entirely.
It’s believed that the exact firmware fault can be tied to a BGP (Border Gateway Protocol) prefix flood. Most of us in the computer or network security industry have heard about BGP. In essence, it’s a protocol that routes data in a least-cost fashion to the closest next hop.
In this case, the firmware update or change broke BGP and caused it to route data through every path, instead of via the shortest path. As the firmware cascaded through the network, it opened the floodgates on each device, creating a virtual tsunami of data.
The routing table changes in the update propagated through multiple layers in the network and exceeded preset safety levels on the layer 2 routers. As the internal safety mechanisms on these routers were triggered, the only way for them to protect themselves was to disconnect from the Optus IP Core network.
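The self-protection behaviour described above resembles a maximum-prefix limit, where a router drops a BGP session once it has learned more routes than a configured threshold. The following is a minimal toy sketch of that idea, not a model of the actual Optus configuration: the class `BgpSession`, the function `simulate_flood`, the router names and the threshold of 1,000 prefixes are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BgpSession:
    """Toy model of one router's BGP session with a maximum-prefix guard."""
    name: str
    max_prefixes: int  # hypothetical safety threshold, not a real Optus value
    prefixes: set = field(default_factory=set)
    established: bool = True

    def receive(self, announced):
        """Learn announced prefixes; drop the session if the limit is exceeded."""
        if not self.established:
            return  # a torn-down session ignores further announcements
        self.prefixes.update(announced)
        if len(self.prefixes) > self.max_prefixes:
            # Self-protection: disconnect and discard everything learned.
            self.established = False
            self.prefixes.clear()


def simulate_flood(routers, announced):
    """Send the same announcements to every router; return names that disconnected."""
    for router in routers:
        router.receive(announced)
    return [router.name for router in routers if not router.established]


if __name__ == "__main__":
    routers = [BgpSession(f"edge-{i}", max_prefixes=1000) for i in range(3)]
    normal = {f"10.0.{i}.0/24" for i in range(200)}
    flood = {f"10.{i // 256}.{i % 256}.0/24" for i in range(5000)}
    print("down after normal update:", simulate_flood(routers, normal))
    print("down after prefix flood:", simulate_flood(routers, flood))
```

In this sketch every router shares the same threshold, so a single oversized announcement takes all of them offline at once, which is one way a safety mechanism can itself contribute to a network-wide outage.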
To resolve the issue as quickly as possible, technicians had to physically travel to each affected device, of which there are hundreds – some in third-party data centres – and manually revert the firmware to the previous version using a console cable and laptop. Thus, the 16-hour delay for network resurrection.
What can be done to mitigate such risks in the future? Avoiding single-point-of-failure network architecture should be considered when designing core networks to reduce the risk of total network outage. But sometimes, building a backup network is like asking the government to build a duplicate highway in case of an accident – it’s simply impractical.
That means security people need to factor communications redundancy and failover into their system designs.
When relying on third parties to supply updates to core systems, it may be a better idea for national telcos to employ a small network engineering team to test third-party firmware before pushing it out across the whole network. The minor expense is probably justified by the costs that will be incurred by an incident like this massive outage.
As an aside, it was reported on Thursday that Vodafone had been buzzing with customers all day, as a 4-fold increase in activity was detected on its networks. Meanwhile, the Telstra Boost network saw a 5-fold increase in daily sales, while Kogan reported its sales of eSIMs had increased by 400 per cent.
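One common way to apply this kind of testing in practice is a staged (canary) rollout, where an update goes to one device first and then to progressively larger batches, halting and reverting at the first sign of trouble. The sketch below shows the idea under stated assumptions: `staged_rollout` and its callbacks `apply_update`, `is_healthy` and `rollback` are hypothetical names, and a real deployment would depend on the vendor's own tooling.

```python
def staged_rollout(devices, apply_update, is_healthy, rollback, stage_sizes=(1, 5)):
    """Update devices in growing batches; halt and roll back on an unhealthy batch.

    All callbacks are hypothetical hooks supplied by the caller.
    """
    updated = []
    remaining = list(devices)
    sizes = list(stage_sizes)
    while remaining:
        # Use the configured stage sizes first, then everything that is left.
        size = sizes.pop(0) if sizes else len(remaining)
        batch, remaining = remaining[:size], remaining[size:]
        for device in batch:
            apply_update(device)
            updated.append(device)
        if not all(is_healthy(device) for device in batch):
            # Revert every device touched so far and stop the rollout.
            for device in updated:
                rollback(device)
            return {"status": "halted", "rolled_back": updated, "untouched": remaining}
    return {"status": "complete", "updated": updated}


if __name__ == "__main__":
    firmware = {}
    devices = [f"router-{i}" for i in range(10)]
    result = staged_rollout(
        devices,
        apply_update=lambda d: firmware.__setitem__(d, "v2"),
        is_healthy=lambda d: False,  # simulate a faulty update
        rollback=lambda d: firmware.__setitem__(d, "v1"),
    )
    print(result["status"], "untouched devices:", len(result["untouched"]))
```

With a faulty update, only the first canary device is ever touched in this sketch, and the remaining nine keep running their existing firmware.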
As former Prime Minister Malcolm Turnbull stated in relation to the incident, “Please note this as an example of how not to handle a crisis”.