Reliable Network is the Lifeblood of the Connected World

by Guang Yang | July 21, 2022

Severe impacts of recent network outages

On July 8th, Rogers Communications – the largest Canadian telecom operator – experienced a widespread network outage that disrupted internet and cellular service around the country, blocking payment systems and emergency services and creating chaos for businesses and individuals.

According to a media report, Rogers said the network failure was caused by “a maintenance update in our core network”. “That update caused some of the routers in our system to malfunction and that malfunction caused traffic overload. And as a result of that, the whole system just shuts down.” This failure reminded us of the previous massive outage on Rogers’ network: in April 2021, millions of Rogers’ wireless customers lost service. At that time, the operator blamed the outage on a software upgrade, and media reports indicated that “although not specifically identified, the company’s public statement points to the seat of the problem being in its core network.”

The two outages illustrate that the core network plays a critical role in guaranteeing network reliability, a point echoed by the recent network failure at KDDI in Japan. In early July, nearly 40 million Japanese customers and 260,000 corporate customers were hit by a massive outage of KDDI’s network. KDDI said the failure occurred while a router in its core network was being replaced during regular maintenance. An error prevented voice calls from connecting, and while this was being resolved, KDDI’s network experienced a high volume of VoLTE traffic that further degraded service quality for customers. Network services were fully restored only after 86 hours of disruption, making it one of the most severe network incidents in the telecommunications industry.

A signaling storm can originate not only in the voice system but also in the IoT system. In an October 2021 incident, NTT DOCOMO in Japan experienced a 29-hour outage that affected about 4.6 million people for voice and more than 8.3 million people for data communication. Problems arose when DOCOMO migrated the location information of about 200,000 IoT terminals from old equipment to new equipment. A rollback was then initiated, and the terminals were reverted to the old equipment. This rollback triggered a large number of IoT terminals to re-send location registration requests to the old server. The resulting “signaling storm” quickly congested the network and spread to the core network equipment for voice and data communications, resulting in the massive network outage.

All these outages impacted not only consumers and businesses but also critical infrastructure and services. Rogers’ network failure prevented access to services like 911 and emergency alerting, and the Canadian regulator, the CRTC, has ordered Rogers to provide a comprehensive explanation of the outage. KDDI’s network outage also severely impacted emergency services. According to a media report, emergency calls made through KDDI’s network during the carrier’s massive system failure dropped by nearly half from the previous week. Losing access to three-digit emergency numbers invites the regulators' wrath and damages the operator's reputation. It can also impair the operator’s ability to explore new business opportunities.

Reliability is key for building customers’ trust.

5G development has been accelerating telecom operators’ business transformation. In the consumer market, experience-based pricing plans have been successful in supporting 5G upgrades and improving value in some markets. Experience-based pricing places higher requirements on network quality, including network reliability. Meanwhile, demand for enterprise digital transformation is creating new growth opportunities for telecom operators. For example, Chinese operators have recorded strong growth in the enterprise market, which has become their largest source of revenue growth.

With digital transformation moving forward, telecom operators’ network services will be tightly embedded into the production process of enterprise customers. The quality of service will have a direct impact on the enterprise’s production efficiency, cost, safety, etc. This will require a highly trusted relationship between operators and enterprise customers. It would be impossible for an operator to be involved in the core process of an enterprise digital transformation if it cannot build sufficient trust with the enterprise customer.

Network reliability plays a critical role in building such trust. Therefore, when 5G was standardized, 3GPP set very high requirements on network reliability for the industrial application scenarios. 

The Ultra-Reliable and Low Latency Communications (URLLC) capability has been an icon of 5G networks and is widely advertised when operators explore the enterprise 5G markets. However, the recent network outages indicate that telecom operators still have a long way to go to improve and guarantee network reliability and reach a six-nines standard.

Continuous core network investment is necessary.

Network reliability depends on multiple factors. CSPs should first pay attention to software, particularly the software in the core network, as software has become the heart of telecom systems, and any failure in the core network can lead to a severe outage of all network services. As network virtualization progresses, more and more telecom network elements and services are deployed on COTS hardware that only supports three-nines reliability. Software innovation will play a critical role in achieving five-nines or even six-nines reliability on three-nines COTS hardware platforms.
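
To make the "nines" concrete, the short Python sketch below converts an availability target into allowed downtime per year and shows how redundancy can, in principle, lift three-nines building blocks toward six nines. It is an illustration only: the function names are our own, and the redundancy formula assumes that replicas fail independently and that failover is instantaneous and perfect, which real deployments do not achieve.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def availability_from_nines(nines: int) -> float:
    """Three nines -> 0.999, five nines -> 0.99999, and so on."""
    return 1 - 10 ** (-nines)


def downtime_minutes_per_year(availability: float) -> float:
    """Expected unavailable minutes per year at a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


def parallel_availability(availability: float, replicas: int) -> float:
    """Availability of N redundant replicas.

    ASSUMPTION: failures are independent and failover is perfect,
    so the group is down only when every replica is down at once.
    """
    return 1 - (1 - availability) ** replicas


for nines in (3, 5, 6):
    a = availability_from_nines(nines)
    print(f"{nines} nines: {downtime_minutes_per_year(a):.2f} minutes of downtime per year")

two_replicas = parallel_availability(availability_from_nines(3), replicas=2)
print(f"two independent three-nines replicas: {two_replicas:.6f}")
```

Under these idealized assumptions, three nines allows roughly 526 minutes of downtime per year, five nines allows about 5.3 minutes, and two independent three-nines replicas reach six nines on paper. Correlated failures, such as a faulty software update applied to every replica, break the independence assumption, which is why software quality matters as much as hardware redundancy.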

Beyond the software, core network reliability relies on a holistic approach to coordinate all network elements, such as NFVI, CloudOS, VNF/CNF, data storage, network management, etc. A well-designed core network architecture should be able to prevent widespread signaling storms, provide storage backup capabilities to ensure critical data is not lost, and have multi-level disaster recovery capabilities, ranging from component-level disaster recovery to data center disaster recovery.
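
One common building block for containing a signaling storm is admission control at the network edge. The Python sketch below uses a token bucket to cap how many registration requests are admitted per second. It is a generic illustration, not a description of any operator's or vendor's implementation; the class name, the rate, and the burst size are all hypothetical.

```python
class TokenBucket:
    """Admit at most `rate_per_sec` requests per second, with bursts up to `burst`."""

    def __init__(self, rate_per_sec: float, burst: int) -> None:
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # timestamp of the previous call, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` is admitted."""
        elapsed = max(0.0, now - self.last)
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical storm: 1,000 terminals try to re-register at the same instant.
bucket = TokenBucket(rate_per_sec=100, burst=50)
accepted = sum(bucket.allow(now=0.0) for _ in range(1000))
print(f"admitted {accepted} of 1000 simultaneous requests")
```

In this example only the burst allowance is admitted at the first instant, and the rest are rejected and must retry later. In practice the rejected terminals also need randomized retry delays, otherwise they simply return together and recreate the storm.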

Artificial Intelligence (AI) based network automation features can be leveraged to monitor network performance, track the health of network components, and flag potential malfunctions. In the event of network problems or a security breach, these features can automatically shift operations from failing components to backups so that service is not interrupted. In addition, automation features can help operators reduce the chance of human error, which is a common cause of network outages.

Even with advanced automation features, sufficient redundancy is always the foundation for network reliability. Independent and redundant network components can deliver network resilience and continuous availability by improving fault tolerance and enabling automatic failover.
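
The idea of automatic failover across redundant components can be sketched in a few lines of Python. This is a deliberately simplified illustration: the `Instance` and `FailoverGroup` names and the threshold of three consecutive failed health checks are hypothetical, and a production system would also have to handle state replication, split-brain scenarios, and traffic draining.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    name: str
    consecutive_failures: int = 0


class FailoverGroup:
    """Redundant instances in priority order; the first healthy one serves traffic."""

    def __init__(self, instances: List[Instance], failure_threshold: int = 3) -> None:
        self.instances = instances
        self.failure_threshold = failure_threshold

    def report(self, name: str, ok: bool) -> None:
        """Record one health-check result for the named instance."""
        for inst in self.instances:
            if inst.name == name:
                inst.consecutive_failures = 0 if ok else inst.consecutive_failures + 1

    def active(self) -> Optional[Instance]:
        """Return the highest-priority instance still below the failure threshold."""
        for inst in self.instances:
            if inst.consecutive_failures < self.failure_threshold:
                return inst
        return None  # every instance is considered down


group = FailoverGroup([Instance("primary"), Instance("standby")])
for _ in range(3):
    group.report("primary", ok=False)  # three failed health checks in a row
print(group.active().name)
```

After three consecutive failed checks on the primary, the group selects the standby; a single successful check resets the counter and the primary takes over again. Requiring several consecutive failures is a design choice that trades slower detection for fewer spurious failovers.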

Today, some telecom operators are migrating their core networks to public cloud platforms. For example, AT&T and Microsoft have initiated a collaboration to leverage Microsoft's hybrid cloud to support AT&T’s 5G core network workloads. Even though public cloud hyperscalers claim they can also support four-nines or even five-nines reliability, the record of massive cloud platform outages, such as the AWS outage in November 2021 and the Microsoft cloud outage in April 2021, shows that implementing carrier-grade availability and reliability on public cloud platforms remains uncertain and challenging. Telecom operators will have to work closely with hyperscalers to improve reliability by influencing hardware and software design, components, tools, and even operational processes. Many of these efforts will, however, increase costs, which runs contrary to the initial motivation for migrating to public clouds.

In summary, network reliability is the foundation for telecom operators to gain customers’ trust in the connected world. As its name indicates, a core network is at the core of network reliability. Telecom operators must continuously invest in advanced technologies and architecture to improve and guarantee network reliability, particularly to build a reliable and robust core network. 
