Huawei Cloud has developed RD-Probe, a network monitoring tool that, when deployed in three of its regions, surfaced more infrastructure issues than existing tools and caught problems that human operators had previously missed.
The tool, detailed in a paper presented at the SIGCOMM 2024 conference in Sydney, addresses the challenges of network monitoring at hyperscale. Huawei Cloud’s data center networks include over 100,000 switches and a million servers. Monitoring infrastructure of that size is difficult, especially in virtualized environments where load balancing relies on randomness.
RD-Probe monitors each physical Layer 2 port to observe the runtime status of switch fabrics, which gives it more coverage than virtual network monitoring. This approach helps eliminate blind spots where critical issues could go undetected.
The tool integrates with existing monitoring architectures by modifying only the task generation and data processing modules. It employs a two-phase probing scheme that combines random and deterministic probing to ensure comprehensive monitoring coverage.
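The article does not describe how the two phases work internally, so the following is only a minimal sketch of one plausible reading: a random phase in which probes land on whichever ports the load balancer happens to pick, followed by a deterministic phase that explicitly targets every port still unobserved. The function names, the port labels, and the ordering of the phases are all assumptions made for illustration, not details taken from the RD-Probe paper.

```python
import random


def random_phase(ports, num_probes, rng):
    """Assumed phase 1: each probe lands on a port picked at random,
    mimicking load balancing that the prober does not control."""
    covered = set()
    for _ in range(num_probes):
        covered.add(rng.choice(ports))
    return covered


def deterministic_phase(ports, covered):
    """Assumed phase 2: list every port the random phase never observed,
    so that a probe can be steered at each one explicitly."""
    return [port for port in ports if port not in covered]


def two_phase_probe(ports, num_random_probes, seed=0):
    """Run both hypothetical phases and return (covered, targeted)."""
    rng = random.Random(seed)
    covered = random_phase(ports, num_random_probes, rng)
    targeted = deterministic_phase(ports, covered)
    return covered, targeted


# Hypothetical port labels for a single 64-port switch.
ports = [f"sw1/port{i}" for i in range(64)]
covered, targeted = two_phase_probe(ports, num_random_probes=64)

print(f"random phase observed {len(covered)} of {len(ports)} ports")
print(f"deterministic phase targets the remaining {len(targeted)} ports")
```

The point of the sketch is that random probing alone tends to revisit some ports while leaving others untouched, whereas a deterministic follow-up guarantees that the union of both phases reaches every port.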
A dedicated 16-node cluster generates the probes, and a 48-node streaming cluster processes the data. Within a month of deployment, RD-Probe uncovered numerous issues, including:
- A faulty chip in the line processing unit of a core switch that dropped incoming packets without reporting the issue to the control plane.
- Flawed load balancing that directed traffic through the local port instead of stack cables.
- Incorrect BGP route values that steered traffic onto a slow path.
Huawei’s researchers reported that RD-Probe improved monitoring coverage from 80.9% to 99.5%, revealed several previously unnoticed issues, and tolerated numerous faults. The tool is set for broader implementation across more cloud regions.
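To put the coverage figures in perspective, it helps to look at the share that remains unobserved rather than the share that is covered. The short calculation below uses only the two percentages reported above; the article does not say exactly what unit coverage is measured in, so the result is best read as a relative comparison.

```python
# Coverage figures reported in the article, in percent.
coverage_before = 80.9
coverage_after = 99.5

# Share of the monitored surface that no probe observes.
uncovered_before = 100 - coverage_before
uncovered_after = 100 - coverage_after

# How many times smaller the unobserved share became.
reduction_factor = uncovered_before / uncovered_after

print(f"unobserved before: {uncovered_before:.1f}%")
print(f"unobserved after:  {uncovered_after:.1f}%")
print(f"reduction factor:  {reduction_factor:.1f}x")
```

The unobserved share falls from 19.1% to 0.5%, which makes the blind spot roughly 38 times smaller.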
However, the paper’s authors noted that RD-Probe does not consider North-South traffic and can’t filter out server-side failures, which remain on Huawei’s to-do list.
RD-Probe represents a significant advancement in network monitoring, offering enhanced detection capabilities that can preempt service degradation.
Abdul Rehman