Huawei Cloud’s RD-Probe: Precision Network Monitor Detects Single Faulty Chip Impact

Updated on August 7, 2024

Huawei Cloud has developed RD-Probe, a network monitoring tool that, when deployed in three of its regions, surfaced more infrastructure issues than existing tools and caught problems that manual troubleshooting had missed.

This tool, detailed in a paper presented at the SIGCOMM 2024 conference in Sydney, addresses the challenges of network monitoring at hyperscale. Huawei Cloud’s data center networks include over 100,000 switches and a million servers. Monitoring such a vast infrastructure is challenging, especially in virtualized environments that use randomness for load balancing.

RD-Probe monitors each physical Layer 2 port to observe the runtime status of switch fabrics, which gives it broader coverage than monitoring at the virtual network level. This approach helps eliminate blind spots that could hide critical issues.

The tool integrates with existing monitoring architectures, modifying only the task generation and data processing modules. It employs a two-phase probing scheme that combines random and deterministic probes to ensure comprehensive monitoring coverage.
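
The two-phase idea can be illustrated with a toy model. The sketch below is not Huawei's implementation: the port names, the `ecmp_port` hash, and the `two_phase` helper are all hypothetical. It assumes a switch picks an egress port by hashing a probe's source port, so a handful of random probes leaves some ports unvisited, and a deterministic second phase then searches for source ports that reach each remaining port.

```python
import random

# Hypothetical physical ports on a single switch (names are illustrative only).
PORTS = [f"port-{i}" for i in range(16)]

def ecmp_port(src_port: int) -> str:
    """Toy stand-in for ECMP hashing: map a source port to one of 16 egress ports."""
    # Multiplicative hash; the top 4 bits of the 32-bit product select the port.
    return PORTS[((src_port * 2654435761) & 0xFFFFFFFF) >> 28]

def random_phase(num_probes: int, rng: random.Random) -> set[str]:
    """Phase 1: send probes with random source ports and record the ports they hit."""
    return {ecmp_port(rng.randrange(1024, 65536)) for _ in range(num_probes)}

def deterministic_phase(missing: set[str]) -> dict[str, int]:
    """Phase 2: for each uncovered port, find a source port that is hashed onto it."""
    plan: dict[str, int] = {}
    for src_port in range(1024, 65536):
        port = ecmp_port(src_port)
        if port in missing and port not in plan:
            plan[port] = src_port
        if len(plan) == len(missing):
            break
    return plan

def two_phase(num_random: int, seed: int):
    """Run both phases and return (coverage after phase 1, after phase 2, plan)."""
    covered = random_phase(num_random, random.Random(seed))
    coverage_random = len(covered) / len(PORTS)
    plan = deterministic_phase(set(PORTS) - covered)
    covered |= set(plan)
    return coverage_random, len(covered) / len(PORTS), plan

if __name__ == "__main__":
    before, after, plan = two_phase(num_random=8, seed=0)
    print(f"coverage after random phase: {before:.1%}")
    print(f"coverage after deterministic phase: {after:.1%}")
    print(f"targeted probes added: {len(plan)}")
```

With only eight random probes, at most half of the 16 ports can be covered, so the second phase always has work to do in this model. A real deployment would have to account for multi-hop paths and vendor-specific hashing, which this sketch ignores.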

A dedicated 16-node cluster generates the probes, and a 48-node streaming cluster processes the results. Within a month of deployment, RD-Probe uncovered numerous issues, including:

  • A faulty chip in the line processing unit of a core switch, which dropped incoming packets without reporting the issue to the control plane.
  • Flawed load balancing that directed traffic through the local port instead of stack cables.
  • Incorrect BGP route values, leading traffic onto a slow path.
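
Faults like the silent chip failure do not show up in device logs, so they have to be inferred from probe outcomes. The article does not describe how RD-Probe localizes a fault; the sketch below shows one simple, assumed approach: count losses per physical port across many probe paths and flag ports where nearly every probe is lost. The port names, the `suspect_ports` function, and its thresholds are hypothetical.

```python
from collections import defaultdict

def suspect_ports(results, threshold=0.9, min_probes=3):
    """Return ports whose probe loss rate is at least `threshold`.

    `results` is an iterable of (path, delivered) pairs, where `path` is the
    list of physical ports a probe traversed and `delivered` is a bool.
    Ports seen by fewer than `min_probes` probes are ignored as too noisy.
    """
    sent = defaultdict(int)
    lost = defaultdict(int)
    for path, delivered in results:
        for port in path:
            sent[port] += 1
            if not delivered:
                lost[port] += 1
    return sorted(
        port for port in sent
        if sent[port] >= min_probes and lost[port] / sent[port] >= threshold
    )

# Illustrative data: every probe crossing core1/p7 is lost, while the leaf
# ports on those same paths also carry probes that are delivered via core1/p8.
EXAMPLE_RESULTS = [
    (["leafA/p1", "core1/p7", "leafB/p1"], False),
    (["leafA/p1", "core1/p8", "leafB/p2"], True),
    (["leafA/p2", "core1/p7", "leafB/p2"], False),
    (["leafA/p2", "core1/p8", "leafB/p1"], True),
    (["leafA/p1", "core1/p7", "leafB/p2"], False),
    (["leafA/p2", "core1/p8", "leafB/p2"], True),
    (["leafA/p1", "core1/p8", "leafB/p1"], True),
]

if __name__ == "__main__":
    print(suspect_ports(EXAMPLE_RESULTS))
```

Because healthy ports share paths with the faulty one, they also see some loss; the high threshold is what separates the port that loses everything from ports that merely sit next to it.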

Huawei’s researchers reported that RD-Probe improved monitoring coverage from 80.9% to 99.5%, revealing several previously unnoticed issues and tolerating numerous faults. The tool is set for broader implementation across more cloud regions.


However, the paper’s authors noted that RD-Probe does not consider North-South traffic and can’t filter out server-side failures, which remain on Huawei’s to-do list.

RD-Probe represents a significant advancement in network monitoring, offering enhanced detection capabilities that can preempt service degradation.

Abdul Rehman

Abdul is a tech-savvy, coffee-fueled, and creatively driven marketer who loves keeping up with the latest software updates and tech gadgets. He's also a skilled technical writer who can explain complex concepts simply for a broad audience. Abdul enjoys sharing his knowledge of the Cloud industry through user manuals, documentation, and blog posts.
