Frogfoot's New Year's Internet Outage

Possibly the longest internet outage in the past ten years raises questions about the quality of service in an industry which has seen competition stagnate.

By Newsroom

Published January 3, 2025


Frogfoot, a local fibre network operator owned by Vox Telecom, suffered a 57-hour outage starting on New Year's Eve, leaving all its Western Cape customers without a usable internet connection. The problem ended up being diagnosed on a public forum before the company's own staff managed to figure it out themselves.

Systemic, and sector-wide

This issue is not unique to Frogfoot. The problems in the South African fibre sector have become deep and structural, as there is little competition or meaningful industry oversight. That is hardly unusual: South Africa is a fairly cartelised economy.

But with telecoms, the ordinary competitive market conditions rarely apply, even at the best of times. A recent mapping study showed that 80% of home fibre customers in the Western Cape have only one fibre provider to choose from.

These conditions have created a sector-wide issue, with private companies that were once the envy of the first world for their rapid early rollout of fibre internet now falling behind the performance of even the rusting hulk known as Telkom.

Frogfoot scored 5.4/10 in a recent industry survey. One has to wonder: if there were real competition and customers had more choices, could Frogfoot have ignored its network stability problem for more than a year?

The outage

The Frogfoot outage ran from 23:35 on 31 December to 08:40 on 3 January, just over 57 hours, and was possibly the longest outage on any of the major fibre networks in the last ten years.
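For readers who want to check the figure, here is a minimal sketch of the arithmetic. The years are our assumption, inferred from the publication date, and both timestamps are treated as local time in the same time zone.

```python
from datetime import datetime

# Start and end times as reported above. The years (2024/2025) are an
# assumption inferred from the article's publication date.
start = datetime(2024, 12, 31, 23, 35)
end = datetime(2025, 1, 3, 8, 40)

outage = end - start

# Split the total duration into whole hours and leftover minutes.
hours, remainder = divmod(int(outage.total_seconds()), 3600)
minutes = remainder // 60

print(f"Outage lasted {hours} h {minutes} min")  # Outage lasted 57 h 5 min
```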

After three days of flickering and lagging connections, we wanted to understand what had happened. It turns out that details of the incident were available on a public forum called ZANOG while engineers at Frogfoot were still running around in a panic. From the person who found the problem:

"Long and the short was: the link between Tableview and Blouberg was flapping. When it came back there was a problem with MLAG which caused a flood which the Tableview switch was supposed to drop but it did not and just rebroadcast the flood. Then the link flapped again and the process repeated."

In short, an unstable backhaul link in Tableview took down the whole of the Western Cape including areas as far as George. After repeatedly contacting the engineers at Frogfoot, the forum poster managed to direct their attention from the core to the faulty node in Tableview.
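To illustrate why a flood that is not dropped never dies out on its own, here is a toy model. It is a sketch under stated assumptions, not a reconstruction of Frogfoot's network: the switch names "A", "B" and "C" are hypothetical, every switch simply floods a broadcast out of all ports except the one it arrived on, and nothing such as spanning tree or storm control intervenes.

```python
from collections import Counter


def frames_in_flight(links, source, steps):
    """Count copies of one broadcast frame still travelling after each step.

    Toy model: every switch floods the frame out of all ports except the
    one it arrived on, and no mechanism ever drops it.
    """
    # Build an adjacency list from the undirected links.
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    # A frame is keyed by (switch it is at, neighbour it arrived from).
    frames = Counter({(source, None): 1})
    history = []
    for _ in range(steps):
        next_frames = Counter()
        for (node, came_from), count in frames.items():
            for peer in neighbours[node]:
                if peer != came_from:
                    next_frames[(peer, node)] += count
        frames = next_frames
        history.append(sum(frames.values()))
    return history


# Hypothetical topologies: a loop-free chain, and the same chain closed
# into a ring by one extra link.
chain = [("A", "B"), ("B", "C")]
ring = chain + [("C", "A")]

print(frames_in_flight(chain, "A", 4))  # [1, 1, 0, 0]
print(frames_in_flight(ring, "A", 4))   # [2, 2, 2, 2]
```

In the loop-free chain the broadcast dies once it reaches the far end. In the ring the copies circulate indefinitely, and on real hardware each pass consumes switch CPU and link capacity.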


As it turns out, Frogfoot has been suffering from ongoing short outages for at least a year. Usually customers notice about 10 minutes of downtime a few times a week. These outages have been linked to software bugs in their network hardware and CPU constraints in the line cards in their core routers.

Timeline

When the December 31st outage started, there was also an outage at Frogfoot’s Tableview node, but because of the history of short outages, Frogfoot started troubleshooting their core routers, suspecting a software bug.

Day 1: Frogfoot rebooted their core routers and swapped a line card. They then informed customers that the replacement line card was also faulty and that a different one would be installed. At this point some industry sources were already wondering how both core routers and a replacement line card could fail in one day.

Day 2: As we understand it, Juniper (the equipment vendor) was now doing live troubleshooting on the Frogfoot routers, as they could not reproduce the problems in their lab. A custom software image was prepared, with an estimated upgrade time of an hour per core router. It’s unclear whether these custom images were ever loaded, as Frogfoot then suggested that there was a lower-level problem with the line cards. Frogfoot doubled their core routers to four devices.

Day 3: An ISP found that every time the Tableview backhaul link flapped, their MikroTik router logged the following entry: "bridge RX looped packet - ff:ff:ff:ff:ff:ff ETHERTYPE 0x0806 (ARP)"
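For operators who want to watch for the same symptom, the sketch below picks such entries out of a log. It matches the text exactly as quoted above. Real router log lines may carry extra fields such as an interface name or a second MAC address, so the pattern and the function name `parse_looped_packet` are our assumptions and not a documented format.

```python
import re

# Pattern for the log text as quoted in this article. This is an assumption:
# real log lines may include additional fields that this does not capture.
LOOP_RE = re.compile(
    r"bridge RX looped packet - (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}) "
    r"ETHERTYPE 0x(?P<ethertype>[0-9a-f]{4})",
    re.IGNORECASE,
)

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"  # Ethernet broadcast address
ETHERTYPE_ARP = 0x0806               # EtherType value assigned to ARP


def parse_looped_packet(line):
    """Return details of a looped-packet log entry, or None if no match."""
    match = LOOP_RE.search(line)
    if match is None:
        return None
    mac = match.group("mac").lower()
    ethertype = int(match.group("ethertype"), 16)
    return {
        "mac": mac,
        "is_broadcast": mac == BROADCAST_MAC,
        "is_arp": ethertype == ETHERTYPE_ARP,
    }


entry = parse_looped_packet(
    "bridge RX looped packet - ff:ff:ff:ff:ff:ff ETHERTYPE 0x0806 (ARP)"
)
print(entry)
```

A broadcast ARP frame arriving back at the router that sent it is a classic sign of a layer 2 loop somewhere upstream, which is what pointed the ISP towards Tableview rather than the core.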

The ISP then asked Frogfoot to look for CPU spikes and traffic floods on ports at their various Western Cape nodes, with a suspicion that the earlier Tableview outage was related. The problem was identified as an Ethernet loop at the Tableview node, but full details about the cause of this are not public.

Frogfoot's official statement reads:

“Following the reboot and the addition of the core routers, the network has shown consistent stability across the region. These measures will prevent the software bug from causing widespread impact while we continue to collaborate with our hardware vendor. The vendor is actively replicating the issue in their labs to modify the code and permanently resolve it.”

A chronic problem

We have received reports that Frogfoot customers were not treated equally during the outage: it seems customers of Vox (Frogfoot's parent company) and Frogfoot business fibre customers were much less affected. This goes against the business ethics of running an ‘Open Access’ network.

Forum discussions generated a number of technical questions. The running theme was chronic incompetence:

  • Why is there no SLA (service level agreement) for Frogfoot home services? Their SLA states: “best effort” – access service availability guarantees are not available on this service.
  • Why are the Frogfoot core routers at their end-of-life stage? (MX10k3 is no longer in production)
  • Why are they using a router and not a switch?
  • How did this happen in their ‘network freeze period’ when they are not supposed to be making changes?
  • How has this been going on for more than a year?
  • Do they have any plans to fix this, or are they going to keep kicking the can down the road?
  • Was the Tableview outage on the 31st related?
  • Why did they not notice the traffic flood at the Tableview node when the problem started?
  • Was it a software bug or a loop? - Probably a loop, which triggered the core routers to stop forwarding traffic.

The latest official story from Frogfoot - blaming ‘the software bug’ - doesn't seem good enough. If South Africa is to remain a viable economy, it needs a viable modern internet system.

If Frogfoot ends up sharing a detailed root cause analysis of the incident, we are open to including it as an update to the article.
