The Cloudflare Logging Catastrophe
How a blank configuration file triggered a 40X buffer explosion that lost 2.5 trillion logs
Last week was Honeycomb's database mess. This week it's Cloudflare, because apparently logging infrastructure just decides to quit sometimes.
Cloudflare lost 2.5 trillion logs on November 14th. Not gradually, not mysteriously. They lost them because of a blank file and a failsafe from 2019 that everyone forgot existed.
Background
Cloudflare processes 50 trillion events daily through this pipeline:
Logfwdr (forwards from edge)
Logreceiver (sorts by type)
Buftee (buffers)
Logpush (delivers)
Simple. Each service does one thing. 4.5 trillion customer logs daily, no problem.
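Cloudflare hasn't published the code, so here's a toy sketch of that shape in Go, with names I made up: four stages, each doing its one job, handing events down a channel.

```go
// A toy sketch of the four-stage shape described above, with invented names.
// None of this is Cloudflare's actual code.
package main

import "fmt"

type Event struct {
	Customer string
	Kind     string // e.g. "http_requests", "firewall_events"
	Payload  string
}

// stage wraps a per-event transform into one channel-to-channel pipeline step.
func stage(in <-chan Event, fn func(Event) Event) <-chan Event {
	out := make(chan Event)
	go func() {
		defer close(out)
		for ev := range in {
			out <- fn(ev)
		}
	}()
	return out
}

func main() {
	edge := make(chan Event, 1)
	edge <- Event{Customer: "acme", Kind: "http_requests", Payload: "GET /"}
	close(edge)

	forwarded := stage(edge, func(e Event) Event { return e })   // logfwdr: forward from the edge
	sorted := stage(forwarded, func(e Event) Event { return e }) // logreceiver: sort by type
	buffered := stage(sorted, func(e Event) Event { return e })  // buftee: buffer per customer
	for ev := range buffered {                                   // logpush: deliver to the customer
		fmt.Println("deliver", ev.Customer, ev.Kind, ev.Payload)
	}
}
```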
In 2019 someone decided that if Logfwdr lost its config, it should fail open: forward everything rather than nothing. Made sense then. Cloudflare was smaller. The blast radius was manageable.
By 2024, that decision was still there in the code. Nobody remembered it. Nobody checked if it still made sense at current scale. The system worked perfectly so why would you?
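The postmortem describes the behavior, not the code, but a fail-open check like that is easy to sketch. Everything below is hypothetical, names and all:

```go
// Hypothetical sketch of the 2019-era fail-open decision, not Logfwdr's code.
package logfwdr

type Config struct {
	// Customers that have log forwarding configured.
	Customers []string
}

// shouldForward decides whether to forward logs for a given customer.
func shouldForward(cfg Config, customer string) bool {
	// The failsafe: an empty customer list "can't be right", so fail open
	// and forward for everyone rather than silently drop logs.
	// Reasonable at 2019 scale; a 40x blast radius at 2024 scale.
	if len(cfg.Customers) == 0 {
		return true
	}
	for _, c := range cfg.Customers {
		if c == customer {
			return true
		}
	}
	return false
}
```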
What Happened
11:43 UTC, November 14th. A config generator hits a bug and produces an empty file. Not corrupted, not malformed. Just... empty. Like someone hit Ctrl+S on a blank document.
Logfwdr loads this empty config. Sees zero customers configured. The 2019 failsafe kicks in: "Zero can't be right, send EVERYTHING."
Instead of forwarding logs for 100,000 configured customers, it starts forwarding for every Cloudflare customer that exists. All of them.
The team catches this in five minutes. Reverts the change, fixes the config. Five minutes. Should be fine, right?
Those five minutes had already started something unstoppable.
The Buffer Explosion
Buftee starts creating buffers for every customer whose logs are arriving. That's its job. Create buffers on demand, manage them, keep data flowing.
Before: 1 million buffers globally. During: 40 million buffers. After: Complete system seizure.
Think about this. You're managing 1 million things. Suddenly you need to manage 40 million. Not gradually. Not with a warning. Right now. The system didn't crash; crashing would have been better. It just froze, consuming resources, responding to nothing.
Here's what makes me angry: Buftee HAD protections for this. Buffer limits, resource caps, circuit breakers. All built in. All tested.
Never configured.
After five years of smooth running, those safety features became theoretical. Sure, they're in the docs. Probably mentioned in onboarding. But when you've never needed them, when nothing has ever gone wrong, they're just... there. Like that fire extinguisher you walk past every day.
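No idea how Buftee's knobs are actually spelled, but the pattern is painfully common: the limit exists, and the zero value means unlimited. A hedged sketch:

```go
// Hypothetical sketch of "protections exist, defaults are off". Not Buftee's
// real configuration; the names are invented.
package buffering

import "errors"

type Limits struct {
	MaxBuffers int // 0 means unlimited -- the dangerous default
}

type Buffer struct{ customer string }

type Manager struct {
	limits  Limits
	buffers map[string]*Buffer
}

func NewManager(l Limits) *Manager {
	return &Manager{limits: l, buffers: make(map[string]*Buffer)}
}

var ErrTooManyBuffers = errors.New("buffer limit reached, shedding load")

// GetOrCreate returns the customer's buffer, creating it on demand.
// With Limits left at the zero value, nothing stops 1M buffers becoming 40M.
func (m *Manager) GetOrCreate(customer string) (*Buffer, error) {
	if b, ok := m.buffers[customer]; ok {
		return b, nil
	}
	if m.limits.MaxBuffers > 0 && len(m.buffers) >= m.limits.MaxBuffers {
		return nil, ErrTooManyBuffers // the circuit breaker only fires if someone set the limit
	}
	b := &Buffer{customer: customer}
	m.buffers[customer] = b
	return b, nil
}
```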
The Hidden Bug
Investigation finds something else. Something that's been there for YEARS.
Once Logfwdr was told to stop forwarding logs, it wouldn't start again without a restart. The feature flag was basically one-way: you could tell it to stop, but telling it to start did nothing unless you restarted the service.
How did this go unnoticed? Because Cloudflare deploys constantly. Every config change = deployment. Every deployment = restart. The bug existed but normal operations kept fixing it automatically.
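The postmortem gives the symptom, not the cause, but here's one common way a flag ends up one-way like that (again, a sketch with invented names): the "stop" path gets checked at runtime, while the forwarder only ever gets built at startup.

```go
// Hypothetical sketch of a one-way feature flag, not Logfwdr's code.
package forwarder

import "sync/atomic"

// forwardingEnabled is flipped at runtime by the feature-flag system.
var forwardingEnabled atomic.Bool

type Forwarder struct{ stopped bool }

// Send forwards a log payload downstream.
func (f *Forwarder) Send(payload []byte) {
	if f.stopped {
		return // once stopped, nothing in the hot path ever un-stops it
	}
	if !forwardingEnabled.Load() {
		f.stopped = true // "stop" is honored immediately at runtime
		return
	}
	// ... forward payload to logreceiver here ...
}

// New is the only place a Forwarder is created, so re-enabling the flag does
// nothing until the process restarts. Constant deployments restarted it all
// the time, which is why the bug stayed invisible for years.
func New() *Forwarder {
	return &Forwarder{stopped: !forwardingEnabled.Load()}
}
```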
During the incident they paused deployments (standard procedure, don't change things while everything's burning). Except pausing deployments removed the accidental fix. The thing meant to prevent damage made recovery impossible.
This is insane. Their system had evolved to depend on constant deployments. Not by design. Not documented. Just emergent behavior from years of continuous deployment accidentally working around a bug nobody knew existed.
Recovery
Buftee is completely dead. Customer logs piling up. They have to make the call.
They pull the plug. Full cluster restart. Accept that 2.5 trillion events are gone forever.
Even after the hard reset, things are still broken. The feature flag bug means services won't forward logs. They have to fake database timestamps and restart services in a specific sequence to avoid retriggering the cascade. The whole recovery is duct tape and prayer.
What This Actually Means
Cloudflare did everything right. Failsafes, monitoring, gradual rollouts, quick rollbacks. Every best practice. The failure came from interactions nobody could predict because they required multiple unrelated things to fail in exactly the right way.
The system didn't break because something went wrong. It broke because everything worked exactly as designed, just not how anyone expected when combined.
Your systems aren't just running code anymore. They're accumulating behaviors. Every deployment, every config change, every operational pattern becomes part of how the system actually works, whether you intended it or not.
That 2019 failsafe aged into a liability nobody revisited. Those constant deployments quietly papered over the feature flag bug and became a critical dependency nobody knew about. Those safety features sat unused until desperately needed.
We're not operating systems anymore. We're negotiating with them. Trying to understand what behaviors have emerged from years of accumulated decisions. Modern infrastructure evolves its own patterns, its own dependencies. Things nobody designed but somehow became critical.
Cloudflare now does "overload tests" where they deliberately trigger cascades. They're basically archaeologists excavating their own infrastructure, trying to map what it actually does versus what it was designed to do.
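I have no idea what their overload tests actually look like, but the spirit fits in a dozen lines: throw 40x the normal customer count at the buffer layer and assert that it sheds load instead of seizing. A sketch, reusing the hypothetical Manager from earlier:

```go
// Hypothetical overload test against the buffering.Manager sketched above.
package buffering

import (
	"fmt"
	"testing"
)

func TestOverloadShedsInsteadOfSeizing(t *testing.T) {
	const normal = 1_000         // stand-in for "configured customers"
	const overload = 40 * normal // the cascade: every customer at once

	m := NewManager(Limits{MaxBuffers: 2 * normal}) // limits actually configured this time

	shed := 0
	for i := 0; i < overload; i++ {
		if _, err := m.GetOrCreate(fmt.Sprintf("cust-%d", i)); err != nil {
			shed++
		}
	}
	if shed != overload-2*normal {
		t.Fatalf("expected %d requests shed, got %d", overload-2*normal, shed)
	}
}
```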
The Cloudflare incident shows our systems have grown beyond our ability to fully understand them. We can know every component perfectly and still be surprised by their interactions. We can monitor everything and still be blind to the patterns that matter.
Sources
From Cloudflare's postmortem, their status page, and some academic papers on cascade failures I mostly skimmed.
Check your buffer limits. Actually configure them this time.

