Cloudflare says a bad update broke its logging systems, causing data loss
- Cloudflare confirms that the update caused customer log data to be lost
- The incident lasted a total of 3.5 hours, resulting in a 55% loss of logs
- Although the misconfiguration was fixed within five minutes, it set off cascading problems
Cloudflare has confirmed that a bad software update recently caused its customers to lose log data. The incident, which lasted approximately 3.5 hours, resulted in more than half (55%) of the logs being lost.
Embarrassed by the error, the Californian company apologized to customers in a blog post, promising that a similar problem will not occur again.
Cloudflare also noted that failures within large-scale systems are inevitable, but subsystems must be built to protect themselves in the event of broader problems.
Cloudflare admits to losing log data
The problem originated with Cloudflare’s Logpush service, which bundles logs from its global network and sends them to customers for compliance, debugging and analysis. A routine update to support a new data set introduced a misconfiguration in the service, which triggered the incident.
The company says the configuration error effectively told one of its internal services, Logfwdr, that no customers had logs configured to be sent, which led to the loss. Although technicians identified and fixed the error within five minutes, it had already triggered a deeper bug.
A built-in fail-safe, which forwards logs for all customers rather than only those with active Logpush jobs, ultimately overwhelmed the system. The buffering system, Buftee, had to handle 40 times its usual load, which made it unresponsive.
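The fail-safe behavior described above can be illustrated with a minimal sketch. It is not Cloudflare's code: the function names, the customer identifiers and the customer counts are hypothetical, chosen only so that the empty-configuration case produces the 40-fold increase mentioned in the report.

```python
from typing import Set


def customers_to_forward(active_jobs: Set[str], all_customers: Set[str]) -> Set[str]:
    """Pick whose logs to forward (hypothetical model of a fail-open rule)."""
    if not active_jobs:
        # Empty configuration: fail open and forward logs for every customer.
        return set(all_customers)
    # Normal case: forward only for customers with an active job.
    return active_jobs & all_customers


def load_factor(selected: Set[str], normal: Set[str]) -> float:
    """Ratio of customers being forwarded to the usual number."""
    return len(selected) / len(normal)


# Illustrative numbers only: 400 customers, 10 of them with active jobs.
all_customers = {f"cust-{i}" for i in range(400)}
active_jobs = {f"cust-{i}" for i in range(10)}

normal = customers_to_forward(active_jobs, all_customers)
failed_open = customers_to_forward(set(), all_customers)
print(load_factor(failed_open, normal))
```

In this toy model an empty job list is indistinguishable from a missing one, so the downstream buffer suddenly receives traffic for every customer at once.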
“We accept that errors and misconfigurations are inevitable. All our systems at Cloudflare must respond predictably and neatly to this,” the company writes.
Looking ahead, Cloudflare has committed to regular overload testing that simulates this kind of failure, to build confidence that its systems can withstand similar bugs in future.