Introduction to the ClickHouse Bottleneck
Cloudflare's billing pipeline recently suffered a significant performance degradation, with jobs struggling to meet their daily deadlines. The problem emerged weeks after a major redesign of one of the company's largest ClickHouse tables. The redesign added a new column to the partitioning key to allow more granular data retention policies. Despite rigorous engineering review, the change introduced a bottleneck the team had not seen before.
Initial investigation of common performance metrics, such as I/O usage, memory consumption, rows scanned, and parts read, revealed no abnormalities. Further analysis uncovered an unexpected source of the slowdown: lock contention during query planning. This finding exposed a fundamental inefficiency within ClickHouse's internal mechanisms.
Scaling Challenges in ClickHouse
Cloudflare employs ClickHouse to manage over a hundred petabytes of data distributed across dozens of clusters. To simplify data ingestion for hundreds of internal teams, the organization built a system called ReadyAnalytics. This system aggregates data into a single massive table, disambiguating datasets through namespaces and adhering to a standard schema.
The schema design, including a primary key composed of namespace, indexID, and timestamp, was optimized for query performance. However, as the table grew to over 2 PiB by late 2024, with millions of rows ingested per second, limitations in the original retention policy became apparent. A one-size-fits-all retention strategy could not meet the diverse needs of the teams sharing the table, which made architectural changes necessary.
Redesign and the Onset of Issues
The table redesign introduced a new column into the partitioning key, enabling per-namespace retention policies. While this approach addressed the retention issue, it also introduced a subtle yet severe bottleneck. Lock contention during query planning emerged as the root cause, a scenario not previously encountered in Cloudflare's usage of ClickHouse.
This contention arose because the added complexity in partitioning increased synchronization demands during query planning. The issue was particularly pronounced under high load, where multiple queries competed for planning resources, amplifying delays and disrupting workflows.
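To make the retention idea concrete, here is a minimal Python sketch of how an extra partition-key column can enable per-namespace retention. The article does not say which column was added, so the `retention_days` column, the namespace names, and the policy values below are all hypothetical. The sketch only models the bookkeeping and is not ClickHouse code.

```python
from datetime import date, timedelta
from typing import Dict, List, NamedTuple


class Partition(NamedTuple):
    retention_days: int  # hypothetical extra partition-key column
    day: date            # the pre-existing time-based component (assumed daily)


# Hypothetical per-namespace retention policy, in days.
RETENTION_DAYS: Dict[str, int] = {"billing": 90, "debug_logs": 7}
DEFAULT_RETENTION_DAYS = 30


def partition_for(namespace: str, ts_day: date) -> Partition:
    """Map a row to its partition: rows with different retention never share one."""
    return Partition(RETENTION_DAYS.get(namespace, DEFAULT_RETENTION_DAYS), ts_day)


def expired(partitions: List[Partition], today: date) -> List[Partition]:
    """Return partitions whose day is older than their own retention window."""
    return [p for p in partitions if p.day < today - timedelta(days=p.retention_days)]


if __name__ == "__main__":
    today = date(2025, 1, 31)
    parts = [
        partition_for("billing", date(2025, 1, 1)),
        partition_for("debug_logs", date(2025, 1, 1)),
        partition_for("debug_logs", date(2025, 1, 30)),
    ]
    for p in expired(parts, today):
        print("drop", p)
```

Because each partition holds rows of a single retention class, an expired partition can be dropped as a whole. The trade-off in this sketch is that the number of partitions grows with the number of distinct retention values per day.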
Diagnosing Lock Contention
Identifying lock contention required a departure from standard diagnostic practices. Traditional metrics, such as execution times and resource utilization, offered no clues. Instead, the focus shifted to examining query planning under concurrent workloads. Profiling tools and detailed logs revealed excessive time spent acquiring locks during query preparation.
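The kind of measurement described above can be illustrated with a small, self-contained Python sketch. It is not the tooling Cloudflare used, and the names `TimedLock`, `plan_query`, and `run_concurrent` are hypothetical. It shows how wrapping a lock to record time spent waiting for it exposes contention that CPU, I/O, and memory metrics would not show.

```python
import threading
import time


class TimedLock:
    """Wraps threading.Lock and accumulates the time callers spend waiting for it."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.total_wait = 0.0   # seconds spent blocked in acquire()
        self.acquisitions = 0

    def __enter__(self) -> "TimedLock":
        start = time.perf_counter()
        self._lock.acquire()
        waited = time.perf_counter() - start
        # Safe without extra synchronization: we hold the lock while updating.
        self.total_wait += waited
        self.acquisitions += 1
        return self

    def __exit__(self, *exc) -> bool:
        self._lock.release()
        return False


def plan_query(lock: TimedLock, hold_seconds: float) -> None:
    with lock:
        # Stand-in for planning work performed while holding the shared lock.
        time.sleep(hold_seconds)


def run_concurrent(n_queries: int, hold_seconds: float) -> TimedLock:
    """Run n_queries simulated planners against one shared lock."""
    lock = TimedLock()
    threads = [
        threading.Thread(target=plan_query, args=(lock, hold_seconds))
        for _ in range(n_queries)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return lock


if __name__ == "__main__":
    stats = run_concurrent(n_queries=8, hold_seconds=0.05)
    print(f"acquisitions={stats.acquisitions} total_wait={stats.total_wait:.3f}s")
```

In this toy model each thread sleeps while holding the lock, so the machine looks idle even though queries are queuing. That is the same shape of symptom as a slowdown that resource metrics fail to explain.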
This insight highlighted a gap in the system's ability to handle complex partitioning schemes efficiently. Addressing the problem required a deep dive into ClickHouse's internals and involved patching the code responsible for lock management.
Implementing the Fix
The engineering team implemented patches to optimize lock acquisition and release during query planning. By reducing the granularity of locks and introducing asynchronous mechanisms, they mitigated contention without compromising data integrity. These changes were rigorously tested across a range of workloads to ensure stability.
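The article does not show the patches themselves, so the following Python sketch illustrates only one generic way to reduce contention on shared planning state. It is an assumption for illustration and not the actual ClickHouse change. The idea is to hold the lock just long enough to grab an immutable snapshot and do the expensive work outside the critical section. `PartsRegistry` and `select_parts` are hypothetical names.

```python
import threading
from typing import List, Tuple


class PartsRegistry:
    """Toy registry of data-part names guarded by a single lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._parts: Tuple[str, ...] = ()

    def add_part(self, name: str) -> None:
        with self._lock:
            # Copy-on-write: build a new tuple instead of mutating in place,
            # so snapshots already handed out stay valid.
            self._parts = self._parts + (name,)

    def snapshot(self) -> Tuple[str, ...]:
        with self._lock:
            # Critical section is only a reference read.
            return self._parts


def select_parts(registry: PartsRegistry, prefix: str) -> List[str]:
    """Filter parts for a query without holding the registry lock."""
    parts = registry.snapshot()
    return [p for p in parts if p.startswith(prefix)]


if __name__ == "__main__":
    registry = PartsRegistry()
    registry.add_part("billing_2025_01")
    registry.add_part("logs_2025_01")
    print(select_parts(registry, "billing"))
```

With this design, planners scanning a long list of parts no longer block one another or writers for the duration of the scan. The cost is that each write copies the tuple, which suits workloads where reads are far more frequent than writes.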
The resolution not only restored the billing pipeline's performance but also provided valuable insight into scaling ClickHouse for large deployments. The experience underscores the importance of anticipating hidden bottlenecks when making architectural changes to complex systems.