Reducing Downtime in Atlantis with Efficient Kubernetes Configuration

5 April 2026 by TechStora

Understanding the Problem of Prolonged Atlantis Restarts

Frequent restarts of Atlantis, the tool used to manage Terraform changes, caused significant engineering downtime. Each restart took approximately 30 minutes, during which no plans could be run and no infrastructure changes could be applied. At an average of 100 restarts per month, that amounted to roughly 50 hours of blocked work every month. The main triggers were credential rotations and the onboarding of new repositories, tasks that were essential yet disruptive.
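The downtime figure follows from simple arithmetic on the numbers quoted above. A minimal sketch (the function and constant names are illustrative, not taken from the original write-up):

```python
# Monthly downtime implied by the figures quoted in the article.
RESTARTS_PER_MONTH = 100   # average number of restarts per month
MINUTES_PER_RESTART = 30   # approximate duration of one restart


def monthly_downtime_hours(restarts: int, minutes_each: float) -> float:
    """Return the total blocked time per month, in hours."""
    return restarts * minutes_each / 60


hours = monthly_downtime_hours(RESTARTS_PER_MONTH, MINUTES_PER_RESTART)
print(f"{hours:.0f} hours blocked per month")      # 50 hours blocked per month
print(f"{hours * 12:.0f} hours blocked per year")  # 600 hours blocked per year
```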

Analysis traced the issue to a bottleneck in the Kubernetes environment. The PersistentVolume (PV) used by Atlantis had grown to millions of files, and that file count slowed every restart. The slowdown stemmed from a default Kubernetes configuration that became increasingly inefficient as the volume grew. Fixing it was the key to recovering the lost engineering time.

The Role of Inodes in PersistentVolume Performance

The PersistentVolume storing Atlantis' repository state was found to be consuming inodes at a rapid rate. An inode is a data structure that tracks a file system object, such as a file or a directory. Every file or directory created on a disk uses one inode, and on many file systems (ext4, for example) the total number of inodes is fixed when the file system is formatted. In this case, the Ceph storage backend used by the Kubernetes platform applied default formatting parameters, which limited the number of inodes.
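Inode consumption can be inspected from inside a pod or on a node. The sketch below uses Python's standard os.statvfs call (POSIX only); the path passed in is a placeholder, since the article does not state where the Atlantis volume is mounted.

```python
import os
from typing import NamedTuple


class InodeUsage(NamedTuple):
    total: int
    free: int
    used: int
    used_percent: float


def inode_usage(path: str) -> InodeUsage:
    """Report inode consumption for the file system that contains `path`."""
    st = os.statvfs(path)
    total = st.f_files   # total inodes on the file system
    free = st.f_ffree    # inodes still unallocated
    used = total - free
    # Some file systems allocate inodes dynamically and report 0 here;
    # guard against dividing by zero in that case.
    percent = 100.0 * used / total if total else 0.0
    return InodeUsage(total, free, used, percent)


if __name__ == "__main__":
    # "/" is a placeholder; substitute the mount path of the volume.
    usage = inode_usage("/")
    print(f"inodes used: {usage.used} of {usage.total} ({usage.used_percent:.1f}%)")
```

A volume can report plenty of free bytes while used_percent approaches 100, which is the condition described here.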

As the number of files grew, the limited supply of inodes became a bottleneck. Once the inodes were exhausted, no new files could be created even if free space remained, so the storage volume had to be resized. Resizing required a restart of the Atlantis StatefulSet, which was itself slow because of the sheer number of files stored on the disk.

Investigating Inefficiencies in the Restart Process

Restarting Atlantis involved terminating the existing pod and spinning up a new one using the Kubernetes command kubectl rollout restart statefulset. During this period, the system was effectively offline, preventing any infrastructure changes. The issue became more apparent when a restart was triggered to address inode exhaustion, highlighting the inefficiency of the process.
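To put a number on how long the service is unavailable, the restart can be wrapped in a small timing script. This is a sketch only: the StatefulSet name and namespace ("atlantis") and the 45-minute timeout are hypothetical, and the script assumes kubectl is installed and configured for the cluster.

```python
import subprocess
import time


def restart_commands(name: str, namespace: str, timeout: str = "45m") -> list[list[str]]:
    """Build the kubectl invocations for a restart followed by a wait."""
    target = f"statefulset/{name}"
    return [
        # Trigger a rolling restart of the StatefulSet's pods.
        ["kubectl", "rollout", "restart", target, "-n", namespace],
        # Block until the rollout completes or the timeout expires.
        ["kubectl", "rollout", "status", target, "-n", namespace, f"--timeout={timeout}"],
    ]


def timed_restart(name: str, namespace: str, run=subprocess.run) -> float:
    """Run the restart and return the elapsed wall-clock time in seconds."""
    start = time.monotonic()
    for cmd in restart_commands(name, namespace):
        run(cmd, check=True)  # raises CalledProcessError if kubectl fails
    return time.monotonic() - start


# Example (requires cluster access; names are hypothetical):
# seconds = timed_restart("atlantis", "atlantis")
# print(f"restart took {seconds / 60:.1f} minutes")
```

Recording this duration before and after a configuration change gives a direct measure of whether the change helped.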

The investigation revealed that the file-intensive operations performed during a restart slowed down in step with the growing number of files on the volume. Extending alert windows and delaying response times were both considered but dismissed, as they would merely mask the underlying issue without resolving it.

Implementing a Simple Yet Effective Solution

The solution was remarkably straightforward: adjusting the Kubernetes configuration to optimize the filesystem's inode allocation. By modifying a single parameter during the PersistentVolume setup, the team ensured a higher inode count, accommodating the growing file storage requirements of Atlantis. This change eliminated the need for frequent volume resizing and subsequent restarts.
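The article does not name the parameter that was changed. Purely to illustrate how a single format-time setting can govern inode capacity, the sketch below assumes an ext4-style bytes-per-inode ratio (the value passed to mke2fs with -i, commonly 16384 by default). The 100 GiB volume size and the alternative ratio of 4096 are hypothetical.

```python
GIB = 1024 ** 3


def estimated_inodes(volume_bytes: int, bytes_per_inode: int) -> int:
    """Approximate inode count: one inode per `bytes_per_inode` bytes of capacity."""
    return volume_bytes // bytes_per_inode


volume = 100 * GIB  # hypothetical volume size

# Assumed ext4 default ratio versus a denser, hypothetical setting.
for ratio in (16384, 4096):
    print(f"{ratio} bytes per inode: about {estimated_inodes(volume, ratio):,} inodes")
# 16384 bytes per inode: about 6,553,600 inodes
# 4096 bytes per inode: about 26,214,400 inodes
```

Under these assumptions, a quarter of the ratio yields four times the inodes on the same volume, which is the kind of headroom a workload made up of millions of small files needs.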

Applying this adjustment required a careful approach to minimize disruption. The team executed the configuration update during a maintenance window, ensuring that the new settings were applied without impacting ongoing operations. The result was a dramatic reduction in restart times, freeing up valuable engineering hours for more productive tasks.

Quantifying the Gains in Engineering Efficiency

The impact of the solution was evident in faster restarts and reduced downtime. By addressing the root cause rather than its symptoms, the team eliminated the bottleneck that had been consuming engineering time. This translated to roughly 50 hours of regained productivity each month, along with a more reliable operational environment for Terraform management.

This case highlights the importance of scrutinizing default configurations in any IT infrastructure. Small inefficiencies, when compounded over time, can lead to significant resource wastage. Proactively addressing these issues ensures not only better performance but also a stronger return on investment for the organization.