Cloudkeeper is an open-source tool that helps Site Reliability Engineers (SREs) clean up cloud infrastructure “drift,” leaky resources, and services that are triggering quota limits. For example:
- Unattached storage volumes without recent I/O
- Active Amazon CloudWatch alarms for compute instances that no longer exist
- Load balancers without target groups
Given how quickly cloud-native infrastructure changes and how fast cloud providers expand their menu of services, the list of potential use cases is practically limitless.
Cloudkeeper crawls your cloud, indexes and maps resources into a directed graph, and captures resource dependencies. Cloudkeeper ships with both a command-line interface (CLI) and a query language. The CLI makes it easy to search the graph for any asset and build workflows to collect, clean up, and generate metrics. There is also a Prometheus exporter for those metrics.
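To make the graph idea concrete, here is a minimal sketch in plain Python. It does not use Cloudkeeper's actual data model, CLI, or query language; the `Resource` and `ResourceGraph` classes, the resource kinds, and the edge direction (an edge from A to B means "A uses B") are all assumptions made for illustration. It shows how two of the example cleanups above reduce to simple graph searches once dependencies are captured as edges.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Resource:
    """Hypothetical resource node; not Cloudkeeper's real model."""
    id: str
    kind: str


@dataclass
class ResourceGraph:
    """Tiny directed graph. An edge src -> dst means 'src uses dst'."""
    nodes: dict = field(default_factory=dict)       # id -> Resource
    successors: dict = field(default_factory=dict)  # id -> set of ids

    def add(self, resource):
        self.nodes[resource.id] = resource
        self.successors.setdefault(resource.id, set())

    def link(self, src_id, dst_id):
        # Both endpoints must already have been added to the graph.
        if src_id not in self.nodes or dst_id not in self.nodes:
            raise KeyError("both resources must be added before linking")
        self.successors[src_id].add(dst_id)

    def predecessors(self, res_id):
        # Every node that has an edge pointing at res_id.
        return {src for src, dsts in self.successors.items() if res_id in dsts}

    def search(self, kind, predicate):
        # Sorted ids of all resources of `kind` for which predicate holds.
        return sorted(
            r.id for r in self.nodes.values()
            if r.kind == kind and predicate(self, r)
        )


def has_no_target_group(graph, lb):
    return not any(
        graph.nodes[s].kind == "target_group" for s in graph.successors[lb.id]
    )


def is_unattached(graph, volume):
    return not any(
        graph.nodes[p].kind == "instance" for p in graph.predecessors(volume.id)
    )


graph = ResourceGraph()
for res_id, kind in [
    ("lb-1", "load_balancer"), ("lb-2", "load_balancer"),
    ("tg-1", "target_group"), ("i-1", "instance"),
    ("vol-1", "volume"), ("vol-2", "volume"),
]:
    graph.add(Resource(res_id, kind))

graph.link("lb-1", "tg-1")   # lb-1 routes to a target group
graph.link("i-1", "vol-1")   # vol-1 is attached to an instance

orphan_lbs = graph.search("load_balancer", has_no_target_group)
unattached_volumes = graph.search("volume", is_unattached)
print(orphan_lbs, unattached_volumes)
```

In this toy graph, `lb-2` has no target group and `vol-2` is not attached to any instance, so those are the two cleanup candidates the searches return. A real crawler would also record attributes such as last I/O time, which this sketch leaves out.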
There is a plethora of existing cloud management tools that promise to perform those jobs. These tools usually fall into one of two categories: (1) individual asset discovery tools and (2) rule-based cleanup tools.
(1) Discovery tools generate long lists of resources—essentially a prettier view of the data from your cloud console. However, they do not perform cleanup. As a result, the vendors of these tools push professional services promising “optimization opportunities” or “actionable recommendations.”
(2) Cleanup tools, on the other hand, enforce rules and policies in your cloud accounts. But they do not facilitate asset discovery, nor do they aid in pinpointing the root cause of resource leaks.
In short, existing tools provide either reporting or automated cleanup, but not both. It is difficult to take insights from the reporting tools and use them for cleanup. As a result, the amount of infrastructure “drift” continues to grow.
In our experience, SREs get stuck in a constant cycle of trying to determine what is running, whether it should run, who is running it, and whether it can be safely pruned. Many SREs maintain collections of scripts scheduled to execute at regular intervals, which quickly become unwieldy as new scripts are constantly added to handle new edge cases. We have spoken with dozens of SREs over the past six months, all of whom experience these problems in their day-to-day work.
As soon as Lukas generated this image, he realized there was simply no way for an individual engineer to manually manage this infrastructure. Lukas and D2iQ open-sourced Cloudkeeper because it helped them dramatically reduce sprawl, maintain control over all running assets, and free up the SRE teams’ time; they knew it could help others as well.
We have spent the last few months adding support for more services on AWS and GCP. We are keeping Cloudkeeper open source because it helps to address the long tail of cloud services. Closed-source vendors will always have ROI considerations when building out support for a new cloud service. In fact, that is the reason why most SREs we’ve talked to use at least one commercial tool and also maintain a collection of scripts: to address cloud services and use cases not supported by the commercial tools at their disposal. Our goal is to develop Cloudkeeper into an extensible product that makes it easy to add support for new resources or cloud providers.