Infrastructure provisioning can still feel like an endless loop of reinventing the wheel unless you have the right process and capabilities in place. For many organizations, the solution is to adopt Infrastructure-as-Code (IaC) to automate provisioning across multiple public clouds and private data centers.
IaC offers a more modern and efficient way to deliver infrastructure and infrastructure changes without bottlenecks, speeding up application delivery time. But it’s essential to install guardrails within your IaC workflow, since there is no longer a conventional ticket-based IT process to ensure all infrastructure changes follow security best practices.
As HashiCorp’s CISO, I’m tasked with my organization’s security, so enforcing policies with built-in Policy-as-Code guardrails is essential for implementing Infrastructure-as-Code. IaC without standardization and guardrails increases the risk of security incidents due to the lack of governance and difficulty maintaining compliance in an environment where hundreds of developers and operators are provisioning and interacting with dynamic infrastructure daily.
As CISO, in addition to keeping my organization secure, I also need to ensure our security solutions are efficient. IaC helps the people provisioning infrastructure collaborate and reuse components while meeting goals shared across many teams: cloud operations, IT, R&D, security, engineering, compliance and finance, to name a few. It can also help enhance security and data governance compliance.
Even with a standardized IaC workflow and guardrails in place, IaC isn’t a panacea. In the real world, infrastructure will continue to be changed and updated in response to an organization’s goals and unforeseen events. Cases where the infrastructure state changes and no longer matches the one defined in the code, a phenomenon known as “drift,” can undercut the efficiency and security benefits of an IaC solution. That makes resource lifecycle management a key concern, and drift detection is a critical component of that.
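To make the idea concrete, here is a minimal sketch of what detecting drift on a single resource could look like. It assumes the declared configuration and the live configuration are both available as flat dictionaries; the function name `detect_drift` and the sample storage bucket are hypothetical illustrations, not part of any particular IaC tool.

```python
from __future__ import annotations

from typing import Any


def detect_drift(declared: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {attribute: (declared_value, actual_value)} for every mismatch.

    An attribute that exists on only one side is reported with None
    for the side where it is absent.
    """
    drift: dict[str, tuple[Any, Any]] = {}
    for key in sorted(declared.keys() | actual.keys()):
        want = declared.get(key)
        have = actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical storage bucket: the code says it is private, but someone
# changed the access setting by hand outside the IaC workflow.
declared = {"acl": "private", "versioning": True, "region": "us-east-1"}
actual = {"acl": "public-read", "versioning": True, "region": "us-east-1"}

print(detect_drift(declared, actual))
```

In this made-up case the only reported difference is the access setting, which is exactly the kind of change that matters from a security standpoint.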
To maximize the value of Infrastructure-as-Code, it’s important to understand the causes of infrastructure drift, what the impact can be — especially on security — and the best ways to implement drift detection and remediation to help solve the problem.
Drift can occur for many reasons. First off, there may be cases where not everyone in the organization is using the established IaC workflows. That can create unrecorded differences between the infrastructure defined in code and the actual current state.
Emergencies are another common cause. In the midst of a “break-glass incident,” response management teams sometimes decide to bypass standard procedures for patching the infrastructure to fix the problem as quickly as possible. These kinds of shortcuts can cause changes to the resources that are tough to track and resolve in the code.
Basic system updates on cloud or service-provider systems can also accrue over time, resulting in significant drift as your infrastructure rules and provider systems gradually grow apart. For example, simple API changes (often for third-party services) might affect your infrastructure without being tracked in code.
Finally, cascading effects can make drift detection even more complex. Changing or creating infrastructure resources can, for example, bring along unexpected associated resources that aren’t codified. Those resources then change state and affect one another without anyone being aware of it.
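One simple way to surface uncodified resources is to compare the set of resource identifiers defined in code with the set actually running. The sketch below assumes both inventories can be exported as sets of IDs; `compare_inventory` and the resource names are hypothetical.

```python
from __future__ import annotations


def compare_inventory(codified_ids: set[str], live_ids: set[str]) -> dict[str, list[str]]:
    """Split inventory differences into two sorted lists.

    "uncodified": resources that exist but are not defined in code.
    "missing":    resources defined in code that no longer exist.
    """
    return {
        "uncodified": sorted(live_ids - codified_ids),
        "missing": sorted(codified_ids - live_ids),
    }


# Hypothetical example: creating the load balancer also created a
# security group that nobody added to the code.
codified = {"lb-web", "vm-web-1"}
live = {"lb-web", "vm-web-1", "sg-auto-lb"}

print(compare_inventory(codified, live))
```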
As cloud adoption grows, organizational resources and processes become increasingly complex, which can create inconsistencies around the state of the infrastructure. Without standard procedures, notifications or guidelines for adjustments, even temporary changes or the smallest tweaks to infrastructure can have significant impacts on the business, including unplanned downtime, audit findings, security incidents, rework and unused resources.
Most importantly, unrecognized infrastructure drift creates multiple risks that need to be addressed before they become real problems. Drift can dramatically increase the probability of critical data exposures, perhaps due to mission-critical systems left open to public access by mistake or unknown resources left unsecured.
Additionally, development teams unaware of production environment changes not reflected in the IaC systems will almost certainly have to contend with applications “suddenly” crashing and deployment projects that unexpectedly fail.
So, how can organizations best handle drift detection, and what can they do to remediate the situation when drift is detected? Some companies opt to build in-house tooling that checks all states for drift at once and then sends reports via email to all users. But this makes it difficult to differentiate necessary changes from unneeded ones, since there’s no context behind the changes. Plus, it’s up to you to make the manual changes to the resource or the recorded IaC state. This approach is too time-consuming to be scalable.
The underlying solution to these challenges comes down to answering two key questions:
Ultimately, teams concerned with drift should look for integrated drift-detection solutions. Ideally, this type of system would include all-in-one automated provisioning and central management so development teams can continuously monitor the infrastructure state to detect changes. Operating from a consolidated environment, the system should be able to send immediate notifications to the appropriate teams so they can take specific corrective actions any time a resource is altered.
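As a rough illustration of sending notifications to the appropriate teams rather than to everyone, the sketch below groups drift events by an owning team. It assumes each resource carries an owner label; the `DriftEvent` fields, the `route_by_owner` function and the fallback team name are all hypothetical, and a real system would hand each group to its own alerting channel.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftEvent:
    resource_id: str
    attribute: str
    declared: str
    actual: str
    owner: str = ""  # owning team label; empty if the resource is untagged


def route_by_owner(events: list[DriftEvent], default_team: str = "platform") -> dict[str, list[DriftEvent]]:
    """Group drift events by owning team; untagged resources go to default_team."""
    routed: dict[str, list[DriftEvent]] = defaultdict(list)
    for event in events:
        routed[event.owner or default_team].append(event)
    return dict(routed)


events = [
    DriftEvent("bucket-logs", "acl", "private", "public-read", owner="security"),
    DriftEvent("vm-web-1", "instance_type", "small", "large", owner="web"),
    DriftEvent("sg-auto-lb", "ingress", "none", "0.0.0.0/0"),  # no owner tag
]

for team, team_events in route_by_owner(events).items():
    print(team, [e.resource_id for e in team_events])
```

Because each event keeps both the declared and the actual value, the receiving team gets the context it needs to decide whether to revert the change or update the code.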
For CISOs concerned with narrowing security gaps — both the kind they know about and the previously undetectable ones created by infrastructure drift — this type of solution can help strengthen the organization’s overall security posture without adding undue operational burdens.
Specifically, an integrated drift-detection approach could significantly reduce the potential for application downtime that could negatively impact user experience and, eventually, revenue. It can also empower teams to track and quickly address system changes, identify who made them and why, and record those changes for future reference or to adjust the standard workflow as needed.
Finally, a robust drift-detection system can boost operational agility by giving teams a consistent single source of truth from which they can collaborate. Working from the same information avoids the need to buy or develop custom tooling or deal with manual actions to refresh the state — all while granting superior visibility and accelerating time to resolution.
To recap, automated infrastructure provisioning offers significant productivity and security benefits. But what about when your infrastructure changes and the actual state isn’t reflected in the recorded IaC state? Drift is an unfortunate side effect of modern, dynamic infrastructure, where changes are made constantly.
To minimize the impact of infrastructure drift, you need a drift-detection system that gives your operations teams visibility and alerts the appropriate people to take action when needed. Working together systematically under a standardized process with centralized, automated tools promises to reduce risk, deliver greater system visibility and give teams the ability to resolve infrastructure issues more quickly.