Any company that relies on information technology to function relies on a well-planned technology infrastructure. This can cover anything from the email and file servers that support day-to-day office functions to key outputs of the business such as websites, databases and software services. With technology holding so many things together, we must build it to be resilient and recoverable, ensuring business continuity in the face of disaster.
We (hopefully) all appreciate that frequent, offsite backups of data are important for recovering from a catastrophic failure, but we should also treat information about the configuration of our infrastructure with the same care. A lot of thought and planning goes into where data is stored, what kind of resources are required to serve that data, how that data is accessed, and whether there are any mandatory security or compliance requirements. Having no record of this information can severely impact the speed and accuracy with which you can recover from failures, and it makes the maintenance of your servers difficult.
The solution is not just to keep good documentation, but to use tooling that actually deploys the infrastructure from a set of definition files. These files can be stored in a version control system like Git, so every change can be reviewed and its intention documented.
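To make the idea of deploying from definition files a little more concrete, here is a minimal sketch in Python. It is not how any particular tool works internally. The `Server` fields, the server names and the `plan` function are all made up for illustration. The point is only that once the desired state is written down as data, a program can work out what needs to change.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """A hypothetical resource definition; the fields are illustrative only."""
    name: str
    size: str
    region: str


def plan(desired: list[Server], actual: list[Server]) -> list[str]:
    """Return the actions needed to make `actual` match `desired`."""
    desired_by_name = {s.name: s for s in desired}
    actual_by_name = {s.name: s for s in actual}
    actions = []
    # Anything defined but not running is created; anything that differs is updated.
    for name, server in sorted(desired_by_name.items()):
        if name not in actual_by_name:
            actions.append(f"create {name}")
        elif actual_by_name[name] != server:
            actions.append(f"update {name}")
    # Anything running that is no longer defined is removed.
    for name in sorted(actual_by_name.keys() - desired_by_name.keys()):
        actions.append(f"delete {name}")
    return actions


# The definitions that would live in version control (made-up values).
DESIRED = [
    Server("web-1", size="small", region="eu-west"),
    Server("db-1", size="large", region="eu-west"),
]

# What an imaginary provider API reports as currently running.
ACTUAL = [
    Server("web-1", size="medium", region="eu-west"),
    Server("old-batch", size="small", region="eu-west"),
]

if __name__ == "__main__":
    for action in plan(DESIRED, ACTUAL):
        print(action)
```

Running the plan a second time against infrastructure that already matches the definitions produces no actions, which is the property that makes this style of tooling safe to run repeatedly.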
This paradigm is known as “Infrastructure as Code (IaC)”.
This blog post is a brief, high-level tour of how following the IaC paradigm allows you to build robust, resilient infrastructure that can be recovered quickly and easily.
Before we dive head first into the whys and the hows, it’s worth addressing some of the limitations of defining all of your infrastructure as code.
When using cloud platforms you may also find that, particularly for beta features, there are certain things that can’t be configured through the APIs. If you can’t do it with the API, you can’t do it as code, so it will have to be done manually and carefully documented.
There are a few broad categories of tools used for IaC, although there is some overlap between them. The following is an unscientific attempt at classification:
You can also use configuration management tools to pre-build so-called “golden images” which can then be the template for any virtual machine you create. To deploy faster and more consistently, use tools like Packer to build template images rather than automatically building the configuration from scratch every time you bring up a server. Building template images allows you to avoid situations where dependencies are no longer available, which would prevent you from being able to quickly recreate your server.
Infrastructure orchestration tools such as Terraform, CloudFormation (for AWS only), and Ansible (again) hook into the APIs of cloud platforms (and potentially other service providers) to create infrastructure. Examples of things you can define with these tools are:
In a typical setup you will have a Dockerfile that specifies a base operating system image along with instructions for automatically building and installing an application and all of its dependencies. This is built into an image, which can then be run on Kubernetes as a “container”. Kubernetes has its own configuration files (also known as “manifests”), which encode all sorts of information about how the application is expected to be deployed.
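As a rough illustration of what a manifest encodes, the sketch below builds a small Kubernetes Deployment as a Python dictionary and prints it as JSON (Kubernetes accepts manifests in JSON as well as the more common YAML). The application name, image reference and replica count are invented for the example, and a real manifest would usually carry much more, such as resource limits and health checks.

```python
import json


def deployment_manifest(name: str, image: str, replicas: int) -> dict:
    """Build a minimal apps/v1 Deployment for a single-container application."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            # How many copies of the container should be running.
            "replicas": replicas,
            # The selector must match the labels on the pod template below.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        # The image is whatever was built from the Dockerfile.
                        {"name": name, "image": image}
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical application and registry, for illustration only.
    manifest = deployment_manifest(
        name="shop-api",
        image="registry.example.com/shop-api:1.4.2",
        replicas=3,
    )
    print(json.dumps(manifest, indent=2))
```

Because the output is plain text, it can be committed and reviewed like any other code.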
Deploying applications in this way means that all dependencies and deployment considerations can be clearly defined in code, and run on a generic platform supported by multiple cloud providers (many of which offer Kubernetes as a service).
The term “snowflake server” perfectly describes what you get when a server is uniquely configured by hand by an engineer (no matter their experience) and placed into a production environment. Unless that engineer has diligently documented every step of the process used to configure and maintain that server, and later configuration changes have been just as diligently documented (we are starting to move into fantasy territory here), it is nothing but a liability. The day that snowflake server fails and needs rebuilding from scratch is the day you learn that the linchpin of your business is a dusty box in the corner that nobody knows anything about.
You can easily extend this “snowflake” metaphor to the orchestration of the infrastructure itself, especially when you are running hundreds or even thousands of machines. In an enterprise of this size, hundreds of hours of engineering effort are sunk into carefully optimising the size and shape of the servers to suit different workloads, not to mention the time spent analysing traffic and configuring things like network-based firewalls to ensure security across the infrastructure. In the event of a catastrophic failure, recreating this by hand could take weeks of effort, and even then you won’t know whether you missed some important configuration. Changing configuration in such a fragile environment is also likely to break things, since the intentions behind the original setup are lost when there is no history to dig into (something that version control tools provide).
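The "did we miss something?" problem becomes checkable once the configuration exists as data. The following is a small sketch, not a real tool: it compares a recorded definition against the state of a rebuilt environment and lists every setting that differs. The configuration keys and values are invented for the example.

```python
def find_drift(recorded: dict, live: dict, path: str = "") -> list[str]:
    """List every setting where the live state differs from the recorded one."""
    drift = []
    for key in sorted(recorded.keys() | live.keys()):
        location = f"{path}.{key}" if path else key
        if key not in live:
            drift.append(f"{location}: missing from live environment")
        elif key not in recorded:
            drift.append(f"{location}: not in recorded definition")
        elif isinstance(recorded[key], dict) and isinstance(live[key], dict):
            # Descend into nested sections such as a firewall block.
            drift.extend(find_drift(recorded[key], live[key], location))
        elif recorded[key] != live[key]:
            drift.append(
                f"{location}: expected {recorded[key]!r}, found {live[key]!r}"
            )
    return drift


# What the definition files say (hypothetical values).
RECORDED = {
    "firewall": {"allow_ssh_from": "10.0.0.0/8", "allow_https": True},
    "web": {"size": "large", "count": 4},
}

# What a hand-rebuilt environment actually looks like.
REBUILT = {
    "firewall": {"allow_ssh_from": "0.0.0.0/0", "allow_https": True},
    "web": {"size": "large"},
}

if __name__ == "__main__":
    for problem in find_drift(RECORDED, REBUILT):
        print(problem)
```

Without the recorded definition there is nothing to compare against, which is exactly the position a hand-built environment leaves you in.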
Avoiding fragile snowflake deployments is key to recovering from catastrophic failures and ensuring business continuity.
Having your infrastructure stored as code allows you to use a version control system like Git. This means changes can be peer-reviewed before they are deployed to the live infrastructure. Formalising the review process in this way can help to avoid mistakes being made when changing unfamiliar parts of the system.
Version-controlled infrastructure also lends itself to applying the final changes through a CI/CD pipeline, which can somewhat reduce the possibility of human error.
Finally, your version control system will keep an auditable record of the changes that have been made, and investigating the impact of historical changes is therefore easier. It also makes rolling back to the last good configuration trivial.
On top of the benefit of having all of your existing infrastructure well defined, IaC also makes it much easier to create new, similar infrastructure with a little bit of copy and paste. You could even clone entire parts of your infrastructure into other cloud regions with very little effort.
Being able to quickly clone resources is useful for testing in production-like environments, and discourages testing changes directly on production servers.
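Here is a small sketch of why cloning gets cheap once infrastructure is data. It assumes a made-up `Resource` type and made-up region names. Real tools have their own mechanisms for this, such as variables or modules, so treat it as an illustration of the idea rather than a recipe.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Resource:
    """A hypothetical resource definition; the fields are illustrative only."""
    name: str
    kind: str
    region: str


def clone_stack(resources: list[Resource], region: str, prefix: str) -> list[Resource]:
    """Copy a set of definitions into another region under a new name prefix."""
    return [
        replace(r, name=f"{prefix}-{r.name}", region=region)
        for r in resources
    ]


# The existing definitions (invented names and regions).
PRODUCTION = [
    Resource("web", kind="server", region="eu-west-1"),
    Resource("orders", kind="database", region="eu-west-1"),
]

if __name__ == "__main__":
    # A production-like copy to test against, leaving PRODUCTION untouched.
    staging = clone_stack(PRODUCTION, region="us-east-1", prefix="staging")
    for resource in staging:
        print(resource)
```

The copy has the same shape as production but never shares a name or region with it, which is what makes it a safe place to try changes first.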
Having a strong review process with small, easy-to-review changes makes changing configuration much less risky. This higher confidence in making lots of small changes also means a quicker turnaround for otherwise complex changes.
Having all the information about how your servers are configured living inside one person’s head means that if that person disappears, you will be left scrambling when things go wrong. Making sure everything is in code means that anyone else can pick up the torch and quickly get to grips with how things work. This also means that new team members are able to hit the ground running.
Hopefully you’re now convinced that defining your infrastructure as code is an essential part of any business continuity plan. So where do you go from here?
In conclusion, the next time you find yourself logged into a server, manually editing the configuration, step away and spin up your favourite code editor instead!