Deployment Resilience
Complete Developer Podcast - A podcast by BJ Burns and Will Gant - Thursdays
Deployment of software is probably one of the most painful parts of the whole process, bar none. Deployment considerations have always been a can of worms for software, and have often shaped the way software is designed, developed, and used to a degree that is hard to overstate. In fact, the success of modern web applications is often attributed to their “relatively simpler” deployment model when compared to traditional desktop and mobile applications. This seems to be true even before you factor in mobile and cross-platform concerns. Deployment is often one of the first places where an application runs into serious problems with load, security, configuration, logging, monitoring, and backward compatibility.

Making your deployment process resilient is not easy. Not only do you potentially have to make changes to a running application while people are using it, but you also have to make sure that you can quickly recover from mistakes, all while keeping the process fast. A resilient deployment process can best be characterized as one where the users have no idea that anything is even going on with the system during most deployments. Additionally, a resilient deployment system does not have notably degraded performance during a deployment, can be rolled back to a previous state with as little work as possible, and can easily be verified for correctness. It’s a tall order, and is likely to be neglected by the people in charge in favor of rolling out yet another feature. But your deployment process is as much a part of your software as any other feature – after all, if the users are thwarted while trying to do their work, that new feature isn’t going to matter.

Deployments can be one of the most challenging parts of writing software. To do them well, you have to walk a tightrope between consistency in application behavior and avoiding application downtime. Depending on what your software does, you will find that you err on one side or the other. However, regardless of which position is better for your organization, getting your deployments right and making them resilient will make it much easier to deliver software to your end users with a minimum of difficulty.

Episode Breakdown

Limit the surface area of deployments

Your source control architecture should not determine how your deployments are structured. You should deploy the minimum amount of stuff and avoid redeploying things that have already been deployed. Besides making the deployment faster, this also limits the number of things that can go wrong and makes it faster to roll back to a previous state.

Towards this end, you should also limit what is being built upstream of a deployment (if your deployment starts when a build is completed). This makes rebuilds and subsequent redeployments quicker, making recovery easier if you have to “roll forward” (which we’d almost never recommend).

When you reduce the amount of stuff you deploy, it also makes troubleshooting and sanity checks easier, because there are fewer variables to consider. This allows you to validate a deployed build for correctness more quickly.
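As a rough illustration of keeping that surface area small, here is a minimal sketch that deploys only the artifacts whose contents actually changed since the last deployment. The artifact layout, the stored hash map, and the deploy_service helper are hypothetical placeholders for whatever your pipeline actually does.

```python
# Sketch: deploy only what changed, assuming each service ships as one build
# artifact and we remember the content hashes of what is currently deployed.
import hashlib
from pathlib import Path

def artifact_hash(path: Path) -> str:
    """Content hash of a build artifact, used to detect real changes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def deploy_service(name: str, artifact: Path) -> None:
    # Placeholder for whatever actually pushes the artifact (copy, container push, etc.).
    print(f"deploying {name} from {artifact}")

def deploy_changed(build_dir: Path, deployed_hashes: dict[str, str]) -> dict[str, str]:
    """Deploy only artifacts whose contents differ from what is already live."""
    new_hashes = dict(deployed_hashes)
    for artifact in sorted(build_dir.glob("*.tar.gz")):
        name = artifact.stem
        digest = artifact_hash(artifact)
        if deployed_hashes.get(name) == digest:
            continue  # unchanged: skip redeploying, keep the surface area small
        deploy_service(name, artifact)
        new_hashes[name] = digest
    return new_hashes
```

The returned hash map doubles as a record of what is live, which is exactly the kind of bookkeeping that makes rollbacks and sanity checks faster.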
Deploy in parallel and cut over when ready

In a production application, especially one that other people rely on, you should do what you can to limit outages due to maintenance windows. However, if you are deploying over the top of an existing deployment, downtime is unavoidable for the duration of the deployment. While speeding up your deployments can certainly help, it’s unlikely that you can speed them up enough to completely avoid all problems. As a result, it’s often better to deploy to a fresh environment and then redirect the existing traffic to it when complete. This approach also means that the old environment stays untouched while you verify that the new environment is working correctly.
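As a sketch of that cut-over pattern (often called blue/green deployment), assume two identical environments sitting behind a router whose active target you can flip; the URLs and the set_router_target helper below are hypothetical stand-ins for your load balancer or proxy.

```python
# Sketch: verify the freshly deployed idle environment, then flip traffic to it.
import time
import urllib.request

ENVIRONMENTS = {
    "blue": "https://blue.internal.example.com",
    "green": "https://green.internal.example.com",
}

def healthy(base_url: str) -> bool:
    """Basic smoke test: the new environment must answer its health endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def set_router_target(env_name: str) -> None:
    # Placeholder for the real cutover: update the load balancer, DNS, or proxy config.
    print(f"routing production traffic to {env_name}")

def cut_over(active: str, idle: str, retries: int = 10) -> str:
    """The new build is already on the idle environment; verify it, then flip traffic."""
    for _ in range(retries):
        if healthy(ENVIRONMENTS[idle]):
            set_router_target(idle)
            return idle  # idle becomes active; the old environment stays intact
        time.sleep(10)
    raise RuntimeError(f"{idle} never became healthy; traffic stays on {active}")
```

Because the previous environment is never modified, rolling back is just a matter of pointing the router back at it.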