What to do when you break production

Breaking production can be one of the worst feelings of your professional career. There are several strategies you can follow, from a hotfix to a rollback, to fix that little bug 🐛 that gave your production environment the hiccups.

What does it feel like to break production?

Everything is going well at work. You go to meetings like every other day and get down to solving the day's tasks without knowing what's coming.

In your local environment everything looks good. QA has signed off on your changes, all tests have passed and your pull request is ready to see the light of day.

After a long day of work you feel happy to finish and finally release to production the code you've been working on.

The deployment goes well and you feel great until... you see an error that's breaking the platform. You've broken production for the first time!

Breaking the platform you work on for the first time is possibly one of the most unpleasant feelings you can have. Your stress shoots through the roof, you feel like a clock is counting down to your firing and you feel like a failure. Believe me, it's not the end of the world and everything will be okay.

How to fix production when it's already down?

The first thing you need to do is calm down. Understand that you're human and you'll make mistakes throughout your professional career.

Once you're calm it's time to create a strategy to fix what's broken together with your whole team.

1. Identify where the error report is coming from

Reports of a production error can arrive seconds after the deployment finishes, or hours later when multiple users run into side effects elsewhere in the platform.

Reports may come from users on social media like Twitter or Reddit. Also take a look at the system logs to trace the source of the failure.

Failures are also often reported by your company's customer service department, which usually communicates the problem directly to the development team.

Once you're clear on what's happening, it's time to fix it with your team.

2. Communicate to your team that there's a problem

Errors that break production have high priority, so the simplest way to coordinate a fix with your team is to open a war room: a call where developers from different teams discuss the best solution to the problem.

3. Define a strategy to fix the problem

Once you're on the call with your team, determine what would be best to fix the problem. Usually there are three options: a hotfix, a revert or a rollback. Here's what each one is for.

Hotfix

As the name says it's a "fix on the fly"—a quick fix that's sent to users immediately.

This is a very good strategy when you've already identified the problem and know how to fix it.

Just fix your code and deploy.

⚠️ Make sure your fix still works if you have cache policies in place or if the deployment changed a table model in the database.

Revert

The revert consists of undoing the commit or series of commits that went out in the last deployment. This is done with git's revert command.

This solution is good when you're not clear on what's failing in the system or you know the fix will take a lot of time and effort.

Doing a revert creates a new commit that applies the inverse of those changes, so the history stays intact. Just deploy that commit and you're done 🚀
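
To make the revert concrete, here is a minimal Python sketch that builds the git revert command for the range of commits that went out in a deployment. The function names and the placeholder SHAs are made up for illustration, and the sketch assumes you run it inside the repository with a clean working tree.

```python
import subprocess


def build_revert_command(oldest_sha: str, newest_sha: str) -> list[str]:
    """Build a `git revert` call covering oldest_sha through newest_sha."""
    # `A^..B` selects every commit after A's parent, up to and including B.
    # --no-edit keeps git's default revert messages instead of opening an editor.
    return ["git", "revert", "--no-edit", f"{oldest_sha}^..{newest_sha}"]


def revert_deployment(oldest_sha: str, newest_sha: str, dry_run: bool = True) -> list[str]:
    """Return the revert command and run it only when dry_run is False."""
    command = build_revert_command(oldest_sha, newest_sha)
    if not dry_run:
        # check=True raises CalledProcessError if git stops (for example on a conflict).
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # Placeholder SHAs: replace them with the commits from the bad deployment.
    print(" ".join(revert_deployment("a1b2c3d", "d4e5f6a")))
```

Keep in mind that reverting a range produces one revert commit per original commit, and that reverting a merge commit also needs git's -m option to pick the parent to keep.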

Rollback

Doing a rollback is common when we did a database migration wrong or incorrectly modified a model.

This is one of the hardest solutions to execute but necessary for extraordinary cases—like a feature that was worked on by multiple teams and went wrong. The term rollback is usually used for databases but nowadays it can cover many more types of systems.

The rollback consists of undoing the changes made to the system not only at the code level—it also means going back to previous versions of other parts of the system or infrastructure, like the database or third-party systems used by multiple teams.

It's a solution that requires a lot of orchestration to return the system completely to its previous state—this is where the war room really shines.

4. Verify that the system is up and fully functional

Whichever system was affected, run smoke tests as soon as your fix is live. These tests verify the main functionality of the system and confirm that the error is no longer present.
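
As a rough idea of what an automated smoke test could look like, the sketch below requests a couple of critical URLs and reports which ones answer with a successful status. The URLs, the check names and the function names are all hypothetical; your real checks would cover whatever the main flows of your platform are.

```python
from typing import Callable
from urllib.error import HTTPError
from urllib.request import urlopen

# Hypothetical endpoints: replace them with the critical paths of your platform.
SMOKE_CHECKS = {
    "home page": "https://example.com/",
    "health check": "https://example.com/health",
}


def fetch_status(url: str) -> int:
    """Return the HTTP status code for url, or 0 if the request fails."""
    try:
        with urlopen(url, timeout=5) as response:
            return response.status
    except HTTPError as error:
        # urlopen raises HTTPError for 4xx and 5xx answers; keep the code.
        return error.code
    except OSError:
        # Covers connection errors and timeouts.
        return 0


def run_smoke_tests(
    checks: dict[str, str],
    fetch: Callable[[str], int] = fetch_status,
) -> dict[str, bool]:
    """Map each check name to True when its endpoint answers with a 2xx status."""
    return {name: 200 <= fetch(url) < 300 for name, url in checks.items()}
```

Calling run_smoke_tests(SMOKE_CHECKS) returns a dictionary of check names and results, so any False value tells you where to look first. The fetch parameter exists only so the function can be exercised without touching the network.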

5. Keep an eye on the console logs

System logs are those little messages the console shows when an action happens or something runs—like a request or an error. There are services dedicated to managing and monitoring this, like Sentry or Datadog.

When your fix is in production, keep an eye 👁 on these systems to make sure no new problems have come up.
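
Monitoring services do this for you, but the idea can be sketched in a few lines of Python: count the error messages logged since the fix went out and see whether anything new shows up. The log format here ("timestamp LEVEL message") and the function name are assumptions for illustration, not the format of any particular tool.

```python
from collections import Counter
from datetime import datetime


def errors_since(log_lines: list[str], deployed_at: datetime) -> Counter:
    """Count ERROR messages logged at or after deployed_at.

    Assumes a hypothetical line format: "<ISO timestamp> <LEVEL> <message>".
    """
    counts: Counter = Counter()
    for line in log_lines:
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue  # Skip lines that do not match the assumed format.
        timestamp, level, message = parts
        try:
            logged_at = datetime.fromisoformat(timestamp)
        except ValueError:
            continue  # Skip lines whose first field is not a timestamp.
        if level == "ERROR" and logged_at >= deployed_at:
            counts[message] += 1
    return counts
```

An empty counter after the deployment is a good sign; a message that keeps climbing means the war room is not over yet.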

What to do to avoid breaking production in the future?

Well, once you've solved the problem with your team I strongly recommend writing a document called a post mortem ☠️.

This document is written like a log: it describes the incident and explains in detail, minute by minute, what happened, from the initial report of the problem to the moment when the system was completely stable and working one hundred percent. 💯

Important information this document can include:

Time of the incident
Number of users affected
Systems or functionalities compromised
Time the system stayed "down"
Team members involved
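
If your team prefers structured records over free-form documents, the fields above could be captured in something like the following Python sketch. The class name, the field names and every value in the example are made up for illustration; a real post mortem usually lives in a shared document and carries far more context.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class PostMortem:
    started_at: datetime            # Time of the incident
    resolved_at: datetime           # Moment the system was stable again
    users_affected: int
    systems_compromised: list[str]
    team_members: list[str]
    timeline: list[tuple[datetime, str]] = field(default_factory=list)

    @property
    def downtime(self) -> timedelta:
        """Time the system stayed down."""
        return self.resolved_at - self.started_at

    def log(self, when: datetime, what: str) -> None:
        """Add a timeline entry and keep the entries in chronological order."""
        self.timeline.append((when, what))
        self.timeline.sort()


# Made-up values, purely for illustration.
report = PostMortem(
    started_at=datetime(2024, 5, 1, 10, 5),
    resolved_at=datetime(2024, 5, 1, 10, 50),
    users_affected=120,
    systems_compromised=["checkout"],
    team_members=["Ana", "Luis"],
)
report.log(datetime(2024, 5, 1, 10, 50), "Revert deployed, smoke tests pass")
report.log(datetime(2024, 5, 1, 10, 5), "First user report arrives")
print(report.downtime)  # 0:45:00
```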

Finally, understand that these things happen and this may not be the last time it happens to you. It's part of your growth as a professional and I'm sure you'll learn a lot from these mistakes.

Try to have enough psychological safety with your team to speak openly about these failures and grow together. ✨