Only Human - How to Prevent the Most Common Compliance Violations

Earlier this year, GitLab experienced a major database incident in which roughly 300GB of production data was deleted, about six hours of database data was permanently lost and GitLab.com was offline for around 18 hours.
GitLab was very open and apologetic about the incident, citing a combination of human and system errors for the catalog of disasters. GitLab published a full post-mortem analysis, but to cut a long story short, an engineer had a lapse in concentration and ran a delete command on the wrong server, a mistake that was compounded by failed backup processes that nobody had noticed.

Developing a Continuous Compliance Framework

To err is human, but letting errors propagate until they affect uptime and customer satisfaction is avoidable if you have a solid continuous compliance framework in place. One of the biggest threats within a DevOps environment is change: configuration changes, software changes and network changes. If managed poorly, change can cause downtime and instability. By managing change effectively and building a continuous compliance framework, you can catch the most common ops mistakes before they have a chance to compromise your systems.
Here are some of the common factors that can lead to compliance violations and what you can do to prevent them.

1. Keep Test and Production Environments Separate

The GitLab engineer mentioned above did not realize that he was working in the production environment when he made the change. For this reason, it is essential to host test and production environments on separate infrastructure with different user credentials. Test environments should replicate production as closely as possible for consistency and accuracy, while remaining completely separate from it.
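
Separate credentials can be backed up by a guard in the tooling itself. The following is a minimal Python sketch, not a description of GitLab's tooling: the function names, the DEPLOY_ENV variable and the example hostnames are all assumptions made for illustration. It refuses a destructive action on production unless the operator retypes the exact hostname, and treats an unset environment as production so that it fails closed.

```python
import os


class ProductionGuardError(RuntimeError):
    """Raised when a destructive action targets production without confirmation."""


def current_environment(default: str = "production") -> str:
    # DEPLOY_ENV is a hypothetical variable name. If it is unset, assume
    # production so that the guard errs on the side of caution.
    return os.environ.get("DEPLOY_ENV", default)


def guard_destructive_action(environment: str, hostname: str,
                             confirmation: str = "") -> None:
    """Return normally if the action may proceed, otherwise raise."""
    if environment.lower() != "production":
        return  # test and staging hosts need no extra confirmation
    if confirmation != hostname:
        raise ProductionGuardError(
            f"Refusing to run on production host {hostname!r}: "
            "retype the hostname to confirm."
        )
```

A destructive script would call guard_destructive_action(current_environment(), hostname, typed_hostname) before doing any work, so a command issued in the wrong terminal stops with an error rather than running.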

2. Incorrect Configuration Management

When using infrastructure-as-code, it is inevitable that some server configurations will pose problems. For this reason, it is essential that you have a means of restoring servers as quickly as possible. Tools like Chef let ops teams define server configuration as code, so a misconfigured server can be rebuilt from a known definition rather than repaired by hand.
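
The idea behind this kind of restore can be sketched in a few lines of Python. This is not Chef's API (Chef recipes are written in Ruby); the function names and the settings in the usage note are hypothetical. The sketch compares a declared, desired configuration with what a server actually has and produces the steps needed to bring the server back in line.

```python
from typing import Dict, List, Tuple

Config = Dict[str, str]
Step = Tuple[str, str, str]  # (action, key, value)


def plan_changes(desired: Config, actual: Config) -> List[Step]:
    """List the steps that bring `actual` in line with `desired`."""
    steps: List[Step] = []
    for key, value in sorted(desired.items()):
        if key not in actual:
            steps.append(("add", key, value))
        elif actual[key] != value:
            steps.append(("update", key, value))
    # Anything present on the server but absent from the definition is drift.
    for key in sorted(actual.keys() - desired.keys()):
        steps.append(("remove", key, actual[key]))
    return steps


def apply_changes(actual: Config, steps: List[Step]) -> Config:
    """Return a new configuration with the planned steps applied."""
    result = dict(actual)
    for action, key, value in steps:
        if action == "remove":
            result.pop(key, None)
        else:
            result[key] = value
    return result
```

For example, with a desired state of {"max_connections": "100", "ssl": "on"} and a drifted server reporting {"max_connections": "8000", "debug": "true"}, the plan is one update, one add and one remove, and applying it yields the desired state again. Running the plan a second time produces no steps, which is what makes the restore repeatable.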

3. Inconsistent Deployment

Consistency is everything during testing and deployment. Every piece of code needs to be tested and deployed in the same manner wherever possible. While standardizing your deployment methods may take time, once the foundation is in place it is repeatable, scalable and predictable. You should also ensure that anyone with deployment rights also has the rights to roll back any code they deploy.
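
One way to picture a standardized deployment with rollback built in is the Python sketch below. The Deployer class and the health check passed to it are hypothetical stand-ins rather than any particular tool's interface: every release goes through the same deploy step, and a release that fails its health check is rolled back automatically by the same object that deployed it.

```python
from typing import Callable, List, Optional


class Deployer:
    """Deploys versions one way only and keeps a history for rollbacks."""

    def __init__(self, health_check: Callable[[str], bool]) -> None:
        self._health_check = health_check
        self._history: List[str] = []

    @property
    def current(self) -> Optional[str]:
        """The version currently live, or None if nothing is deployed."""
        return self._history[-1] if self._history else None

    def deploy(self, version: str) -> bool:
        """Deploy `version`; roll back and return False if it is unhealthy."""
        self._history.append(version)
        if self._health_check(version):
            return True
        self.rollback()
        return False

    def rollback(self) -> Optional[str]:
        """Drop the live version and return the one that is live afterwards."""
        if self._history:
            self._history.pop()
        return self.current
```

Because rollback is a method on the same object as deploy, whoever is able to deploy is also able to roll back.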

4. Third-Party Vendors or Partners Fail

While you may have worked hard to get your own compliance house in order, things can fall apart when you partner with a third party that isn't as stringent in its approach. For this reason, it is essential that you work closely with any third parties, determine SLAs in advance and ensure they are working to the same high standards as your own in-house teams.

5. Automate Routine Tasks Where Possible

It is often the manual, routine and time-consuming tasks that fall victim to human error or that hold up the continuous delivery pipeline. Humans become complacent with the things they do day in and day out, and mistakes can happen. But by automating some of the most common DevOps tasks, you can take control of these elements and do more to reduce downtime and errors.
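
Checking that backups actually exist is a good candidate for this kind of automation, since a failed backup process that nobody notices is exactly what made the GitLab incident worse. Here is a minimal Python sketch, assuming a backup is a single file on disk; the function names, the thresholds and the path in the usage note are illustrative assumptions. It flags backups that are missing, empty or stale.

```python
import os
import time
from typing import List, Optional


def backup_problems(size_bytes: int, modified_at: float, now: float,
                    max_age_seconds: float = 24 * 3600,
                    min_size_bytes: int = 1) -> List[str]:
    """Return human-readable problems; an empty list means the backup looks fine."""
    problems: List[str] = []
    if size_bytes < min_size_bytes:
        problems.append(f"backup is only {size_bytes} bytes")
    age = now - modified_at
    if age > max_age_seconds:
        problems.append(f"backup is {age / 3600:.1f} hours old")
    return problems


def check_backup_file(path: str, now: Optional[float] = None) -> List[str]:
    """Check the file at `path` using its size and modification time."""
    try:
        info = os.stat(path)
    except FileNotFoundError:
        return ["backup file is missing"]
    current_time = time.time() if now is None else now
    return backup_problems(info.st_size, info.st_mtime, current_time)
```

Run on a schedule, a call such as check_backup_file("/var/backups/db.dump") would return a list of problems to alert on. A size and age check does not prove a backup can be restored, so periodic restore tests are still worth automating separately.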

Final Thoughts

DevOps will never be free from all errors. Whether they are generated by humans or systems, they will always occur to some degree. However, being able to restore, roll back or rebuild environments, automate common tasks and prevent errors from occurring in the first place will help you to significantly improve performance, uptime, deployment pipelines and the customer experience overall.