Our ephemeral dev environments, driven by Slack
In one of my recent posts (https://www.dhirubhai.net/pulse/how-we-deploy-production-avi-zurel/), I shared how we deploy to production.
Obviously, that is the last step in a long process that required very hard work from a lot of talented people.
One of the drivers of this change was our ability to test code changes in isolation, run automation on them, and move on with confidence to the next stage.
Before this change, we had no way to test features in isolation.
Why is there such a testing challenge?
Globality runs a distributed system: lots of microservices talking to each other, from the frontend to the backend, through messaging systems and buses.
To fully test a change, you need to deploy or simulate a fully working environment.
Yes, this is not ideal. In a perfect world, we would not need this at all. That is the direction we are heading in, but for now we have to work within this limitation.
The before
Before moving to CD, we deployed every 2 weeks. During those 2 weeks, everyone deployed to a single `dev` environment. At the end of the cycle, we moved all the changes committed to `dev` during those two weeks into an integration environment.
This proved to be challenging and frustrating.
We kept ending up with features that were not ready by the rigid cutoff time. Even when we tried to be flexible with the cutoff, it still did not work.
The solution
When we started the transition to CD, we knew we needed a testing mechanism that would allow engineers to test complete features in isolation, so that feature X could not affect feature Y.
Our theory was that this would bring predictability and more stable features.
devX
We now have X dev environments, each isolated from the others.
You ask a Slack bot for a dev environment like this:
@glo dev for feature/GLOB-1234-feature-name
@glo replies:
I assigned dev6 to feature/GLOB-1234-feature-name
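To make the request/reply flow above concrete, here is a minimal Python sketch of how a bot like @glo might parse the command and hand out a free environment from a fixed pool. The command regex, `DevEnvPool`, and `handle_message` are hypothetical names for illustration, not Globality's actual bot code; a real bot would also persist assignments and talk to the Slack API.

```python
import re
from typing import Dict, List, Optional

# Hypothetical command format, modeled on the "@glo dev for <branch>" example.
COMMAND_RE = re.compile(r"^@glo dev for (?P<branch>\S+)$")


class DevEnvPool:
    """A fixed pool of isolated dev environments (hypothetical, in-memory only)."""

    def __init__(self, env_names: List[str]) -> None:
        self.env_names = env_names
        self.assignments: Dict[str, str] = {}  # env name -> branch

    def assign(self, branch: str) -> Optional[str]:
        # Reuse the existing environment if this branch already has one.
        for env, assigned_branch in self.assignments.items():
            if assigned_branch == branch:
                return env
        # Otherwise hand out the first free environment, if any is left.
        for env in self.env_names:
            if env not in self.assignments:
                self.assignments[env] = branch
                return env
        return None


def handle_message(pool: DevEnvPool, text: str) -> str:
    """Turn a chat message into the bot's reply."""
    match = COMMAND_RE.match(text.strip())
    if match is None:
        return "Usage: @glo dev for <branch>"
    branch = match.group("branch")
    env = pool.assign(branch)
    if env is None:
        return f"No dev environment is free for {branch} right now"
    return f"I assigned {env} to {branch}"


pool = DevEnvPool([f"dev{i}" for i in range(1, 7)])
# In this toy pool everything is free, so the first slot (dev1) is assigned.
print(handle_message(pool, "@glo dev for feature/GLOB-1234-feature-name"))
# prints: I assigned dev1 to feature/GLOB-1234-feature-name
```

Reusing an existing assignment keeps repeated requests for the same branch idempotent, and returning no environment when the pool is exhausted reflects that there is a fixed number of dev environments.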
From now on, anything you push to the matching branch deploys to the dev6 environment. It is a fully functional environment that mirrors production (scaled down, of course), with the same message buses and so on.
We control the cost of these environments with a smart cluster solution (I wrote about it here: https://www.kensodev.com/posts/2020/03/24/globality-flexible-cluster-management-solution/).
By default, an environment lives for 3 hours. After that, it is shut down and becomes available to be assigned to another branch.
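The lifecycle described above, where pushes to the assigned branch deploy to its environment and the environment is reclaimed after 3 hours, can be sketched as a pool of time-limited leases. This is a hypothetical Python illustration: `LeasedEnvPool`, `Lease`, and `target_for_push` are made-up names, and "shut down" is reduced to dropping the lease, whereas the real system also has to tear down or scale down the underlying resources.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional

# Default lifetime from the post; everything else here is hypothetical.
DEFAULT_TTL = timedelta(hours=3)


@dataclass
class Lease:
    branch: str
    expires_at: datetime


class LeasedEnvPool:
    def __init__(self, env_names: List[str], ttl: timedelta = DEFAULT_TTL) -> None:
        self.env_names = env_names
        self.ttl = ttl
        self.leases: Dict[str, Lease] = {}  # env name -> active lease

    def release_expired(self, now: datetime) -> List[str]:
        """Drop leases whose TTL has passed; a real system would also shut the env down."""
        expired = [env for env, lease in self.leases.items() if lease.expires_at <= now]
        for env in expired:
            del self.leases[env]
        return expired

    def target_for_push(self, branch: str, now: datetime) -> Optional[str]:
        """Return the environment a push to `branch` should deploy to, if it has a live lease."""
        self.release_expired(now)
        for env, lease in self.leases.items():
            if lease.branch == branch:
                return env
        return None

    def assign(self, branch: str, now: datetime) -> Optional[str]:
        """Give `branch` a free environment for `ttl`, or return the one it already holds."""
        existing = self.target_for_push(branch, now)  # also releases expired leases
        if existing is not None:
            return existing
        for env in self.env_names:
            if env not in self.leases:
                self.leases[env] = Lease(branch, now + self.ttl)
                return env
        return None


start = datetime(2020, 1, 1, 9, 0)
pool = LeasedEnvPool(["dev6"])
print(pool.assign("feature/GLOB-1234-feature-name", start))  # dev6
print(pool.target_for_push("feature/GLOB-1234-feature-name", start + timedelta(hours=1)))  # dev6
print(pool.target_for_push("feature/GLOB-1234-feature-name", start + timedelta(hours=3)))  # None
print(pool.assign("feature/GLOB-5678-other", start + timedelta(hours=3)))  # dev6
```

In this sketch, reusing an existing lease does not extend it; whether a new request should renew the 3-hour window is a design choice the sketch leaves out.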
Testing
We have UI automation, and you can use the same bot to run it against your environment. Getting the environment "green" is the only way to move on to the next step.
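As a rough illustration of the "green" gate, here is a hypothetical Python sketch. How green is actually computed is not described here; treating it as "the environment has results and the latest run of every suite passed" is an assumption made for this example, and `AutomationRun` and `is_green` are made-up names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AutomationRun:
    environment: str  # e.g. "dev6"
    suite: str        # hypothetical suite name, e.g. "ui"
    passed: bool


def is_green(runs: List[AutomationRun], environment: str) -> bool:
    """True only if the environment has results and the latest run of every suite passed.

    `runs` is assumed to be in chronological order, so a later re-run of a suite
    overrides an earlier failure of the same suite.
    """
    latest: Dict[str, bool] = {}
    for run in runs:
        if run.environment == environment:
            latest[run.suite] = run.passed
    return bool(latest) and all(latest.values())


runs = [
    AutomationRun("dev6", "ui", False),
    AutomationRun("dev6", "ui", True),  # re-run after a fix
    AutomationRun("dev5", "ui", False),
]
print(is_green(runs, "dev6"))  # True
print(is_green(runs, "dev5"))  # False
```

Requiring at least one result means an environment nobody has tested can never count as green by default.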
Summing up
Having these ephemeral environments allowed us to move quickly to CD. They give us more stable features and more predictable delivery dates, and they removed the dependencies between teams working on separate features.
Questions? Comments? Happy to discuss.