Infrastructure: A Customer-First Approach
Akash Saxena
CTO@JioHotStar | CTPO@JioCinema | CTO Excellence Award 2024 | ex-CTO Hotstar[Asia|MENA|SEA] | ex-OpenTable
TL;DR: DevOps is now integral to any Engineering team that wants to minimize the time it spends "fighting the plumbing". Here I talk about how a customer-centric approach is critical to DevOps success inside an Engineering team. Your customers are your engineers, and infrastructure teams ignore them at their own peril.
I feel the need, the need for speed!
The Unexpected Duct Tape
As an engineering leader, one aspect of hygiene is how much time your team spends “fighting the plumbing”. I want my engineers focused on solving customer problems and constantly delivering value to the organization. One aspect of delivery is just the design, coding and quality part; the challenges here are around having the right environment set up to develop and test the feature, and data tends to be a challenge in this phase. The second aspect is the actual deployment and "production readiness" of the feature.
Production readiness gets us into:
- Scoping and understanding performance and scalability requirements
- Ensuring the product runs on the right hardware spec
- Embedding the right monitors to ensure that we have eyes on the system when things go south
- Ensuring that the quality team is able to test in a “near-production-like” environment so that risk is mitigated earlier in the cycle
Here are a few themes which you might have heard:
- It’s taking me a long time to setup my environment to develop this feature
- This feature requires a lot of configuration, so just test it against my machine instead
- I can’t test the feature till Engineer X finishes her testing on our shared environment
- We missed the bug because the feature environment was not updated
…and so on.
How software gets tested and reaches production is critical, and optimizing this path to be drama-free is very important. Often these steps are held together by duct-tape solutions. Until you have a dedicated infrastructure team, or are dedicating enough cycles to this process, the steps remain a bit clunky. Attempts to improve them become ambitious projects that often overlook adoption and team biases, and that will cost you time, effort and morale!
Adoption or Death by Combat!
These are quality problems to solve, though. You want to go fast? Get these roadblocks out of the way first. There are so many ways to crack this problem, and this is where I’ve seen teams get lost and ultimately embark on Siberian death marches. You’ve got to watch for this fork in the road. Talk to your engineering team and don’t assume that everyone is super motivated to embrace DevOps and run that twenty-step process each time they need to spin up a new environment.
Figure out honestly where your team stands and what the level of adoption might be. If you have a small, focused team that has been exposed to a DevOps culture, adoption is going to be much easier. In most teams, however, there is probably only a fleeting knowledge of what a DevOps culture really looks like. You have to live it to know it; reading about it does not help beyond a point.
Pick the path of least resistance and make it super simple for your team to embrace the infrastructure changes that are planned. What tends to happen instead is that the infrastructure team will ideate and get things working, then publish large documents / Wiki articles, and follow them up with talks showing off what was built.
Furthermore, every developer must now set up the tools and environment needed to use the “new” way. The rollout is not thought through, and when people start using the “new” way, things break, fixes are non-trivial, and everything is held up. Meanwhile, there are deadlines to meet, and soon people will find ways around the fancy process you set up.
Ultimately, your team is going to give up and circumvent the changes. The infrastructure team feels cheated because they think the Engineers are ruining their big moment, and ultimately it’s a lose-lose for everyone. If this sounds familiar, then you didn’t think about your team before you rolled something out.
You forgot to make your infrastructure “production ready” and to treat your team as your customers. It’s also educational, because it’s a great case study in rolling out ill-thought-out product changes that we simply assume our real-world customers will use without question.
Never be presumptuous about your customers; that has costly side effects.
Infrastructure As Electricity
In one of our discussions about infrastructure, someone in the room said, “Infrastructure should be like electricity: when you turn a switch, you never doubt that power will flow through”. That characterization of infrastructure has stuck with me. It sums up exactly how we need to approach the adoption of infrastructure changes in our teams. It’s got to be as easy as flicking a switch.
The Heroku model is very nicely done: simplicity itself. Push your code = deploy your code. Now, that’s a model to aspire to. I’ve had good success with a Chat-Ops approach. We built a Slack bot using Hubot and got the whole team to use a dedicated Slack room to manage our deployments. On the back-end, we started by implementing a v1 that we could deploy quickly, using some nginx magic to let one large box power all our deployments. The system worked out great. Our ultimate goal is to Docker-ize the setup, but all of that would be transparent to the end customer: our engineers.
Different environments were created to mimic various degrees of production separation. We started out sharing databases and caches, but those could easily be hived off as well, using a per-environment configuration management store. The environments were hosted on a powerful shared server, and idle processes didn’t take up resources forever. Separately, we were scripting our environment with Ansible so as to stand it up easily. Once this type of setup is operative, any issues are P0s, because your team is going to be blocked.
Adoption was a breeze, because there was nothing to install or learn; it was as simple as sending a chat message: “deploy this branch to this environment.” We hit all our goals: making deployments into environments faster, making the process like “electricity”, and dropping the AWS costs of setting aside machines to host test environments. More importantly, the infrastructure team was free to evolve the back-end at their own pace without materially affecting the team. We’d just add more options to the chat bot and the team would be none the wiser. If anybody was keen on figuring out how things worked, they did it out of their own interest, rather than being forced to understand the nitty-gritty just to use the system.
Flick that Switch
You’ve got some great tech to pick from. Docker is awesome and can really speed things up, but you’ve got to know what you’re doing; Docker adoptions require good thinking through. It’s better to get your process down first and then migrate to Docker, in my opinion. Ultimately, regardless of the technology you pick, be it Vagrant images, base AMIs driven by a CM system, or Docker, the pattern is pretty similar.
The meta point is to always think about how it’ll be used first, and then work your way back to the technology that will power the experience. Infrastructure done right is like a drug: once your team is used to it, you’ll wonder how you ever got anything done without it! So, always think about your customer first and then work your way back to the technology bits. You can always iterate and improve on the engine that nobody sees; get the interfaces right first!