Top Five Engineering KPI For a Spin Off

Naveen S.R

Heading the AirAsia Move Flights Engineering Team | Engineering Leader | Travel Anchored SuperApp

Published: June 15, 2023

In any engineering organisation, the initiatives we embark on have to be measured with quantifiable metrics to understand how far we have moved and what impact they have had on the business.

Engineering KPIs play an essential part in measuring success over time. While there are numerous engineering KPIs available, it is crucial to choose the ones that align with the organisation’s maturity level.

This article focuses on the top KPIs that created value as airasia SuperApp evolved from a startup to a spin-off on the journey of building a travel-anchored super app.

COST

Managing costs effectively can be a complex and multifaceted challenge, encompassing a wide range of metrics and key performance indicators. However, we’ve distilled the topic down to its most impactful and relevant initiatives:

a. Operational Cost:

In the journey of a product evolving from a startup to a spin-off, optimizing for cost is a crucial step. Rapid traffic spikes, detailed traces for new product launches, and the choice of infrastructure can quickly burn through your budget.

Redefining the infra choices as you grow

Our humble journey on the Google Cloud Platform to build a super app started very small, with Google App Engine as the platform of choice when our team size was less than 10. Frequent canary releases to test new release cycles, together with mono repositories, were the need of the hour.

When the company vision quickly evolved towards becoming a pure-play super app, we had to scale beyond a flight booking platform to support Hotels, SNAP, Rides, and over eighteen lines of business.

With a tribe model of development and the need for containerised workloads, we moved to Google Kubernetes Engine (GKE). Soon our revenue management was on the critical path and demanded high availability, so the GKE regional option was the natural next step.

As revenge travel set in, along with an expectation of 10X growth, we struggled to build a large, effective SRE team. It made a lot of sense for us to abstract away infrastructure management, and GKE Autopilot brought value in managing both talent density and our infrastructure of choice.

Google Cloud Run is another effective container option that lets you scale from 1 to 10,000 instances within a few minutes if you want to optimize for both surge and cost.

Choosing infrastructure to match our growth helped us bring down the cost by almost 40% on GKE.

“Optimisation for operational cost is the objective to maximize your utilisation of the infrastructure for what you pay”

Two cluster design strategy

Identify your core and non-core workloads and migrate all fault-tolerant non-core workloads to a secondary cluster. Running the secondary cluster on Spot VMs yields very high savings. Examples of non-core workloads: schedulers that pre-warm caches, metadata management, internal audit tools, insights metrics, etc. Using Spot VMs can bring down your overall cost by over 50%.
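The split can be reasoned about with simple arithmetic. The sketch below is illustrative only: the workload names, the monthly costs, and the 60% Spot discount are hypothetical assumptions rather than figures from our platform, and real Spot discounts vary by machine type and region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    monthly_cost: float    # on-demand cost (hypothetical numbers)
    fault_tolerant: bool   # can it survive a node being preempted?
    core: bool             # is it on the critical customer path?


def plan_clusters(workloads, spot_discount):
    """Split workloads into a primary and a Spot-backed secondary cluster.

    Only fault-tolerant, non-core workloads move to the secondary cluster.
    Returns (primary, secondary, estimated_total_monthly_cost).
    """
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    primary = [w for w in workloads if w.core or not w.fault_tolerant]
    secondary = [w for w in workloads if not w.core and w.fault_tolerant]
    total = sum(w.monthly_cost for w in primary)
    total += sum(w.monthly_cost * (1.0 - spot_discount) for w in secondary)
    return primary, secondary, total


# Hypothetical workloads, for illustration only.
workloads = [
    Workload("search-api", 1000.0, fault_tolerant=False, core=True),
    Workload("cache-prewarm-scheduler", 400.0, fault_tolerant=True, core=False),
    Workload("internal-audit-tool", 600.0, fault_tolerant=True, core=False),
]
primary, secondary, total = plan_clusters(workloads, spot_discount=0.6)
print([w.name for w in secondary], round(total, 2))
```

Anything that is either core or not fault-tolerant stays on the primary cluster, which keeps the rule conservative: a workload has to satisfy both conditions before it is exposed to preemption.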

Provision Infra for on-demand usage

Invest early in automation that creates infrastructure on demand for lower development environments. Create schedules to tear down lower-environment clusters after the working day or over weekends, and provide hooks for developers to recreate a cluster when needed. Automate flash sale events by pre-warming according to your traffic trends, and use an event-driven architecture to scale the minimum replica count down based on requests per second on the platform. These are small, simple steps that have proven very impactful for cost savings.
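Both ideas reduce to small decision functions that an automation job could call. This is a minimal sketch under assumed policies: the working hours (09:00 to 19:00 on weekdays), the per-replica capacity, and the function names are all hypothetical, and the actual cluster creation or teardown calls are deliberately left out.

```python
import math
from datetime import datetime, time


def dev_cluster_should_run(now, start=time(9, 0), end=time(19, 0)):
    """Return True if a lower-environment cluster should be up at `now`.

    Assumed policy: up on weekdays between `start` and `end`, torn down
    otherwise. Developers can still recreate it on demand through a hook.
    """
    is_weekday = now.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and start <= now.time() < end


def min_replicas_for(rps, rps_per_replica, floor=1):
    """Derive the minimum replica count from observed requests per second."""
    if rps_per_replica <= 0:
        raise ValueError("rps_per_replica must be positive")
    return max(floor, math.ceil(rps / rps_per_replica))


print(dev_cluster_should_run(datetime(2023, 6, 15, 10, 30)))  # a Thursday morning
print(dev_cluster_should_run(datetime(2023, 6, 17, 10, 30)))  # a Saturday
print(min_replicas_for(rps=950, rps_per_replica=100))
```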

Logging Cost

Surprisingly, excessive logging is a common mistake in many organizations and is generally one of the major cost burners. Log the essentials and log actionable insights. GCP, like competing platforms, offers an important feature, the exclusion filter, that lets you log only what you need. Use it effectively and you can expect a significant reduction in logging cost, up to 90%.
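The same principle can be applied inside the application before a log line ever leaves the process. The sketch below is an application-side analogue using Python's standard `logging` module, not the GCP exclusion filter itself; the logger name, the excluded substring, and the `ListHandler` helper are hypothetical and exist only to make the effect visible.

```python
import logging


class ExcludeNoise(logging.Filter):
    """Drop records that are not actionable before they are emitted."""

    def __init__(self, excluded_substrings, min_level=logging.INFO):
        super().__init__()
        self.excluded_substrings = tuple(excluded_substrings)
        self.min_level = min_level

    def filter(self, record):
        if record.levelno < self.min_level:
            return False
        message = record.getMessage()
        return not any(s in message for s in self.excluded_substrings)


class ListHandler(logging.Handler):
    """Collects emitted messages in memory so the result is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger("search-api-demo")
logger.setLevel(logging.DEBUG)
logger.propagate = False
handler = ListHandler()
handler.addFilter(ExcludeNoise(["GET /healthz"]))
logger.addHandler(handler)

logger.debug("cache lookup took 3 ms")            # dropped: below INFO
logger.info("GET /healthz 200")                   # dropped: health-check noise
logger.error("fare provider timeout after 5 s")   # kept: actionable
print(handler.messages)
```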

A detailed cheat sheet on optimisation for cost is here.

b. Cost-benefit Ratio

Young organisations often drift away from the purpose of their existence. Technology is a means to solve a business problem. Sometimes, out of sheer enthusiasm, the engineering team ends up building tools in-house instead of buying proven, ready-to-go tools.

Examples: centralised logging, observability, and alerting platforms, or an experimentation platform.

“Review Build VS Buy Decisions”

It is generally advisable to buy proven tools and keep up with evolving industry-grade capabilities out of the box. Unless there is a financial cushion for the opportunity cost, organisational support to productise and monetize technology platforms, and a strong development community around the initiative, building in-house is a recipe for failure. Keep a check on the cost-benefit ratio before you drown.
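One way to keep that check honest is to put both options through the same arithmetic. The sketch below is a simplification with entirely hypothetical figures and function names; it expresses the ratio as benefit divided by cost over a fixed horizon and ignores discounting and opportunity cost.

```python
def total_cost_of_build(engineers, build_months, cost_per_engineer_month,
                        yearly_maintenance, horizon_years):
    """In-house cost over the horizon: build effort plus ongoing upkeep."""
    build_effort = engineers * build_months * cost_per_engineer_month
    return build_effort + yearly_maintenance * horizon_years


def total_cost_of_buy(yearly_licence, integration_cost, horizon_years):
    """Vendor cost over the horizon: licence fees plus one-off integration."""
    return yearly_licence * horizon_years + integration_cost


def benefit_to_cost(benefit, cost):
    """Ratio above 1.0 means the benefit exceeds the cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return benefit / cost


# Hypothetical numbers for a three-year horizon.
benefit = 900_000
build = total_cost_of_build(engineers=4, build_months=9,
                            cost_per_engineer_month=10_000,
                            yearly_maintenance=60_000, horizon_years=3)
buy = total_cost_of_buy(yearly_licence=100_000, integration_cost=50_000,
                        horizon_years=3)
print("build:", round(benefit_to_cost(benefit, build), 2))
print("buy:", round(benefit_to_cost(benefit, buy), 2))
```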

c. Cost of delay

The Spotify tribe model is one of the best scalable frameworks to optimize for cost and manage a high talent density. Empowered, autonomous teams that can make data-based decisions reduce the cost of delay (CoD), a key lean management metric that measures how much of a product's value is lost the longer it takes to bring a capability to market.
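In its simplest form, the cost of delay is the value a capability would generate per unit of time multiplied by how long it is held up. The sketch below assumes a flat weekly value, which is a simplification, and all figures are hypothetical; it compares a decision that waits on an approval chain with one made by an autonomous team.

```python
def cost_of_delay(weekly_value, weeks_of_delay):
    """Value lost by shipping late, assuming a constant weekly value."""
    if weekly_value < 0 or weeks_of_delay < 0:
        raise ValueError("inputs must be non-negative")
    return weekly_value * weeks_of_delay


weekly_value = 20_000  # hypothetical value of the capability per week
approval_chain = cost_of_delay(weekly_value, weeks_of_delay=4)
autonomous_team = cost_of_delay(weekly_value, weeks_of_delay=1)
print(approval_chain - autonomous_team)  # value saved by deciding faster
```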

“Give intent to your team as opposed to instructions”

Not many will agree with this strategy; however, we have always kept the team well-oiled and at least 20% below the target resourcing needed. This helps avoid retrenchment when times are tough. It also provides space for organic growth for individuals who are on an exponential trajectory.

If you would like deeper insights, the video on Greatness by David Marquet is highly recommended.

Performance

In our experience of building the super app product from scratch, one of the key KPIs has always been performance. Consider a typical shopping funnel:

“Experiments have shown that faster page loads lead to a better conversion rate”

Across the funnel stages, search experience performance accounts for 60% to 70% of the drop in conversion. Here is what works to tune performance on the platform:

Soft warm-up of the cache with a fixed TTL to ensure repeated user searches are served faster.
Dynamic TTL cache relevant to your business. Example: in a flight engineering platform on the super app, the lead time to book serves as the basis for defining the cache TTL dynamically and smartly.
Optimize the cache hit ratio by moving certain logic to post-filtering layers.
Configure schedulers to pre-warm top searches.
Design the scheduler as a platform where the frequency of schedulers is configurable by various business criteria, which helps warm the search experience efficiently.
Tune for look-to-book. Example: create config-driven schemes to define criteria for your schedulers.
Promo fare gardens. When you run hot sale events, preload promo fare gardens and swap them into the cache during the sale hour to manage the peak surge.
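The dynamic TTL idea from the list above can be sketched as a lookup from lead time to TTL. The tiers and values below are hypothetical, and they rest on the assumption that fares for near-term departures change more often and therefore deserve a shorter TTL; the right tiers depend on your own booking data.

```python
from datetime import date

# Hypothetical tiers: (maximum lead time in days, cache TTL in seconds).
TTL_TIERS = [
    (3, 5 * 60),          # departing within 3 days: keep for 5 minutes
    (14, 30 * 60),        # within 2 weeks: keep for 30 minutes
    (60, 2 * 60 * 60),    # within 2 months: keep for 2 hours
]
DEFAULT_TTL = 6 * 60 * 60  # far-out departures: keep for 6 hours


def cache_ttl_seconds(search_date, departure_date):
    """Pick a cache TTL from the lead time between search and departure."""
    lead_days = (departure_date - search_date).days
    if lead_days < 0:
        raise ValueError("departure date is before the search date")
    for max_lead_days, ttl in TTL_TIERS:
        if lead_days <= max_lead_days:
            return ttl
    return DEFAULT_TTL


today = date(2023, 6, 15)
for departure in (date(2023, 6, 16), date(2023, 6, 25), date(2024, 1, 1)):
    print(departure, cache_ttl_seconds(today, departure))
```

Keeping the tiers in a plain table makes them easy to move into configuration, in line with the config-driven approach described for the schedulers.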

There will be a detailed blog on the above topic and the link will be updated here. The initiatives described above helped us bring down the overall API response time on search by 40%.

UI/UX tuning is equally important for great page speeds. The top performance initiatives are listed below:

Faster content loading on mobile by using ETags - Blog
Here is a cheat sheet to tune the UI experience for performance - Blog
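For readers unfamiliar with ETags, the mechanism is a conditional request: the server tags each response with a validator, and the client sends it back in `If-None-Match` so that unchanged content can be answered with an empty 304 instead of the full payload. The sketch below is a generic, framework-free illustration of that exchange, not the implementation from the linked blog; the function names and sample payload are hypothetical.

```python
import hashlib


def make_etag(body: bytes) -> str:
    """Build a quoted ETag from a hash of the response body."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def respond(body: bytes, if_none_match=None):
    """Return (status, headers, payload) for a GET, honouring If-None-Match."""
    etag = make_etag(body)
    headers = {"ETag": etag}
    if if_none_match is not None:
        candidates = []
        for tag in if_none_match.split(","):
            tag = tag.strip()
            # If-None-Match uses weak comparison, so ignore a W/ prefix.
            candidates.append(tag[2:] if tag.startswith("W/") else tag)
        if "*" in candidates or etag in candidates:
            return 304, headers, b""  # client copy is still fresh
    return 200, headers, body


content = b'{"banner": "promo"}'
status1, headers1, payload1 = respond(content)
status2, _, payload2 = respond(content, if_none_match=headers1["ETag"])
print(status1, len(payload1))
print(status2, len(payload2))
```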

Reduce Product Support Cost

As a product evolves over the years, one pain point arises in every company: managing the legacy stack alongside the modern stack. What is modern today will eventually become legacy, depending on the direction of the company, expectations of growth, and the need to keep up with industry practices.

Some of the top challenges encountered:

Managing talent density and developer motivation when teams work across both the legacy and the modern stack in a wide polyglot architecture.
Managing the commercial pressure to continue KLO (keep the lights on) work on legacy, and thereby building up tech debt on the modern platform.
The impact is higher if you are on a PaaS, since services lose support, version upgrades are forced, and many core features are deprecated, putting platform uptime at risk.
With attrition, the challenge of keeping up the domain expertise grows manyfold.

It requires tremendous engineering zeal and resilience to push all stakeholders in the organization to understand these pain points, flatten the tech stack, and bring down product support costs.

“Managing product support cost helps optimize for developer velocity”

Some strategies that can help:

A phased approach to migration helps more than waiting for a full cutover of a platform.
Choose tech stacks with the available talent pool in mind, and future-proof the architecture for a cycle of at least three to five years. Engineering teams often choose a cutting-edge tech stack to up-skill, but one not yet proven in the market can turn into tech debt sooner than expected.
Identify patterns and automate. We automated consolidator onboarding, which brought our integration cycles down from 6 weeks to 2 days. We identified patterns in the customer support requests reaching engineering and turned those user journeys into self-service options. All these initiatives eventually free up engineering bandwidth and move the effort to what matters most to the business.

Reliability

When you're moving towards exponential growth, the reliability of the platform plays a significant role. Some of the top initiatives that help establish a sound reliability KPI are as follows:

Define dashboards with the four golden monitoring signals as your theme: Traffic, Latency, Errors and Saturation. Keep the signals and insights relevant to the maturity of the team.
Create playbooks for addressing alerts, so that the support team can refer to them and manage a crisis on their own as quickly as possible.
Have a circuit breaker pattern defined to ensure you can recover automatically and self-heal with little or no manual intervention.
Perform chaos testing regularly to be prepared for the unexpected :-)
Establish service-level agreements (SLAs) with your consumers and internal stakeholders to set expectations for uptime and performance.
Automate as many processes as possible to reduce human errors and ensure consistency in operations. This includes automated testing, deployment, and scaling processes.
Invest in scalability, which is critical for ensuring reliability as a platform grows. This includes scalable infrastructure, architectures, and design patterns.
Monitor the performance and health of all components in your system, including databases, caches, and third-party services, to identify and resolve issues proactively.
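As an illustration of the circuit breaker item in the list above, here is a minimal, single-threaded sketch. It is a generic version of the pattern, not our production implementation; the class names, thresholds, and the injectable clock are assumptions made to keep the example small and easy to exercise.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


# Demo with a fake clock so no real waiting is needed.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0,
                         clock=lambda: now[0])


def failing_dependency():
    raise TimeoutError("dependency down")


for _ in range(2):
    try:
        breaker.call(failing_dependency)
    except TimeoutError:
        pass

try:
    breaker.call(lambda: "ok")
    rejected = False
except CircuitOpenError:
    rejected = True  # failed fast without touching the dependency

now[0] = 11.0  # reset timeout has passed
recovered = breaker.call(lambda: "ok")
print(rejected, recovered)
```

The half-open trial call is what gives the self-healing behaviour: once the dependency recovers, the first successful call closes the circuit without anyone intervening.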

Synergy in the organisation

Breaking silos in an organisation is a challenging and daunting task. There is resistance to change and resistance to innovation, as is always seen when there is rapid demand to support growth. The way an organization's culture enables innovation to thrive and knowledge to flow seamlessly will define its collective capability for exponential growth.

“There is always more learning from the valley than the mountain top”

Some ways to instil synergy in the organization that work well are as follows:

Ensure every initiative is well documented in cookbooks that can be reused across the organization. This helps avoid reinventing the wheel and standardises practices.
Always think and design as a platform. This goes a long way in improving time to market. Example: how we built the COSMOS platform to help catapult the launch of many lines of business - Link
Socialize the engineering wins to all stakeholders with data and well-defined success metrics.
Set stages for your team to shine and show their hard work. Some may consider this bragging; however, it is a huge motivation for budding talent to talk about their key accomplishments. Proudly sunshine failures as well. There is always more learning from the valley than the mountain top. Read more here on why this is so important - #IamRemarkable
Encourage teams to write blogs, conduct external webinars, attend summits, and contribute back to the community. This is the best way to sharpen competency and push the benchmark higher and higher.
Influence stakeholders in a positive way when you face resistance. Never get personal, and stay focussed on the objective. Eventually, persistence will get you the change that you require.
Exercise patience when you are trying to sell innovation to non-technical leaders. Try a simple analogy that puts the concepts across in layman's terms.
Learn from the salmon that swims upstream and navigates in such a way as to not get eaten by the bears. Stay fearless of politics or retaliation, keep in mind the larger objective that the organization can always win as one, and do what it takes to foster synergy.

Peder Enhorning

CEO and founder of multiple startups. KPI Karta helps you visualize your strategy and informs what needs to get done.

1 yr

Yes, KPIs are great but they can cause a great deal of consternation for companies and their teams. Numbers are being asked of people that don’t seem to make much sense. That’s because they are reused from the past, or chosen from someone’s top-10 list. But they don’t then align with overall goals, resulting in KPIs that encourage behaviour inconsistent with what the organization is trying to accomplish. And so KPIs cause frustration. Try using our new tool, KPI Karta, which builds hierarchical, color-coded maps showing how work is directly connected to your goal, producing more effective KPIs and then lets you track them in real time.

Ruhina Yeasmin

User Experience Designer | Ex-Motorola Solutions | Ex-AirAsia | Ex-Amazon

1 yr

This is so interesting Naveen . Lot of information to learn. Thank you for sharing this.

Shanti Kurupati

Director Product Development at Intuit

1 yr

Great insights Naveen S.R !
