
DevOps: Technology Operations in the Bank

Amit Kumar

CTO | Founder | Ex-McKinsey | Ex-GoJek | Ex-Bank Jago | Agile Craftsperson

Published: February 15, 2022

This is the fifth and final article in a series of five describing my journey to build an Engineering setup in Financial Services. Read the preface.

Building and scaling Engineering team: Principles and practices adopted to build the product and engineering team
Technology choices & Engineering practices: We took an outward-in approach to establish the foundation of the technology stack, architecture/design choices & engineering practices
Platform Engineering: A team responsible for architecture reference, engineering tools/practices, cross-cutting concerns, DevOps processes, and playbooks. It provides platform-as-a-service to the internal product teams.
Infrastructure: The core principle was IaC (Infrastructure as Code). We built a service-oriented structure with Infrastructure as a Service
Operations - keep the lights ON aka DevOps: Traditionally, financial services have a dedicated operations team. Following the DevOps principle, we adopted: You build it, you run it!

Disclaimer: the scope of this article is limited to System & Application Operations and not all aspects of Operations in a Bank - a much broader subject, to be covered in the future.

Implementing a DevOps mindset in a banking setup faces resistance from risk, governance, and regulatory functions. Because of heavy regulation, these functions tend to create a siloed corporate culture. Introducing DevOps later requires cultural change; hence, based on our past experience (building/transforming other banks), we wanted to sow the seed of DevOps from the beginning.

The term DevOps doesn’t need an introduction.

However, in a financial services setup, we made sure regulatory & audit requirements were taken care of and implemented accordingly (to avoid future hassles). Our DevOps setup in DKatalis was driven by our core guiding principle: Build X-as-a-service. We had clear domain boundaries for squads:

registration & sign-ups
payments
transfers
customer & accounts
etc

Squads were responsible for managing operational support for each of these domain boundaries.

Guiding Principles and practices followed

Create changelogs: make sure every deployment to production has a related changelog document with details of 1) steps and version of the artifact deployed (applicable to both the deployment train and hotfixes); 2) steps and version of any data changes (applicable to both the deployment train and hotfixes); 3) documented proof of sanity testing of critical user journeys in staging; 4) performance and security test results (performance testing is an automated pipeline; security testing is still conducted manually); 5) approval from the signing authority in the bank.
Set up PagerDuty alerts based on journey and domain boundaries: PagerDuty alerts are to be generated based on domain boundaries. Each squad will have a ROTA in pairs for its own domain.
Monitor and generate alerts for key customer journeys: each squad is tasked with making sure monitoring metrics associated with customer journeys are captured and alerts are generated as per the PagerDuty setup.
Log monitoring: define clear error codes (business, application, system). Set up alerts and dashboards based on error codes.
System/Cloud operations should be managed by Snipers: Snipers is responsible for the operational health of the Infrastructure. The practices of PD and log monitoring apply to them as well.
Document incidents and capture downtimes: all incidents on production should be captured using a post-mortem template. Have a weekly post-mortem cadence to discuss the cause of each incident.
Single monitoring dashboard: create a single monitoring dashboard. This has been one of the challenges since we have different tools for APM, infrastructure, and network monitoring. It is not ideal, but we are pulling real-time data from these monitoring tools into BigQuery through data pipelines and generating reports using Data Studio.
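The changelog principle above lends itself to an automated completeness check before a release is approved. The following is only a minimal sketch of that idea, not the tooling we used: the `Changelog` class, its field names and the `missing_items` helper are all hypothetical, and it assumes data changes are optional while every other item is mandatory.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Changelog:
    """Hypothetical record for one production deployment (train or hotfix)."""
    artifact_version: str = ""
    deployment_steps: List[str] = field(default_factory=list)
    data_change_steps: List[str] = field(default_factory=list)  # assumed optional
    sanity_test_proof: str = ""   # evidence of staging sanity tests
    performance_report: str = ""  # output of the automated pipeline
    security_report: str = ""     # result of the manual security test
    approved_by: str = ""         # signing authority in the bank


def missing_items(changelog: Changelog) -> List[str]:
    """Return the names of mandatory changelog items that are still empty."""
    required = {
        "artifact_version": changelog.artifact_version,
        "deployment_steps": changelog.deployment_steps,
        "sanity_test_proof": changelog.sanity_test_proof,
        "performance_report": changelog.performance_report,
        "security_report": changelog.security_report,
        "approved_by": changelog.approved_by,
    }
    return [name for name, value in required.items() if not value]
```

A deployment would be blocked while `missing_items` returns a non-empty list.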

Incident Management

As the number of customers grows, it becomes important to have a clear definition of what is considered an Incident on Production.

a planned or an unplanned interruption of a service or services

If we go by the above definition, can we classify a bug that interrupts service as an incident? For example:

if a single customer is not able to log in, is that an incident? Probably not, as it could be an issue specific to that customer.
can we classify the same bug as an incident if 15% of customers are not able to log in? The answer here would be yes.

Hence, it is important to answer the basic question: what can be considered an incident?

We held multiple sessions with Product, Engineering, and Systems teams to outline which customer journeys, and under what circumstances, can be classified as an incident. This helped make sure proper notifications and alerts were configured. It also helped categorise incidents into severity levels: critical, high, medium, low.
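To make the idea concrete, a per-journey rule table can turn "what share of customers is failing this journey" into either a severity level or "not an incident". This is an illustrative sketch only: the threshold values in `LOGIN_RULES` and the `classify_incident` function are hypothetical, not the numbers we agreed on in those sessions.

```python
from typing import List, Optional, Tuple

# Hypothetical thresholds for the login journey:
# (minimum % of customers failing the journey, severity level).
LOGIN_RULES: List[Tuple[float, str]] = [
    (50.0, "critical"),
    (25.0, "high"),
    (10.0, "medium"),
    (1.0, "low"),
]


def classify_incident(failing_pct: float,
                      rules: List[Tuple[float, str]]) -> Optional[str]:
    """Return a severity level, or None when the failure is treated as a bug."""
    # Check the highest threshold first so the most severe match wins.
    for min_pct, severity in sorted(rules, reverse=True):
        if failing_pct >= min_pct:
            return severity
    return None
```

With these assumed thresholds, a single failing customer out of many thousands yields `None`, while 15% of customers failing to log in is classified as an incident.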

An incident can be either planned (scheduled maintenance causing degradation) or unplanned (unexpected degradation or threat). This also includes security incidents.

What’s the impact of an incident?

An incident’s impact is a categorization of the percentage of active users who are likely impacted by the incident. We defined the following buckets:

less than 10%
10% - 25%
25% - 50%
more than 50%
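A small helper can assign an incident to one of these buckets. This is a sketch under stated assumptions: the function name `impact_bucket` is hypothetical, and since the buckets above share their boundary values, it assumes that exactly 10% and 25% fall into the higher bucket while exactly 50% stays in "25% - 50%".

```python
def impact_bucket(affected_users: int, active_users: int) -> str:
    """Map the share of affected active users to one of the four impact buckets."""
    if active_users <= 0:
        raise ValueError("active_users must be positive")
    if not 0 <= affected_users <= active_users:
        raise ValueError("affected_users must be between 0 and active_users")

    pct = 100.0 * affected_users / active_users
    if pct < 10:
        return "less than 10%"
    if pct < 25:
        return "10% - 25%"
    if pct <= 50:  # assumption: exactly 50% belongs to this bucket
        return "25% - 50%"
    return "more than 50%"
```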

During the incident postmortem, a more accurate impact analysis needs to be reported, along with the rationale for how the impact was determined. Considering the multi-stakeholder environment that we have in the bank, it is not easy to have a standard way of assessing impact.

Preparing for production support

Metrics

As explained above, domain boundaries were defined for each squad. Each squad was responsible for capturing the following:

throughput, latency, % of errors (and their breakdown), Apdex score for each API consumed from the services in their boundary
compute resources: CPU & memory utilisation, network & IOPS
other metrics: Kafka lag, # of timeouts & retries

Each squad was also responsible for creating a dashboard with these metrics, linked to the central dashboard.
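Two of the API metrics listed above, the Apdex score and the error percentage, are simple to derive from raw request data. The sketch below uses the standard Apdex formula (satisfied requests plus half of the tolerating ones, divided by all requests, with the tolerating band ending at four times the threshold). The function names, the 500 ms threshold in the usage note and the choice to count only HTTP 5xx responses as errors are assumptions for illustration.

```python
from typing import Sequence


def apdex(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Apdex = (satisfied + tolerating / 2) / total samples."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    satisfied = sum(1 for t in latencies_ms if t <= threshold_ms)
    tolerating = sum(
        1 for t in latencies_ms if threshold_ms < t <= 4 * threshold_ms
    )
    return (satisfied + tolerating / 2) / len(latencies_ms)


def error_percentage(status_codes: Sequence[int]) -> float:
    """Percentage of responses that are server errors (assumed: HTTP 5xx only)."""
    if not status_codes:
        raise ValueError("at least one status code is required")
    errors = sum(1 for code in status_codes if code >= 500)
    return 100.0 * errors / len(status_codes)
```

For example, `apdex([100, 200, 600, 3000], threshold_ms=500)` counts two satisfied, one tolerating and one frustrated request.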

On-call support

ensure every squad member has an account set up in PagerDuty
ensure proper levels are set up in PD for each team
ensure a paired ROTA is set up for each squad
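A paired ROTA can be generated mechanically from the squad roster. The following is a minimal sketch, assuming consecutive two-week shifts and a simple round-robin over the member list; the `paired_rota` function and the member names in the usage note are hypothetical, and a real schedule would be maintained in PagerDuty itself.

```python
from datetime import date, timedelta
from typing import List, Sequence, Tuple

Shift = Tuple[date, date, Tuple[str, str]]


def paired_rota(members: Sequence[str], start: date, shifts: int,
                shift_days: int = 14) -> List[Shift]:
    """Build consecutive on-call shifts, each covered by a pair of squad members."""
    if len(members) < 2:
        raise ValueError("a paired ROTA needs at least two members")

    schedule: List[Shift] = []
    n = len(members)
    for i in range(shifts):
        # Walk the roster two at a time, wrapping around at the end.
        pair = (members[(2 * i) % n], members[(2 * i + 1) % n])
        shift_start = start + timedelta(days=i * shift_days)
        shift_end = shift_start + timedelta(days=shift_days - 1)
        schedule.append((shift_start, shift_end, pair))
    return schedule
```

Calling `paired_rota(["ana", "budi", "citra"], date(2022, 1, 3), shifts=3)` yields three back-to-back two-week shifts in which every member is on call twice.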

Escalation policy

Define a clear escalation policy in PD. Some of the important steps to be covered by each squad:

follow the outlined steps for adding a journeyID to each service API
ensure each customer journey is tagged accordingly with the journeyID (this triggers the rules for generating alerts in PD)
ensure all error codes are captured for the specific journey
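The journeyID tagging described in these steps is what lets an error be routed to the squad that owns the journey. The sketch below shows the routing idea only and does not call PagerDuty. The `Alert` class, the `JOURNEY_OWNERS` mapping, the journey IDs, the error codes and the `"unassigned"` fallback are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Alert:
    journey_id: str
    error_code: str
    squad: str


# Hypothetical mapping of journeyID to the squad owning that domain boundary.
JOURNEY_OWNERS: Dict[str, str] = {
    "login": "registration-signups",
    "fund-transfer": "transfers",
}


def route_error(journey_id: str, error_code: str,
                owners: Dict[str, str],
                fallback: str = "unassigned") -> Alert:
    """Attach the owning squad to an error so the alert reaches the right ROTA."""
    return Alert(journey_id, error_code, owners.get(journey_id, fallback))
```

An error raised on an untagged journey ends up with the fallback owner, which makes missing journeyID tags visible instead of silently dropping the alert.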

Andre Laksmana was instrumental in making sure the escalation policy was templatised and implemented by all squads. As part of regulatory needs (a public bank), the escalation policy needs review from compliance.

Postmortem

Fail. Learn. Grow.

This has been the fundamental philosophy behind our layered support model. We mandated the squads (with domain boundaries) to capture a detailed report, more popularly called an RCA, with the root cause of each incident on production. Some important aspects included:

time and duration of the incident
customer impact
services and infra components impacted
the root cause (and tag the issue with the associated taxonomy, e.g. database)
fix (short-term and long-term). All long-term fixes go into the backlog and their progress is tracked
chronology of events & actions that happened during the period
participants
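The aspects listed above map naturally onto a structured record, which makes reports comparable across squads. This is a sketch of such a template and not our actual one: the `Postmortem` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class Postmortem:
    """Hypothetical RCA record covering the aspects listed above."""
    started_at: datetime
    resolved_at: datetime
    customer_impact: str
    components_impacted: List[str]
    root_cause: str
    taxonomy: str                # e.g. "database"
    short_term_fix: str
    long_term_fix: str           # goes into the backlog and is tracked
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)
    participants: List[str] = field(default_factory=list)

    @property
    def duration(self) -> timedelta:
        """Length of the incident, derived from its start and resolution time."""
        return self.resolved_at - self.started_at
```

Deriving `duration` from the two timestamps avoids recording a duration that disagrees with the chronology.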

It is important to hold sharing sessions on the issues and build a culture of learning. Google describes these as Postmortem sessions. In short, it is:

a learning session that discusses the chronological order of actions taken during the incident

The taxonomy was used to build visuals, which helped build knowledge in our monthly cadence: learning from mistakes.
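The data behind such visuals is just a count of incidents per taxonomy tag. A minimal sketch, assuming each postmortem carries a single tag; the `taxonomy_breakdown` name and every tag other than "database" are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def taxonomy_breakdown(tags: Iterable[str]) -> List[Tuple[str, int]]:
    """Count incidents per root-cause taxonomy tag, most frequent first."""
    return Counter(tags).most_common()
```

For instance, `taxonomy_breakdown(["database", "network", "database"])` ranks "database" first with two incidents.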

Future of Operations in the bank – SRE

Ben Treynor Sloss, Google’s VP for 24/7, coined the term SRE: reliability is critical, a system isn’t very useful if nobody can use it! Hence, SRE is focused on finding ways to improve the design and operation of systems to make them more scalable, more reliable, and more efficient.

The problem we started to face

We built squads with domain boundaries, which meant each squad was responsible for both building and operating. Each squad was doing ROTA in pairs (on 2-week schedules), and the pair on ROTA had the flexibility to choose their working hours for the schedule they were operating. With the increasing number of customers, and with some mundane investigative tasks added in, we realised the need for a squad which I call the first line of defense (it is debatable whether they exactly qualify to be called SRE). The intention was to have the domain squads focus on feature development and operability of the product, while a dedicated squad (cutting across all domains) acts as the first line of defense for production. Eventually, this squad will evolve towards becoming a true SRE team.

Is SRE similar to Platform Engineering?

I often got this question: aren’t Platform Engineers and SREs doing similar things, and why do you need two separate squads? Platform Engineering and SRE exist for different purposes, although both are successors of traditional operations teams and bring the discipline of software engineering to different aspects of operations.

Platform Engineering, as defined here, applies engineering principles to accelerate software delivery

SRE applies engineering principles for the reliability of the product

Additionally, the two squads are built with engineers of different mindsets. SREs thrive on crisis management and system outages, while Platform Engineers are typical software engineers solving complex problems to enable productivity in all aspects of the software delivery lifecycle.

This article brings an end to the 5-article series on building a Financial Services Product organisation with a focus on bleeding-edge Engineering practices.
