Who is responsible for Monitoring?

The age-old question of "who is responsible for monitoring" recurs in so many tech, business, product and incident meetings that it has become a cliché. We can rarely answer it without an argument, largely because today's Microservices have evolved so far from the Monolithic architecture we ran in the past, and in some cases still run.

"Who is responsible for monitoring has now become a cliché"        

In many cases I have observed that this question hardly comes up during product design, sprints or pre-deployment planning, when we could easily set traps, events and so on while identifying the teams that will be responsible for monitoring once the product goes live. I have also observed that, in some cases, what application testers and quality assurance analysts discover while working on the APIs before deployment to production is hardly ever passed on to the monitoring teams to help them monitor or observe performance. This lengthens the monitoring team's learning curve, extends downtime and, in some cases, pulls software engineers into incident calls.

However, consider the Monolithic architecture: all you needed to do was monitor the virtual or physical machine hosting the application (CPU, memory, drives), the application web server and the network, and you were on the spot with monitoring without understanding the intricacies of the Monolithic application. During some incidents, all that was required was a restart of the server or web server when you suspected a load problem, which typically showed up in the resource metrics. Monitoring starts getting complicated with load balancing, since it results in several nodes for the same application. It gets more complicated still when the same Monolithic application is converted into Microservices: the application is broken down into several components, potentially running several instances on the same node and across multiple nodes, with multiple dependencies, using #Kubernetes, #docker, #OpenShift.

When asked at some point during a meeting "who is responsible for monitoring", I said everyone, and further explained that although there is a central monitoring team, it is everyone's responsibility to observe. I illustrated with the following:

  1. The product manager/owner should have a view of how his/her product is performing.
  2. The relationship managers should know how their customers are interacting with the product.
  3. The Architects/Software engineers should have routine check-ins on their software to gather feedback.
  4. The monitoring team should detect and resolve issues earlier.
  5. The list goes on.

To improve effectiveness in responding to incidents, or to improve the performance of Microservices, there has to be a shift from just monitoring to observability.

It is worth noting that observability is a journey; however, the steps below are key to improving it:

Steps to Improve Observability

  1. Service Level Agreements (SLAs), Service Level Objectives (SLOs) and Service Level Indicators (SLIs) must be set up.
  2. Logs must be aggregated centrally.
  3. Baselines must be set based on the SLAs, SLOs and SLIs.
  4. Targets must be set to continuously improve on the SLAs and drive down errors.
  5. The monitoring teams must set intelligent alerts to avoid alert burnout. Are the alerts timely? Are the alerts relevant?
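As a rough illustration of steps 1, 3 and 4, here is a minimal Python sketch of turning raw request counts into an availability SLI, comparing it with an SLO, and reporting how much of the error budget has been used. The names `SloReport` and `availability_report`, the request counts and the 99.9% target are all assumptions made up for this example; they do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float              # measured fraction of good requests
    slo_target: float       # the objective, e.g. 0.999 for "three nines"
    error_budget: float     # allowed fraction of bad requests (1 - slo_target)
    budget_consumed: float  # share of the error budget used (1.0 means all of it)

    @property
    def slo_met(self) -> bool:
        return self.sli >= self.slo_target


def availability_report(good: int, total: int, slo_target: float) -> SloReport:
    """Compute an availability SLI and the share of error budget it has consumed."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    sli = good / total
    error_budget = 1 - slo_target
    budget_consumed = (1 - sli) / error_budget
    return SloReport(sli, slo_target, error_budget, budget_consumed)


# Hypothetical numbers: 100,000 requests in the window, 50 of them failed.
report = availability_report(good=99_950, total=100_000, slo_target=0.999)
print(f"SLI={report.sli:.4%} met={report.slo_met} budget used={report.budget_consumed:.0%}")
```

Tracking the share of budget consumed, rather than only a pass/fail flag, gives a baseline that can be tightened over time as targets improve.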

Alert burnout is the experience of many teams: they are unable to act on critical issues because of the sheer volume of alerts.
Recall "the Fokker 50 crash crew ignored multiple alerts", possibly due to alert burnout.
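One common way to cut alert noise is to page only when a check has failed several times in a row, and then to stay quiet for a cooldown period. The sketch below shows that idea in plain Python; the `AlertGate` class, its thresholds and the sample data are hypothetical and only meant to illustrate the technique, not to describe any specific alerting product.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AlertGate:
    """Fire only after consecutive failures, then suppress repeats during a cooldown."""
    failures_required: int = 3     # consecutive failing evaluations before alerting
    cooldown_seconds: float = 900  # minimum gap between two alerts for the same check
    _streak: int = field(default=0, init=False)
    _last_fired: Optional[float] = field(default=None, init=False)

    def evaluate(self, failing: bool, now: float) -> bool:
        """Return True if an alert should be sent for this evaluation."""
        if not failing:
            self._streak = 0  # a healthy sample resets the streak
            return False
        self._streak += 1
        if self._streak < self.failures_required:
            return False      # not persistent enough yet
        if self._last_fired is not None and now - self._last_fired < self.cooldown_seconds:
            return False      # already alerted recently
        self._last_fired = now
        return True


# Hypothetical check evaluated once a minute; True means the check failed.
gate = AlertGate(failures_required=3, cooldown_seconds=600)
samples = [True, True, False, True, True, True, True]
fired = [gate.evaluate(failing, now=i * 60) for i, failing in enumerate(samples)]
print(fired)
```

In this run the short blip at the start never alerts, the sustained failure alerts once, and the following failing sample is suppressed by the cooldown.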

6. Intelligent and dynamic health endpoints for services.

7. Continuous product testing even while in production.

“If you don’t like testing your product, most likely your customers won’t like to test it either.” 
                             Anonymous                           

8. Continuous dashboard creation to make detection easy even when there is no incident.

9. Transfer of knowledge about the Microservice from testers/QAs to the monitoring team before deployment to production, enshrined in dashboards.
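To make step 6 more concrete, here is a minimal sketch of what a "dynamic" health endpoint could compute: instead of always returning OK, it runs a check for each dependency and reports up, degraded or down. The function `health_report`, the dependency names and the choice of HTTP status codes are assumptions for illustration; a real service would plug this into its own web framework.

```python
import json
import time
from typing import Callable, Dict, Set, Tuple

Check = Callable[[], bool]  # a dependency check returns True when healthy


def health_report(checks: Dict[str, Check], critical: Set[str]) -> Tuple[int, str]:
    """Run every dependency check and return (HTTP status code, JSON body)."""
    results = {}
    for name, check in checks.items():
        started = time.monotonic()
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that raises counts as unhealthy
        latency_ms = round((time.monotonic() - started) * 1000, 2)
        results[name] = {"ok": ok, "latency_ms": latency_ms}

    failed = [name for name, result in results.items() if not result["ok"]]
    if any(name in critical for name in failed):
        status, code = "down", 503      # a critical dependency is failing
    elif failed:
        status, code = "degraded", 200  # still serving, but worth a look
    else:
        status, code = "up", 200
    return code, json.dumps({"status": status, "checks": results})


# Hypothetical dependencies: the database is critical, the cache is not.
code, body = health_report(
    checks={"database": lambda: True, "cache": lambda: False},
    critical={"database"},
)
print(code, body)
```

Reporting per-dependency results and latency gives the monitoring team something to chart on a dashboard, which also supports step 8.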

Imagine a scenario         


Summary

In my experience of reviewing incidents, a few causes stick out easily: knowledge gaps, late discovery and similar clichés. You can often predict them, and you may wonder why the same ones recur across so many incidents.

However, it is important to note that while an incident is ongoing, with calls blaring and emails piling up with customer complaints, what matters is having a team whose mindset has shifted from regular monitoring to observing the entire value chain: following the complaint, the logs, the event timestamps and the hypotheses, while subsequently creating dynamic dashboards that can potentially solve problems before the customer notices any downtime.
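One simple way to "follow the complaint" through logs and timestamps is to pull every event, across services, that falls in a window around the time the customer reported the problem. The sketch below assumes events are already available as `(timestamp, service, message)` tuples; the `events_around` helper, the window sizes and the sample events are all made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

Event = Tuple[datetime, str, str]  # (timestamp, service, message)


def events_around(
    events: Iterable[Event],
    complaint_time: datetime,
    before: timedelta = timedelta(minutes=5),
    after: timedelta = timedelta(minutes=1),
) -> List[Event]:
    """Return events from all services near the complaint time, oldest first."""
    start, end = complaint_time - before, complaint_time + after
    in_window = [event for event in events if start <= event[0] <= end]
    return sorted(in_window, key=lambda event: event[0])


# Hypothetical aggregated events and a complaint received at 10:30.
complaint = datetime(2024, 1, 1, 10, 30)
events = [
    (datetime(2024, 1, 1, 10, 10), "gateway", "deploy finished"),
    (datetime(2024, 1, 1, 10, 27), "payments", "connection pool exhausted"),
    (datetime(2024, 1, 1, 10, 26), "database", "slow query warning"),
    (datetime(2024, 1, 1, 10, 45), "gateway", "traffic back to normal"),
]
for ts, service, message in events_around(events, complaint):
    print(ts.isoformat(), service, message)
```

Reading the events in time order across services makes it easier to form and test a hypothesis about which component failed first.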



Eseosa Otasowie

Software & Cloud Infrastructure Engineer

3 years

Nice article Patrick on this particular question. Brings home memories in LP. To drop my 2 cents on the matter. Better adoption of a DevOps methodology helps make monitoring more useful and effective. The alerts should be channeled into a pipeline for service improvement.

Fatade Olufemi

Chief Technology Officer at Jehoidatech

3 years

Just as there is a team for a task, there should be a team for monitoring also. Monitoring is a task on its own and someone should be responsible for not tracking logs before disaster occurs. Over the years, there is an alert or log that is triggered before a disaster occurs . The question is who should be responsible? While troubleshooting mail issues on users' workstation, I have seen many users who deliberately configured deletion or moved monitoring alert to another folder.

BENJAMIN I.

Solutions Architect | DevOps & Cloud Engineer | VMware Tanzu Specialist (TKGI) | Containerization |

3 years

Well done Patrick for a well thought through article, however I do agree with you very strongly that monitoring should be a job for all as well as security is a job for all. I think monitoring should be a collaborative task that we all should be involved in with a high sense of ownership and developing a good responsive approach as well. Monitoring is every day life that we live just like we keep track of events as they unfold in our daily activities, so also we should keep track of event being generated in our environments. Yes alerts can be overwhelming if not properly filtered to report the right metrices but we owe ourselves that sense of responsibility to be intelligently aware of the events / incidents that happens in our day to day work life as Manager, product owners, Architects and NOC support team. No one drives his car to and fro without keeping his eyes on the dashboard at certain intervals to monitor the fuel level, temperature of the car etc.
