Monitoring Systems with Prometheus - Introduction

Hello everyone! After months of work, here is the first edition of this newsletter. Let's start with an absolutely mind-blowing piece of architecture: how to monitor your distributed system with Prometheus.

Prometheus was inspired by Google's Borgmon, which was (and partially still is) used within Google to monitor all its critical production services using a pull-based approach.

Pull Based Approach?

Yes, you heard that right. Prometheus uses a pull-based approach rather than a push-based one. And if you think pull can't work, remember that Borgmon (Google's monitoring system) runs in a global environment with tens of datacenters and millions of machines, so you can hardly say that pull doesn't scale.

How does Prometheus's System Work?

The diagram below shows a full-scale Prometheus setup as it runs in a real production workload, along with its many different components!

[Full Prometheus architecture diagram, sourced from the Prometheus documentation]

Let's start small and break down the actual architecture step by step -

If you look at the diagram, the smallest parts of the architecture you can identify are the following components -

  • Prometheus Server
  • A TimeSeries database used by Prometheus (TSDB)
  • A retrieval process which pings the targets and pulls their metrics.
  • Jobs / exporters, known as Prometheus targets.

Let's talk about the basic system

As we saw above, this minimal set of components is all you need to understand Prometheus on Day 0.

[Diagram: Simple Architecture]

As you can see, the Prometheus server pulls the data by polling multiple targets -- the services you want to monitor. Each of these services must expose a /metrics API endpoint that serves its metrics in the Prometheus exposition format.

You can easily add a middleware to enable Prometheus metrics in any framework you work in -- Flask, Spring, FastAPI, Django, etc.
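
As a minimal sketch of what that looks like in Python, here is a Flask service instrumented with the official prometheus_client library (the port, route and metric name are illustrative assumptions, not something prescribed by Prometheus):

```python
# A minimal Flask service exposing Prometheus metrics at /metrics.
# Requires: pip install flask prometheus_client
from flask import Flask
from prometheus_client import Counter, generate_latest, CONTENT_TYPE_LATEST

app = Flask(__name__)

# Example counter; a real service would also track latency, errors, etc.
REQUEST_COUNT = Counter("http_requests_total", "Total HTTP requests served")

@app.route("/")
def index():
    REQUEST_COUNT.inc()  # increment on every request
    return "hello"

@app.route("/metrics")
def metrics():
    # Serve all registered metrics in the Prometheus exposition format
    return generate_latest(), 200, {"Content-Type": CONTENT_TYPE_LATEST}

if __name__ == "__main__":
    app.run(port=8000)
```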

Once your targets (running service instances) expose these metrics, Prometheus runs a scrape job every X seconds (configurable) to pull the metrics and store them in its own time-series database.
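
The scrape interval and the list of targets live in Prometheus's configuration file. A minimal sketch of a prometheus.yml for the setup above could look like this (the job name, interval and target addresses are illustrative assumptions):

```yaml
# prometheus.yml - minimal scrape configuration (sketch)
global:
  scrape_interval: 5s        # how often Prometheus pulls /metrics from each target

scrape_configs:
  - job_name: "backend"      # label attached to every series scraped by this job
    metrics_path: /metrics   # default path, shown here for clarity
    static_configs:
      - targets:
          - "backend-1:8000"
          - "backend-2:8000"
          - "backend-3:8000"
```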

Now that you have the metrics -- such as uptime, CPU usage and RAM usage -- you can monitor them, query them and add alerting on top of Prometheus's UI, and quickly accomplish your goals!
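
Alerting, for example, is driven by rule files that Prometheus evaluates on a schedule. Here is a sketch of a rule that fires when a backend instance stops reporting (the file name, job label, threshold and severity are illustrative assumptions; the `up` metric itself is generated by Prometheus for every scrape):

```yaml
# alert_rules.yml - referenced from prometheus.yml under `rule_files` (sketch)
groups:
  - name: backend-alerts
    rules:
      - alert: BackendInstanceDown
        expr: up{job="backend"} == 0   # `up` is 1 when the last scrape succeeded, 0 otherwise
        for: 2m                        # only fire if the instance stays down for 2 minutes
        labels:
          severity: critical
        annotations:
          summary: "Backend instance {{ $labels.instance }} is down"
```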

An Example - Calculating Cost with Exact Monitoring

You can track how much cost you are incurring by querying the uptime of, say, your backend service: simply ask Prometheus to give you the total uptime across your different backend (Docker) containers over the last month.


Let's say you have Container 1 running for 2 days, Container 2 running for 5 days and Container 3 running constantly for all 30 days of the month.

Prometheus can look at the "health check" data it polled while scraping (let's say every 5 seconds) and report that -

  • Container 1 had 34,560 successful health pings (1 for every 5 seconds it was up). That is 34,560 * 5 seconds == 2 days of uptime.
  • Container 2 had 86,400 successful health pings, which is 5 days' worth of uptime.
  • Container 3 had 518,400 successful health pings, which is 30 days' worth of uptime.

Now, your total uptime for your backend service =

  • 34,560 + 86,400 + 518,400 health pings
  • or 639,360 health pings
  • or (639,360 * 5s) = 3,196,800 seconds

Now you can easily calculate your total cost. Let's say AWS charges $0.01 for every 60 seconds of runtime, given some CPU/RAM configuration:

(3,196,800 / 60) * $0.01 ≈ $532.80 per month
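
Here is a quick sketch of how you might automate this against Prometheus's HTTP query API (the Prometheus URL, job label, scrape interval and price are illustrative assumptions that would have to match your own setup):

```python
# Compute approximate uptime cost from Prometheus's HTTP API (sketch).
# Requires: pip install requests
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumed Prometheus address
SCRAPE_INTERVAL_S = 5                     # must match the configured scrape_interval
PRICE_PER_MINUTE = 0.01                   # assumed $0.01 per 60 seconds of runtime

# `up` is 1 for every successful scrape, so summing its samples over 30 days
# counts the successful "health pings" across all backend containers.
query = 'sum(sum_over_time(up{job="backend"}[30d]))'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
resp.raise_for_status()
result = resp.json()["data"]["result"]

total_pings = float(result[0]["value"][1]) if result else 0.0
total_uptime_s = total_pings * SCRAPE_INTERVAL_S
cost = (total_uptime_s / 60) * PRICE_PER_MINUTE

print(f"Total uptime: {total_uptime_s:.0f}s, estimated cost: ${cost:.2f}")
```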

Does Prometheus Scale?

For sure! Obviously, a single Prometheus server cannot be called fully reliable, nor can it scale when you have millions of machines / servers / containers running. In a future edition we will see how to scale Prometheus (a pull-based system) for a fully distributed setup, and how well it does.

Don't forget to subscribe to this newsletter so that you don't miss out on the next edition! If you liked this article, show your support by liking/commenting on this post!
