Monitoring Systems with Prometheus - Introduction

Hello everyone! After months of work, here is the first edition of this newsletter. Let's start with an absolutely mind-blowing piece of architecture: how to monitor your distributed system with Prometheus.

Prometheus was inspired by Google's Borgmon, which was (and partially still is) used within Google to monitor all its critical production services using a pull-based approach.

Pull Based Approach?

Yes, you heard that right. Prometheus uses a pull-based approach rather than a push-based one. And if you think pull can't work, remember that Borgmon (Google's monitoring system) runs in a global environment with tens of datacenters and millions of machines, so you can hardly say that pull doesn't scale.

How does Prometheus's System Work?

The diagram below shows a full-scale Prometheus setup as it runs in a real production workload, along with its many different components!

[Full Prometheus architecture diagram, sourced from the Prometheus documentation]

Let's start small and break down the actual architecture step by step -

If you look at the diagram, the smallest parts of the architecture you can identify are the following components -

  • Prometheus Server
  • A TimeSeries database used by Prometheus (TSDB)
  • A retrieval process which pings the targets and pulls their metrics.
  • Jobs / exporters, known as Prometheus targets.

Let's talk about the basic system

As we saw above, this minimal set of components is all you need to understand Prometheus on Day 0.

[Diagram: Simple Architecture]

As you can see, the Prometheus server pulls the data by polling multiple targets -- the services you want to monitor. Each of these services must expose a /metrics API endpoint that serves its metrics in the Prometheus exposition format.

You can easily add a middleware to enable Prometheus metrics in any framework you work in -- Flask, Spring, FastAPI, Django, etc.
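
As a minimal sketch of what that looks like in Python, here is a Flask service instrumented with the official prometheus_client library (the port, route and metric name are illustrative assumptions, not something prescribed by Prometheus):

```python
# A minimal Flask service exposing Prometheus metrics at /metrics.
# Requires: pip install flask prometheus_client
from flask import Flask
from prometheus_client import Counter, generate_latest, CONTENT_TYPE_LATEST

app = Flask(__name__)

# Example counter; a real service would also track latency, errors, etc.
REQUEST_COUNT = Counter("http_requests_total", "Total HTTP requests served")

@app.route("/")
def index():
    REQUEST_COUNT.inc()  # increment on every request
    return "hello"

@app.route("/metrics")
def metrics():
    # Serve all registered metrics in the Prometheus exposition format
    return generate_latest(), 200, {"Content-Type": CONTENT_TYPE_LATEST}

if __name__ == "__main__":
    app.run(port=8000)
```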

Once your targets (running service instances) expose these metrics, Prometheus runs a scrape job every X seconds (configurable) to pull the metrics and store them in its own time-series database.
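
The scrape interval and the list of targets live in Prometheus's configuration file. A minimal sketch of a prometheus.yml for the setup above could look like this (the job name, interval and target addresses are illustrative assumptions):

```yaml
# prometheus.yml - minimal scrape configuration (sketch)
global:
  scrape_interval: 5s        # how often Prometheus pulls /metrics from each target

scrape_configs:
  - job_name: "backend"      # label attached to every series scraped by this job
    metrics_path: /metrics   # default path, shown here for clarity
    static_configs:
      - targets:
          - "backend-1:8000"
          - "backend-2:8000"
          - "backend-3:8000"
```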

Now that you have the metrics -- such as uptime, CPU usage and RAM usage -- you can monitor them, query them and add alerting on top of Prometheus's UI, and quickly accomplish your goals!
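
Alerting, for example, is driven by rule files that Prometheus evaluates on a schedule. Here is a sketch of a rule that fires when a backend instance stops reporting (the file name, job label, threshold and severity are illustrative assumptions; the `up` metric itself is generated by Prometheus for every scrape):

```yaml
# alert_rules.yml - referenced from prometheus.yml under `rule_files` (sketch)
groups:
  - name: backend-alerts
    rules:
      - alert: BackendInstanceDown
        expr: up{job="backend"} == 0   # `up` is 1 when the last scrape succeeded, 0 otherwise
        for: 2m                        # only fire if the instance stays down for 2 minutes
        labels:
          severity: critical
        annotations:
          summary: "Backend instance {{ $labels.instance }} is down"
```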

An Example - Calculating Cost with Exact Monitoring

You can track how much cost you are incurring by querying the uptime of, say, your backend service: simply ask Prometheus to give you the total uptime across your different backend (Docker) containers over the last month.


Let's say you have Container 1 running for 2 days, Container 2 running for 5 days and Container 3 running constantly for all 30 days of the month.

Prometheus can look at the "health check" data it polled while scraping (let's say every 5 seconds) and report that -

  • Container 1 had 34,560 successful health pings (1 for every 5 seconds it was up). That is 34,560 * 5 seconds == 2 days of uptime.
  • Container 2 had 86,400 successful health pings, which is 5 days' worth of uptime.
  • Container 3 had 518,400 successful health pings, which is 30 days' worth of uptime.

Now, your total uptime for your backend service =

  • 34,560 + 86,400 + 518,400 health pings
  • or 639,360 health pings
  • or (639,360 * 5s) = 3,196,800 seconds

Now you can easily calculate your total cost. Let's say AWS charges $0.01 for every 60 seconds of runtime, given some CPU/RAM configuration:

(3,196,800 / 60) * $0.01 ≈ $532.80 per month
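
Here is a quick sketch of how you might automate this against Prometheus's HTTP query API (the Prometheus URL, job label, scrape interval and price are illustrative assumptions that would have to match your own setup):

```python
# Compute approximate uptime cost from Prometheus's HTTP API (sketch).
# Requires: pip install requests
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumed Prometheus address
SCRAPE_INTERVAL_S = 5                     # must match the configured scrape_interval
PRICE_PER_MINUTE = 0.01                   # assumed $0.01 per 60 seconds of runtime

# `up` is 1 for every successful scrape, so summing its samples over 30 days
# counts the successful "health pings" across all backend containers.
query = 'sum(sum_over_time(up{job="backend"}[30d]))'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
resp.raise_for_status()
result = resp.json()["data"]["result"]

total_pings = float(result[0]["value"][1]) if result else 0.0
total_uptime_s = total_pings * SCRAPE_INTERVAL_S
cost = (total_uptime_s / 60) * PRICE_PER_MINUTE

print(f"Total uptime: {total_uptime_s:.0f}s, estimated cost: ${cost:.2f}")
```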

Does Prometheus Scale?

For sure! Obviously, a single Prometheus server cannot be called fully reliable, nor can it scale when you have millions of machines / servers / containers running. In a future edition we will see how to scale Prometheus (a pull-based system) for a fully distributed setup, and how well it does.

Don't forget to subscribe to this newsletter so that you don't miss out on the next edition! If you liked this article, show your support by liking/commenting on this post!
