Monitoring Systems with Prometheus - Introduction
Shrey Batra
CEO @ Cosmocloud | Ex-LinkedIn | Angel Investor | MongoDB Champion | Book Author | Patent Holder (Distributed Algorithms)
Hello everyone! After months of writing, here is the first edition of this newsletter. Let's start with an absolutely mind-blowing architecture: how to monitor your distributed system with Prometheus.
Prometheus was inspired by Google's Borgmon, which was (and partially still is) used within Google to monitor all its critical production services using a pull-based approach.
Pull Based Approach?
Yes, you heard it right. Prometheus uses a pull-based approach rather than a push-based one. Who said a pull-based approach does not work? Borgmon (Google's monitoring system) scales to a global environment with tens of datacenters and millions of machines, so you can hardly say that pull doesn't scale.
How does Prometheus's System Work?
The diagram below shows a full-scale Prometheus setup running an actual production workload, along with its many different components.
Let's start small and break down the architecture step by step.
If you look at the full diagram, the smallest part of the architecture you can pick out is a handful of core components.
Let's talk about the basic system
As we saw above, a set of minimal components is needed to understand Prometheus on Day 0.
As you can see, the Prometheus server pulls data by polling multiple targets -- the services you want to monitor. Each of these services must expose a /metrics HTTP endpoint that returns its metrics in the Prometheus text format, so that Prometheus can read them.
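To make the /metrics idea concrete, here is a minimal sketch in Python using only the standard library. The metric names (app_uptime_seconds, app_requests_total), the port and the helper names are assumptions made up for illustration; in a real service you would more likely use an official Prometheus client library than format the text by hand.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()
REQUESTS_TOTAL = 0  # hypothetical counter, bumped on every handled request


def render_metrics(uptime_seconds: float, requests_total: int) -> str:
    """Render two sample metrics in the Prometheus text exposition format."""
    lines = [
        "# HELP app_uptime_seconds Seconds since the service started.",
        "# TYPE app_uptime_seconds gauge",
        f"app_uptime_seconds {uptime_seconds}",
        "# HELP app_requests_total Total HTTP requests handled.",
        "# TYPE app_requests_total counter",
        f"app_requests_total {requests_total}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves GET /metrics; every other path returns 404."""

    def do_GET(self):
        global REQUESTS_TOTAL
        REQUESTS_TOTAL += 1
        if self.path != "/metrics":
            self.send_response(404)
            self.end_headers()
            return
        body = render_metrics(time.time() - START_TIME, REQUESTS_TOTAL).encode("utf-8")
        self.send_response(200)
        # Content type used for the Prometheus text format.
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, serving metrics on the given (assumed) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and opening http://localhost:8000/metrics should show the plain-text lines that a Prometheus server would pull on each scrape.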
You can easily add a middleware to enable Prometheus metrics in any framework you work in -- Flask, Spring, FastAPI, Django, etc.
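As a rough sketch of what such a middleware could look like, here is a small WSGI wrapper in plain Python. It would fit WSGI frameworks such as Flask or Django; FastAPI is ASGI-based and Spring is Java, so those need their own integrations. The class name MetricsMiddleware and the metric name http_requests_total are assumptions for this example, not part of any library.

```python
import threading
from collections import Counter


class MetricsMiddleware:
    """WSGI middleware: counts requests per path and serves them at /metrics."""

    def __init__(self, app):
        self.app = app                   # the wrapped WSGI application
        self.request_counts = Counter()  # path -> number of requests seen
        self._lock = threading.Lock()    # servers may call us from many threads

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "/")
        if path == "/metrics":
            body = self._render().encode("utf-8")
            start_response("200 OK", [
                ("Content-Type", "text/plain; version=0.0.4; charset=utf-8"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        with self._lock:
            self.request_counts[path] += 1
        return self.app(environ, start_response)

    def _render(self) -> str:
        # Assumption: paths contain no quotes or backslashes; a real
        # implementation must escape label values and avoid unbounded paths.
        lines = [
            "# HELP http_requests_total Requests handled, by path.",
            "# TYPE http_requests_total counter",
        ]
        with self._lock:
            items = sorted(self.request_counts.items())
        for path, count in items:
            lines.append(f'http_requests_total{{path="{path}"}} {count}')
        return "\n".join(lines) + "\n"
```

With Flask, for instance, the usual way to attach a WSGI wrapper like this is app.wsgi_app = MetricsMiddleware(app.wsgi_app).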
Once your targets (running service instances) expose these metrics, Prometheus runs a scrape job at a configurable interval to pull the metrics and store them in its own time series database.
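The scrape cycle can be pictured with a toy version in Python. This is only a sketch of the idea, not how Prometheus is implemented: the in-memory dictionary stands in for the time series database, the function names are made up, and the parser assumes simple "name value" lines without trailing timestamps. One detail is borrowed from real Prometheus: each scrape also records an "up" sample, 1 when the target answered and 0 when it did not.

```python
import urllib.request
from typing import Callable, Dict, Iterable, List, Tuple

Sample = Tuple[float, float]                 # (unix timestamp, value)
Store = Dict[Tuple[str, str], List[Sample]]  # (target, metric name) -> samples


def parse_exposition(text: str) -> Dict[str, float]:
    """Parse 'name value' lines, skipping blank lines and # comments."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def http_fetch(target: str) -> str:
    """Fetch http://<target>/metrics; the target address is up to the caller."""
    with urllib.request.urlopen(f"http://{target}/metrics", timeout=2) as resp:
        return resp.read().decode("utf-8")


def scrape_once(targets: Iterable[str], fetch: Callable[[str], str],
                store: Store, now: float) -> None:
    """Pull metrics from every target once and append them to the store."""
    for target in targets:
        try:
            metrics = parse_exposition(fetch(target))
            up = 1.0
        except Exception:
            # Unreachable or malformed target: keep no metrics, mark it down.
            metrics, up = {}, 0.0
        store.setdefault((target, "up"), []).append((now, up))
        for name, value in metrics.items():
            store.setdefault((target, name), []).append((now, value))
```

Calling scrape_once(targets, http_fetch, store, time.time()) in a loop with a sleep of the scrape interval between calls gives the "every X time" behaviour described above.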
Now that you have metrics such as uptime, CPU usage and RAM usage, you can monitor them, query them and add alerting on top, right from Prometheus' UI, and quickly accomplish your goals.
An Example - Calculate cost by exact monitoring
You can monitor how much cost you are incurring from the uptime of, for example, your backend service. Just ask Prometheus: give me the total uptime across my different backend containers (Docker containers) over the last month.
Let's say you have Container 1 running for 2 days, Container 2 running for 5 days and Container 3 running constantly for all 30 days of the month.
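Prometheus can query the "health check" data it polled while collecting (let's say every 5 seconds) and then tell you that:
Container 1 was up for 2 days = 172,800 seconds
Container 2 was up for 5 days = 432,000 seconds
Container 3 was up for 30 days = 2,592,000 seconds
Now, your total uptime for your backend service = 172,800 + 432,000 + 2,592,000 = 3,196,800 seconds (37 days)
Now, you can easily calculate your total cost (let's say AWS charges $0.01 for every second of uptime, given some CPU/RAM):
3,196,800 * 0.01 = $31,968 per month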
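The same arithmetic can be written as a short Python sketch. It assumes that every 5-second health check succeeded while a container was running, and it reads the illustrative rate as $0.01 (1 cent) per second of uptime; the container names and constants are made up for this example.

```python
SCRAPE_INTERVAL_SECONDS = 5          # how often the health check is polled
SECONDS_PER_DAY = 24 * 60 * 60
PRICE_CENTS_PER_SECOND = 1           # illustrative rate: $0.01 per second


def uptime_seconds(successful_scrapes: int,
                   scrape_interval: int = SCRAPE_INTERVAL_SECONDS) -> int:
    """Each successful health check accounts for one scrape interval of uptime."""
    return successful_scrapes * scrape_interval


# Days each (hypothetical) container ran during the month.
days_up = {"container-1": 2, "container-2": 5, "container-3": 30}

# Number of successful health checks Prometheus would have stored for each.
scrapes = {
    name: days * SECONDS_PER_DAY // SCRAPE_INTERVAL_SECONDS
    for name, days in days_up.items()
}

total_seconds = sum(uptime_seconds(count) for count in scrapes.values())
total_cost_usd = total_seconds * PRICE_CENTS_PER_SECOND / 100

print(total_seconds)    # 3196800 seconds, i.e. 37 days
print(total_cost_usd)   # 31968.0 dollars
```

Working in whole cents keeps the multiplication exact; the division by 100 happens only once at the end.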
Does Prometheus Scale?
For sure! Obviously, a single Prometheus server cannot be called fully reliable, nor can it scale on its own when you have millions of machines / servers / containers running. We will see how to scale Prometheus (a pull-based system) for a fully distributed system, and how well it does.
Don't forget to subscribe to this newsletter so that you don't miss the next edition! If you liked this article, show it by liking/commenting on this post!
Building Fielddrive | IIT Jodhpur Alumnus
2 years ago: Shrey Batra, which metric would be good for monitoring Kafka latency?
Senior Software Engineer at Confluent
2 years ago: Short and sweet introduction to Prometheus. I've been wanting to start learning about this. Quick question: you mentioned that the service needs to expose a /metrics endpoint. What if the service itself is down? Do we run this endpoint in a sidecar?