Comprehensive Observability for Microservices Applications

As we are all aware of the benefits of microservices: they are easy to implement, independent, and simple to maintain and update. Deployment is also straightforward, especially into cloud environments.

Now we must talk about the challenges that arise at runtime: application performance, and connectivity among the various microservice applications that cooperate to achieve one workflow or complete a defined business task.

Measuring the ability of services to communicate with each other and deliver the expected results requires specific monitoring capabilities:

End-user experience monitoring—measures client operations and performance in browsers and on mobile devices.

System interaction monitoring—measures the system interactions required to service each transaction. These include interactions between the end-user device and the microservices and other components involved in the user’s request.

End-to-end monitoring—helps isolate issues across the microservices environment.

Another significant challenge is identifying the SRE responsible for each service—different microservices are owned by different teams who understand them and can fix issues.

In a microservices deployment, there is usually a small team responsible for the whole life cycle of each service. Each team must maintain service-specific and cross-service observability throughout the pipeline’s build, test, and release phases. Monitoring should be part of the continuous integration and continuous delivery (CI/CD) pipeline to guarantee the performance of new code releases.

A final challenge is managing the complexity of shared, dynamic services, which requires continuous, accurate documentation and awareness. It is important to train new employees to understand how each component interacts with the others.

Let’s understand one of these challenges with an example: what if one microservice goes down and communication to the very next microservice is immediately broken? We should have a mechanism to identify such issues proactively, or to observe all microservice components with an eagle-eye view.

With greater scale and complexity comes a greater need for observability. There are many potential points of failure and constant updates in a microservices architecture, which cannot be addressed by traditional monitoring solutions. The many unknown, dynamic factors in a distributed environment make it necessary to build observability into the system by design.

Hence, we need comprehensive observability for these microservice applications and architectures so we can identify and address issues quickly.

Let’s discuss the solution side of observability and the pillars it is based on:

M.E.L.T.

MELT stands for Metrics, Events, Logs & Traces.

Metrics:

Metrics are numerical measurements of an application that give you an in-depth view of its state: resource utilization, application status, communication with other correlated applications, and so on.

Metrics enable mathematical modeling and forecasting, which can be represented in a specific data structure. Examples of metrics that can help understand system behavior include:

  • Resource utilization, such as CPU, memory, and storage
  • Error rate, latency, call counts, processing time, infrastructure changes, traces, etc.

Utilizing metrics has several advantages, such as facilitating extended data retention and simplified querying. This makes them great for constructing dashboards that display past trends across multiple services.
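Conceptually, a metric sample is just a name, a value, a timestamp, and a set of labels. The sketch below uses hypothetical field and metric names (not any specific tool's format) to show how a metric such as error rate can be derived from raw samples:

```python
from dataclasses import dataclass, field
import time

# A minimal metric sample: name, value, labels, timestamp.
# (Illustrative structure; real systems such as Prometheus use a similar model.)
@dataclass
class MetricSample:
    name: str
    value: float
    labels: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)

# Record a few samples for one (hypothetical) service.
samples = [
    MetricSample("http_requests_total", 120, {"service": "checkout", "status": "200"}),
    MetricSample("http_requests_total", 3,   {"service": "checkout", "status": "500"}),
    MetricSample("cpu_utilization_percent", 42.5, {"service": "checkout"}),
]

# Derive an error rate from the raw request counters.
total  = sum(s.value for s in samples if s.name == "http_requests_total")
errors = sum(s.value for s in samples
             if s.name == "http_requests_total" and s.labels.get("status") == "500")
error_rate = errors / total * 100
print(f"error rate: {error_rate:.2f}%")  # 3 of 123 requests failed
```

Because samples like these are small and uniform, they compress well and can be retained for long periods, which is exactly what makes metrics suitable for historical dashboards.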

Events:

Events are the source of proactive decisions: by intercepting events raised against defined thresholds on an application’s metrics, a user or developer can keep the application up and running without issues.

Events enable the application owner or SRE to keep defined application components and entities observable at all times. Modern observability tools such as IBM Instana give SREs the freedom to create their own events and start monitoring an application at any time.

There are two kinds of events I am interested in from the perspective of observability of software systems.

1. Events that happen over time, where aggregate behavior such as frequency, presence, or absence is interesting. Example: average hourly take-offs from San Francisco Airport in the last week.

2. Events where the individual occurrence and its data are of interest. Example: when was the last time CPU on Host1 reached 90%?

Events always come in a defined structure and carry information about the entities on which they occur.

Events can also reveal correlations between entities.

Event management is also a must to keep our systems observable. The event-handling process relies on inputs from system notifications and monitoring-tool outputs, which are then taken through the following activities as guided by a monitoring plan:

  • Event detection
  • Event logging (for significant events)
  • Event filtering and correlation check (might be iterative)
  • Event classification (critical, major, medium, minor)
  • Event response selection
  • Notifications sent; response procedure carried out

These activities can be manual or automated depending on the service provider organization’s capabilities, and result in appropriate responses including event analysis, incident management, and stakeholder engagement. Event management is not simply the action of responding to system alerts, but rather an all-encompassing capability that requires people (roles), information and technology, processes, and, where required, partners and suppliers for success.
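The activities above can be sketched as a tiny pipeline. All thresholds, severity cut-offs, and names here are illustrative assumptions, not defaults of any real tool:

```python
# Sketch of threshold-based event handling: detect events from metric readings,
# filter duplicates, classify severity, and select a response.

def detect_events(metric_readings, threshold=90.0):
    """Raise an event whenever a CPU reading crosses the threshold."""
    return [{"entity": host, "metric": "cpu", "value": v}
            for host, v in metric_readings if v >= threshold]

def classify(event):
    """Map a reading to one of the severity classes named above."""
    v = event["value"]
    if v >= 95: return "critical"
    if v >= 90: return "major"
    if v >= 80: return "medium"
    return "minor"

def handle(metric_readings):
    events = detect_events(metric_readings)      # event detection
    seen, responses = set(), []
    for e in events:                             # filtering: drop duplicate entities
        if e["entity"] in seen:
            continue
        seen.add(e["entity"])
        e["severity"] = classify(e)              # event classification
        responses.append((e["entity"], e["severity"]))  # response selection
    return responses

print(handle([("Host1", 97.0), ("Host2", 91.5), ("Host1", 96.0)]))
# → [('Host1', 'critical'), ('Host2', 'major')]
```

In a real deployment the detection and notification steps would be driven by the monitoring tool itself; the point here is only the shape of the pipeline.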

Logs:

What did we do when we needed to troubleshoot application issues five years or so ago?

We would ask the client or the concerned person to share the application logs, so we could check the logs for exceptions and troubleshoot the issue.

Similarly, in the world of observability, logs provide a descriptive record of the system’s behavior at a given time, serving as an essential tool for debugging. By parsing them, one can gain insight into application performance that is not accessible via APIs or application databases.

A simple explanation would be that logs are a record of all activities that occur within your system.

Logs can take various shapes, such as plain text or JSON objects, allowing for a range of querying techniques. This makes logs one of the most useful data points for investigating performance issues and security threats, and for identifying root causes at the code level.

To make better use of logs, aggregating them to a centralized platform is essential. This helps in quickly finding and fixing errors, as well as in monitoring application performance.

Logs can be in any format, but the most widely used is plain text with a timestamp. Nowadays, structured logs are also becoming popular, as they can be ingested as-is into various observability tools and produce meaningful outcomes for identifying issues promptly.
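As an illustration of structured logging, the sketch below uses Python's standard logging module with a custom JSON formatter; the service name and field layout are assumptions for the example, not a standard schema:

```python
import json
import logging
import sys
import time

# Emit each log record as one JSON object so observability tools can
# ingest the line as-is, with no brittle text parsing.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S",
                                       time.gmtime(record.created)),
            "level": record.levelname,
            "service": "checkout",           # illustrative service name
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("payment gateway timeout after 3 retries")
# emits one line of the form:
# {"timestamp": "...", "level": "ERROR", "service": "checkout", "message": "..."}
```

Because every record is a self-describing object, a centralized log platform can index and query fields like `level` and `service` directly.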

Traces:

What are traces? Just calls, or something more? These questions always come to mind when trying to understand traces in observability and how they can help us find and act upon issues occurring within our systems.

Traces are a deep-dive mechanism over calls that are correlated with each other. They give you a holistic picture of your business-process workflows in terms of the interdependent calls and sub-calls needed to complete a workflow, irrespective of where a call originates and where it ends.

Traces also provide details of the microservice components used to complete a workflow, from the source to the destination of the microservice applications/components.

Traces also let SREs look into issues and broken links between sub-calls.

Traces can even provide code-level stack traces to pinpoint an issue at the code level too.
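A trace can be pictured as a tree of spans, each recording one call and pointing to its parent. The sketch below uses hypothetical field names (real tracing APIs such as OpenTelemetry differ) and also shows how a span's self time falls out of the structure:

```python
from dataclasses import dataclass
from typing import Optional

# One span = one call within a workflow; parent_id links it to its caller,
# so the whole request tree can be reconstructed end to end.
@dataclass
class Span:
    span_id: str
    name: str
    duration_ms: float
    parent_id: Optional[str] = None

# An illustrative trace for a single checkout request.
trace = [
    Span("1", "POST /checkout",    250.0),                 # root span (entry point)
    Span("2", "inventory-service",  80.0, parent_id="1"),
    Span("3", "payment-service",   120.0, parent_id="1"),
    Span("4", "SELECT orders",      30.0, parent_id="3"),  # DB call inside payment
]

# Self time of the root = its total duration minus time spent in direct children.
children = [s.duration_ms for s in trace if s.parent_id == "1"]
self_time = trace[0].duration_ms - sum(children)
print(f"root self time: {self_time} ms")  # 250 - (80 + 120) = 50.0 ms
```

This same self-vs-downstream split is what the "processing time" metric discussed below is built on.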

What do we need to observe for a microservices application?

At the application level:

1. Application latency: mean latency and the 25th, 50th, 90th, and 95th percentile latencies. For example, a 25th-percentile latency means that within this latency period 25% of the application’s requests/calls are successfully executed.

2. Processing time: how much time is spent processing within an application, a service, or an endpoint itself (self time), and how much time is spent calling downstream dependencies, broken down by call type, such as HTTP, database, messaging, RPC, or SDK (e.g., OpenTelemetry). Basically, it is the total time to complete one request minus the calls’ latency.

3. Requests per second/minute: how many requests are being served by an individual service (a collection of endpoints).

4. Erroneous/failed requests per second: how many requests are failing or being served with exceptions/errors.

5. Erroneous/failed request rate: the percentage of erroneous/failed requests out of the total requests served by the service/application.

6. Average response time per service endpoint: the average response time for each endpoint belonging to the monitored service(s).

7. Top N services by number of calls, erroneous calls, and latency: the top 5/10 services ranked by calls/requests, failed requests, and latency.
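Several of the application-level metrics above (percentile latency, error rate) can be derived directly from raw request samples. A minimal sketch, with illustrative sample data and a simple nearest-rank percentile:

```python
# Compute percentile latency and error rate from raw request samples.
# The sample data below is illustrative, not from a real system.

def percentile(samples, p):
    """Nearest-rank percentile: latency by which p% of requests complete."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [12, 15, 18, 20, 22, 25, 30, 45, 80, 200]  # 10 requests
failed = 1                                                 # 1 of them errored

print("p50:", percentile(latencies_ms, 50), "ms")  # 22 ms: half finish by then
print("p90:", percentile(latencies_ms, 90), "ms")  # 80 ms
print("error rate:", failed / len(latencies_ms) * 100, "%")  # 10.0 %
```

Note how the mean would be skewed by the single 200 ms outlier, which is exactly why percentile latencies are tracked alongside it.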

At the application resource level:

1. CPU utilization (application, container, pod, node): CPU utilization at the application level and for each respective entity.

2. Memory utilization (application, container, pod, node): memory utilization at the application level and for each respective entity.

3. Storage utilization (application, container, pod, node): storage/disk/volume utilization at the application level and for each respective entity.

4. Health status: determined by checking the health of containers and pods.

5. Host count—the number of hosts or pods running the system (enables the identification of availability issues resulting from crashed pods).

6. Live threads—the number of threads spawned by the service (enables the detection of multi-threading issues).

7. Heap usage—statistics related to heap memory usage (for debugging memory leaks).

A few golden-signal metrics:

1. Availability—the system’s state as measured from the client’s perspective, such as the ratio of errors to total requests.

2. Health—the system’s state as measured using regular pings.

3. Request rate—the rate of requests coming into the system.

4. Saturation—the extent to which the system is loaded, measured via idle time or system load (e.g., available memory or queue depth).

5. Usage—the system’s usage level (CPU load, memory usage, etc.), expressed as a percentage.
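The golden signals above can be derived from raw counters over an observation window. A minimal sketch with illustrative numbers:

```python
# Derive golden signals from raw counters collected over one window.
# All the input numbers below are illustrative assumptions.
window_seconds = 60
total_requests, failed_requests = 1200, 24
cpu_load_percent, queue_depth, queue_capacity = 65.0, 40, 100

signals = {
    # availability: ratio of successful requests, from the client's perspective
    "availability_percent": round((1 - failed_requests / total_requests) * 100, 2),
    # request rate: requests coming into the system per second
    "request_rate_per_sec": round(total_requests / window_seconds, 2),
    # saturation: how full a bounded resource (here, a queue) is
    "saturation_percent": round(queue_depth / queue_capacity * 100, 2),
    # usage: raw resource consumption as a percentage
    "usage_cpu_percent": cpu_load_percent,
}
print(signals)
# → {'availability_percent': 98.0, 'request_rate_per_sec': 20.0,
#    'saturation_percent': 40.0, 'usage_cpu_percent': 65.0}
```

Health is the one signal not derivable from counters alone; it typically comes from an active ping or health-check endpoint instead.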

The above metrics are good enough to observe microservice applications in distributed/cloud environments, but this is not the end. In the modern era, the complexity of applications with respect to businesses, cross-business integrations, and the use of hyperscaler-specific managed services (AWS, GCP, IBM Cloud, Azure, etc.) makes the scope of observability much wider, and we will need more exhaustive observability tools in the future.
