Business Process Expectations and Messaging Systems like Kafka
Animesh Mukherjee
Experienced in large-scale hybrid IT operations with emphasis on cloud, cost, cyber and ITSM.
Messaging systems like Kafka distribute messages and data streams in all sorts of applications, most commonly in cloud-native applications composed of microservices. Kafka is an essential connector between the microservices that produce and consume data, operating at scale and with high reliability. Messaging systems can be installed in your own data center or consumed as a managed service from public cloud providers such as AWS.
Understand the underlying business processes
Several technical decisions need to be made when configuring these systems, based on a deep technical understanding of the solution, the operating system, and the infrastructure in use. The first step, however, is to understand the business processes implemented by the microservices that produce and consume the messages and data streams. In some cases, such as transaction processing applications, every message matters and must be delivered exactly once, and the order in which messages are processed may also be important. For example, when updating a bank account, deposits and withdrawals must be handled in the correct order; otherwise an out-of-balance condition could occur and the customer could be charged for an overdraft through no fault of their own. In other cases, such as telemetry, the latest data is what matters, even if some earlier data was lost. When executing stock market trades it is important to use the latest price rather than catch up on earlier data that may have been missed. The business need has to be understood first, before the technical design can be finalized.
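One way Kafka preserves per-entity ordering is by routing all messages with the same key to the same partition. The sketch below illustrates the idea with a simple stand-in hash (the real Java client uses murmur2); the topic name and partition count are hypothetical.

```python
# Sketch: keyed partitioning keeps all events for one account on one
# partition, so they are consumed in the order they were produced.
# NOTE: Kafka's default partitioner uses murmur2; md5 here is only a
# deterministic stand-in for illustration.
import hashlib

NUM_PARTITIONS = 6  # hypothetical partition count for an "accounts" topic

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a message key to a partition."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

events = [
    ("acct-42", "deposit 100"),
    ("acct-42", "withdraw 30"),   # must be applied after the deposit
    ("acct-17", "deposit 500"),
]

# Both acct-42 events map to a single partition, preserving their order.
acct42_partitions = {partition_for(key) for key, _ in events if key == "acct-42"}
```

For exactly-once delivery, Kafka additionally offers idempotent producers and transactions (`enable.idempotence`, transactional IDs); keyed partitioning only solves the ordering half of the problem.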
Since Kafka and similar systems store messages by default, data retention rules need to be followed and information security considered. For example, an e-commerce site that avoids storing personal and account information by using external payment services needs to ensure that none of that information is persisted somewhere in the data stream. The storage used by Kafka must be fast enough to accept and distribute data quickly, but messages need not be retained on fast storage after they have been consumed. Especially in a cloud-based implementation, applying retention rules and archiving unneeded data will reduce running costs. Once again, the business requirement has to guide the design and help control the cost.
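A minimal sketch of both ideas: `retention.ms` and `cleanup.policy` are real Kafka topic-level configuration names, while the event shape and its sensitive fields are hypothetical. Redacting before producing ensures PII never reaches the stream at all, rather than relying on retention to age it out.

```python
# Sketch: topic retention settings plus a redaction step applied
# before producing. The event fields are assumptions for illustration.
import json

ORDER_TOPIC_CONFIG = {
    "retention.ms": str(7 * 24 * 60 * 60 * 1000),  # keep messages 7 days
    "cleanup.policy": "delete",                    # purge old data, don't compact
}

SENSITIVE_FIELDS = {"card_number", "cvv"}  # hypothetical PII fields

def redact(event: dict) -> dict:
    """Drop sensitive fields so they are never persisted in the stream."""
    return {k: v for k, v in event.items() if k not in SENSITIVE_FIELDS}

order = {"order_id": "o-1", "amount": 25.0, "card_number": "4111-…"}
payload = json.dumps(redact(order))  # safe to hand to a producer
```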
Monitoring and Response
Hardware and software components will fail at some point, and the architecture must be designed to minimize the business impact of these failures. Monitoring the operation of the entire system is therefore essential, covering the producers, the consumers, and the components of the messaging system itself. If producers generate data faster than the system can ingest it, or consumers pick up data too slowly, action may need to be taken. The status of failover between brokers and partitions also has to be monitored.
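The key signal for "consumers picking up data too slowly" is consumer lag: the gap between the newest offset in each partition and the consumer group's committed offset. The offsets and alert threshold below are toy values; in practice they come from the broker and your monitoring stack.

```python
# Sketch: computing per-partition consumer lag from toy offset numbers.
def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Lag = newest offset in the partition minus the consumer's committed offset."""
    return {
        p: log_end_offsets[p] - committed_offsets.get(p, 0)
        for p in log_end_offsets
    }

log_end = {0: 1500, 1: 980, 2: 2100}    # latest offsets per partition (toy data)
committed = {0: 1500, 1: 950, 2: 1600}  # consumer group's committed progress

lag = consumer_lag(log_end, committed)
ALERT_THRESHOLD = 100                   # hypothetical alerting threshold
hot_partitions = sorted(p for p, l in lag.items() if l > ALERT_THRESHOLD)
```

A lag that grows steadily is the usual trigger for scaling out consumers or investigating a stuck one.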
Monitoring the business processes for data and transaction integrity is equally important. Orders, payments, or other business transactions received must be reconciled with what was recorded, and with payments requested and received. This reconciliation needs to include external systems, APIs, and even third parties, to ensure that data loss, corruption, or duplication has not occurred. In addition to the technical team that investigates and fixes technical errors, there must be a business operations team that investigates and fixes business process anomalies. All these teams should follow SRE principles: gaming out possible errors, how they would be detected, and the automated scripts needed to fix them. The design must support 'undo' in case erroneous data is received from other applications, and even 'wait' cycles in case those systems have failed and need time to recover.
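The reconciliation step above can be sketched as a comparison of two ID streams, flagging both loss and duplication. The order IDs and the two-system shape are assumptions for illustration.

```python
# Sketch: reconcile orders received upstream against orders recorded
# downstream, detecting lost and duplicated transactions.
from collections import Counter

def reconcile(received_ids, recorded_ids):
    received = Counter(received_ids)
    recorded = Counter(recorded_ids)
    missing = sorted((received - recorded).elements())        # received but never recorded
    duplicated = sorted(i for i, n in recorded.items() if n > 1)  # recorded more than once
    return {"missing": missing, "duplicated": duplicated}

result = reconcile(
    received_ids=["o-1", "o-2", "o-3"],
    recorded_ids=["o-1", "o-3", "o-3"],   # o-2 was lost, o-3 written twice
)
```

In a real pipeline the same comparison runs on a schedule against the order system, the payment provider's reports, and the data recorded from the stream, and any non-empty result feeds the business operations team's runbooks.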
Service Levels
Service Level Objectives (SLOs) need to be defined and measured for business performance, availability, data integrity, and security of all applications critical to the business. The underlying measures that lead to the SLOs must be part of the operations dashboard watched by both the business and technical operations teams, so that any deviation can be fixed quickly. These should be combined with technical indicators to help quickly correlate business process errors with technical events.
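One concrete dashboard number derived from an SLO is the remaining error budget. The sketch below assumes a hypothetical 99.9% availability target; the request counts are illustrative.

```python
# Sketch: remaining error budget for a hypothetical 99.9% availability SLO.
SLO_TARGET = 0.999  # assumed target: 99.9% of requests succeed

def error_budget_remaining(total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, <0 = blown)."""
    allowed_failures = total_requests * (1 - SLO_TARGET)
    return 1 - failed_requests / allowed_failures

# 1,000,000 requests allow 1,000 failures at 99.9%; 250 failures used
# leaves three quarters of the budget.
remaining = error_budget_remaining(total_requests=1_000_000, failed_requests=250)
```

Tracking this fraction alongside business indicators (orders reconciled, payments matched) lets both teams see at a glance whether a technical event is eating into a business-critical SLO.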
This is an example of how, in addition to the technology, applications have to be built with the business, its processes, and its metrics in mind in order to be successful.
About Tailwinds
At Tailwinds we are helping teams design, build, deploy and operate cloud-native applications securely with lower cost and faster time to market using our Internal Developer Platform (IDP) product - MajorDomo.
#SLO #errorbudget #cloudnative #sre #platformengineering #internaldeveloperplatform #itbm #itsm #itom