Distributed System Design Patterns

In this article we will discuss design patterns that are essential for any distributed system.

We will start by discussing the fallacies of distributed computing, then move on to design and implementation patterns, resiliency patterns, messaging patterns, and finally data management patterns.

Before we go into the details of distributed system design patterns, let's briefly discuss the fallacies of distributed computing.


1. Fallacies of Distributed Computing

The fallacies of distributed computing are assumptions that developers and architects often make when designing distributed systems. These assumptions are false. Failing to account for them when designing a system can lead to systems that are prone to failure, hard to maintain and operate, insecure, unreliable, and non-resilient.


The following list summarizes the 8 fallacies of distributed computing:

  • The network is reliable
  • Latency isn’t a problem
  • Bandwidth isn’t a problem
  • The network is secure
  • The topology won’t change
  • The admin will know what to do
  • Transport cost isn’t a problem
  • The network is homogeneous

Let’s briefly discuss each of these.

1.1 The network is reliable

This fallacy assumes that the network is reliable at all times and that every packet will reach its destination without failure. In reality, hardware, software, and router or switch failures cause network outages.

You need to consider solutions such as retries and acknowledgements, and the use of reliable messaging infrastructure like message queues and service brokers.

1.2 Latency isn’t a problem

Latency is the time it takes for data to travel from one node to another. Developers often assume that, as on their local machine, the latency of service calls over the network will be close to zero. It may be small on a LAN, but over a WAN or the internet it can be large.

In distributed computing you need to avoid inter-service chit-chat. Your microservices should be well designed and keep the data they need to operate. You should not cross the network if you do not have to. When you must call another service, carry all the data you need in that call.

1.3 Bandwidth isn’t a problem

Bandwidth refers to the amount of data that can be transferred from one node to another in a given timeframe. Although bandwidth keeps growing, the amount of data grows faster. When transferring lots of data in a given period, network congestion may interfere. Developers often fetch too much data through object-relational mappers (ORMs) such as Entity Framework, which can cause congestion.

To avoid congestion and bandwidth problems, you should move time-critical data to a separate network. Developers should also lazy-load data in the ORM. You may also need more than one domain model to balance the competing forces of bandwidth and latency.

1.4 The network is secure

Even a single computer isolated from the network cannot be assumed secure. Your network can be compromised in a variety of ways: viruses and malware, vulnerabilities in operating system or router software, DDoS attacks, cross-site scripting, and so on.

You need to adopt security at all layers of the architecture. Develop your application following secure coding practices, perform threat modelling, keep a security mindset at every layer, and apply defence-in-depth strategies.

1.5 The topology won’t change

Topology refers to the way the nodes of a network are arranged and interact with each other. In distributed computing, the topology changes all the time, sometimes by accident and sometimes through upgrades.

If you hard-code configuration values in your application, it could stop responding when the configuration changes. Never hard-code network configuration such as IP addresses and ports in your code. Externalise these configurations in config files or environment variables so that you can change them without redeploying the code. You should also use service discovery at runtime to avoid interruptions caused by configuration changes.
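As a minimal sketch, reading such values from environment variables in Python might look like this (the service name and variable names are hypothetical):

```python
import os

# Hypothetical service endpoint, externalised instead of hard-coded.
ORDER_SERVICE_HOST = os.environ.get("ORDER_SERVICE_HOST", "localhost")
ORDER_SERVICE_PORT = int(os.environ.get("ORDER_SERVICE_PORT", "8080"))

def order_service_url(path: str) -> str:
    # Changing the target host or port requires no code change or redeploy.
    return f"http://{ORDER_SERVICE_HOST}:{ORDER_SERVICE_PORT}{path}"

print(order_service_url("/orders/42"))
```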

1.6 The admin will know what to do

In small environments there may be a single administrator who maintains the environment: installing, upgrading, and patching the application. This has changed with modern cloud-native and DevOps practices; you may have many different administrators in your environment.

Assume that not every administrator knows your system and its deployment. Make your system easy for different administrators to manage, and deploy it through an automated CI/CD pipeline in all environments.

Your admins should have high visibility into your system, so design it to be highly observable. Logs, metrics, and tracing should be readily available to diagnose and resolve issues as they arise.

1.7 Transport cost isn’t a problem

Transporting data over the network has a cost: serialization before crossing the network and deserialization on the other side. There is also a hardware and network infrastructure cost, both upfront and ongoing.

When designing a system, architects need to make trade-offs between infrastructure costs and development costs, upfront versus ongoing.

Avoid designing chatty services; only call a service when you need to. Design services with the minimum data requirement in both request and response contracts. Prefer lightweight serialization formats such as JSON over XML-based serialization.

1.8 The network is homogeneous

Distributed computing connects many types of devices, running different operating systems, communication protocols, configurations, bandwidths, and so on. Having components with similar configurations is hard to achieve; for example, you have little control over which mobile devices connect to your application.

Focus on interoperability to ensure that different components can talk to each other despite their differences. Use standard protocols such as HTTP, WebSocket, and MQTT, with a standard data format such as JSON.


2. Design and Implementation Patterns

In this section we will discuss design and implementation patterns. We will start with the Strangler pattern, which is commonly used for modernising monolith applications.

2.1 Strangler Pattern

2.1.1 Problem

Modernizing a monolith into a microservice-based architecture is challenging because you must keep the legacy system running while gradually migrating to the new system. The legacy system still needs to handle requests for the modules that have not yet been migrated.

You may be running the legacy application alongside one or two microservices. This creates a challenge for clients, since a client application needs to know where each module or service is located and how to call it. With every new migration, the client would need to learn the location of the new service.

2.1.2 Solution

Gradually migrate the legacy monolith to microservices. Take out one module representing a particular business capability and create a service around it, deployed separately. Create a façade layer that intercepts client requests and routes each request to either the legacy application or a new microservice. You can continue migrating legacy modules to new microservices while the façade keeps routing client requests to the respective applications.

In this pattern, the client application is unaware of the migration and continues to call the system exactly as before.
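As a minimal sketch of the façade's routing logic, assuming a hypothetical system in which only the "orders" module has been migrated so far:

```python
# Modules migrated so far, mapped to their new microservice endpoints.
MIGRATED_MODULES = {
    "orders": "http://orders-service.internal",  # hypothetical URL
}
LEGACY_BASE_URL = "http://legacy-app.internal"   # hypothetical URL

def route(module: str, path: str) -> str:
    """Return the backend URL for a request; the caller never knows
    whether the module has been migrated yet."""
    base = MIGRATED_MODULES.get(module, LEGACY_BASE_URL)
    return f"{base}/{module}{path}"

print(route("orders", "/42"))   # routed to the new microservice
print(route("billing", "/42"))  # still routed to the legacy monolith
```

As modules migrate, entries are added to the routing table; clients keep calling the façade unchanged.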



2.1.3 Use of this pattern

Use this pattern when you are migrating a legacy monolith to a microservice-based architecture and need to gradually carve modules out into microservices.

2.1.4 Issues and Considerations

While migrating modules to new microservices, you need to migrate the data of those modules as well. You also need to make sure that both the legacy system and the new microservice can continue to operate after the data has been migrated. The façade layer must be highly available; avoid letting it become a single point of failure.

Once the migration is complete, you need a strategy to remove the façade layer or replace it with a better implementation, such as an API gateway.


2.2 Anti-Corruption Layer Pattern

2.2.1 Problem

When you follow the Strangler pattern to gradually migrate a legacy monolith to microservices, the strangled microservices often need to communicate with the legacy system to serve requests. Legacy systems often rely on outdated infrastructure, protocols, data models, and APIs that are not suitable to incorporate into new microservices.

2.2.2 Solution

Introduce an anti-corruption layer between the new microservices and calls to the legacy system. This layer translates communication between the two systems.

The anti-corruption layer allows two systems to integrate without letting the domain model of one corrupt the domain model of the other. It is a way of creating API contracts that make the monolith look like any other microservice.

An anti-corruption layer can be implemented as three submodules (a translator sketch follows the list):

  • Façade – simplifies the process of integrating with the monolith’s interface
  • Adapter – adds a service layer; services take a request using an agreed protocol and make that request to the monolith’s façade
  • Translator – converts requests and responses between the domain model of the monolith and the domain model of the new microservice
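As a minimal sketch of the translator submodule, assuming a hypothetical legacy customer record whose cryptic field names must not leak into the new service's domain model:

```python
from dataclasses import dataclass

@dataclass
class Customer:  # domain model of the new microservice
    customer_id: str
    full_name: str
    email: str

def translate_legacy_customer(legacy: dict) -> Customer:
    """Map the legacy payload onto the new domain model; the mapping
    lives only inside the anti-corruption layer."""
    return Customer(
        customer_id=str(legacy["CUST_NO"]),
        full_name=f'{legacy["FNAME"]} {legacy["LNAME"]}'.strip(),
        email=legacy.get("EMAIL_ADDR", ""),
    )

legacy_record = {"CUST_NO": 1001, "FNAME": "Ada", "LNAME": "Lovelace",
                 "EMAIL_ADDR": "ada@example.com"}
print(translate_legacy_customer(legacy_record))
```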



2.2.3 Use of this pattern

Use this pattern mainly when you are modernizing a legacy application into a microservice-based architecture and the new microservices need to talk to a legacy application that only speaks legacy protocols. More generally, you can use this pattern whenever two systems with different semantics need to communicate.

The anti-corruption service layer can also host concerns such as scalability, availability, monitoring, configuration management, and release management.

2.2.4 Issues and Considerations

The anti-corruption layer should not become a single point of failure. Consider whether you need one anti-corruption layer per microservice or one shared layer; the latter may pose scalability and availability challenges. Also consider whether to make this layer permanent or retire it after the migration.


2.3 Ambassador Pattern

2.3.1 Problem

Organizations often want to incorporate cross-cutting concerns such as routing, metering, monitoring, circuit breaking, and retries across their fleet of applications. It is sometimes difficult to update every application to incorporate those concerns: the code may be outdated, the development teams may lack knowledge of the legacy applications, or you may be dealing with COTS applications whose code you cannot access.

2.3.2 Solution

Incorporate the cross-cutting concerns into an external process that acts as a proxy between your application and its clients. Deploy this proxy alongside your application, on the same host.
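As a minimal sketch, an ambassador running as a sidecar might add retries in front of a legacy app without touching its code (the address and retry policy here are hypothetical):

```python
import time
import urllib.request

LEGACY_APP = "http://localhost:9000"  # hypothetical co-located legacy app

def call_via_ambassador(path: str, retries: int = 3) -> bytes:
    """Forward a request to the legacy app, retrying transient failures."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(LEGACY_APP + path, timeout=2) as resp:
                return resp.read()
        except OSError:  # urllib network errors derive from OSError
            if attempt == retries - 1:
                raise
            time.sleep(0.5 * (attempt + 1))  # simple linear backoff
```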



2.3.3 Use of this pattern

This pattern is useful when you need to add cross-cutting concerns to a legacy application without changing its code. It is also useful when you need to incorporate a common set of connectivity features, or support cloud connectivity requirements, in legacy applications.

2.3.4 Issues and Considerations

The proxy layer may add latency, so this pattern may not suit latency-sensitive applications.

Some cross-cutting concerns may not be safe to put in the proxy if the legacy application does not support them. For example, retries are not safe if the legacy application’s operations are not idempotent.

You also need to consider how you will deploy the proxy, and whether to design one proxy per client or a shared proxy for all clients. Each option has pros and cons to weigh.


2.4 Gateway Aggregation Pattern

2.4.1 Problem

An application often needs to call multiple services to complete an operation. For example, to show product information, the application would first call the product service for basic information, then the product price service for prices, and then the inventory service for the available stock of this product.

This becomes more complicated if more services are later added to provide product information: the client application then needs to call those additional services too.

This design leads to increased network calls and chattiness between the client application and multiple backend services which, as discussed in section 1, should be avoided in distributed computing.

2.4.2 Solution

Add another service that acts as a gateway. This service calls the multiple backend services, aggregates the results, and returns a single response to the client application. The gateway reduces chattiness between the client application and the backend services.
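A minimal sketch of the aggregation, with stand-in coroutines in place of real backend calls; the gateway fans out concurrently and returns one combined payload:

```python
import asyncio

async def fetch_product(pid):  # stand-ins for real service calls
    await asyncio.sleep(0.1)
    return {"id": pid, "name": "Widget"}

async def fetch_price(pid):
    await asyncio.sleep(0.1)
    return {"price": 9.99}

async def fetch_stock(pid):
    await asyncio.sleep(0.1)
    return {"in_stock": 42}

async def product_details(pid):
    """Aggregate several backend responses into a single response."""
    product, price, stock = await asyncio.gather(
        fetch_product(pid), fetch_price(pid), fetch_stock(pid))
    return {**product, **price, **stock}

print(asyncio.run(product_details("p-1")))
```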



2.4.3 Use of this pattern

This pattern is useful when a client needs to communicate with multiple backend services, and when the client has strict network or bandwidth constraints, as mobile clients often do.

2.4.4 Issues and Considerations

This pattern may not suit latency-sensitive applications, as the gateway layer adds extra latency.

The gateway service can become a single point of failure; pay special attention to its availability, scalability, monitoring, and resiliency.

The gateway service can also hurt performance, especially when it must call many services. One solution is to call the backend services asynchronously, for example using the Observer pattern. If one of the backend services becomes unavailable, you can fall back to returning a cached response for that service.


2.5 API Gateway Pattern

2.5.1 Problem

In a microservice-based architecture you often end up with many fine-grained services. Consuming those fine-grained services from the client side is challenging since:

  • Different clients may have different data requirements
  • A client may not support the protocol the target service accepts
  • Clients have different bandwidth constraints, e.g., mobile vs desktop
  • Services must be discovered
  • Services are versioned

You also need to implement common features across the fleet of services, such as authentication, authorization, monitoring, rate limiting, throttling, and logging.

2.5.2 Solution

Implement an API gateway that acts as the single entry point into the system, much like the Façade pattern. The API gateway encapsulates the internal system architecture and implementation details from the client.

It provides features such as:

  • A single endpoint for various services, hiding the complexity of service discovery, versioning, failure handling, fault tolerance, etc.
  • Protecting the system from failure by implementing fault tolerance and serving cached or default data where appropriate
  • Minimizing network calls from the client, since a single call to the gateway returns all the data the client needs
  • Billing
  • Rate limiting or throttling
  • Request authentication and authorization



2.5.3 Use of this pattern

This pattern is useful when your system comprises many different microservices and you need to apply common cross-cutting concerns.

It is also useful when you need to authenticate requests and terminate SSL traffic at the gateway level, saving CPU time on each microservice.

2.5.4 Issues and Considerations

The API gateway adds another layer to your architecture, which can become a single point of failure.

The API gateway should be highly available, scalable, and resilient. Do not put business logic in this layer.


2.6 Backends for Frontends Pattern

2.6.1 Problem

Different frontend applications have different data and bandwidth requirements. Mobile devices may have different UI requirements than desktop applications; a client streaming a movie on a hand-held device has different requirements than one streaming on a TV.

One common backend may not satisfy the requirements of every frontend. It may require regular changes for one interface that the other interfaces do not need, and a single common backend team can become a bottleneck for the different frontend teams.

2.6.2 Solution

Create a dedicated backend service for each frontend, serving exactly the data that frontend needs given its client, bandwidth, and device requirements.

Each frontend team can then control its own backend without relying on a common backend team.



2.6.3 Use of this pattern

Use this pattern when each client has specialised UI data requirements that cannot be fulfilled by common backend services.

It is also useful when dependence on one common backend team has become a bottleneck and an overhead for small UI changes.

A dedicated backend service also lets you optimise the backend for a better user experience.

2.6.4 Issues and Considerations

It may not be feasible to develop a dedicated backend for every type of interface. Instead, categorize the interfaces by common requirements and develop one backend per category.

It can also create coding challenges, as you end up with multiple copies of similar code accessing the backing services; keeping that code in sync across backends becomes a challenge.

It gives frontend teams more latitude to push business logic into their own backend services. You also need to consider whether to maintain a generic backend alongside the dedicated backend for each interface.


3. Resiliency Patterns

In this section we will discuss resiliency patterns. We will start with the Retry pattern, which is the most common pattern in distributed system design.

3.1 Retry Pattern

3.1.1 Problem

In a microservice-based architecture, calls to other services can fail due to transient faults in the remote service. Transient faults often correct themselves, and the calling service can simply retry to complete its work. For example, the remote service might be under temporary load and throttling requests; once the load subsides, it will accept requests again. Retrying the call after a delay could well succeed.

3.1.2 Solution

Retry the call to the remote service using one of the following strategies:

  • Retry immediately – retry the call right after the failure
  • Retry with delay – wait some amount of time before retrying. This is typically helpful when the remote service is under load.



3.1.3 Use of this pattern

Use this pattern when the remote service experiences transient failures and is expected to recover from them, so a subsequent call is likely to succeed.

3.1.4 Issues and Considerations

Retrying helps if the target service is facing a temporary fault and will quickly become available again, but it can worsen the situation if the target service is failing because it is overloaded, since retries put extra load on it.

The way to handle this is backoff: instead of retrying continuously, the client waits between retries. This is usually combined with exponential backoff, which increases the wait time exponentially after every attempt. To avoid waiting arbitrarily long, the implementation should cap the backoff at a maximum value; once the cap is reached and retries keep failing, the client should stop retrying and handle the failure.

Retry with backoff can still put significant load or contention on the system if all failed calls back off, and therefore retry, at the same time. To avoid this, add jitter: randomness in the backoff time that spreads the retries out over time. A combined backoff-with-jitter sketch follows.
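A minimal sketch of capped exponential backoff with full jitter, assuming the callable raises an exception on transient failure:

```python
import random
import time

def retry_with_backoff(call, max_attempts=5, base=0.2, cap=5.0):
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            # Exponential backoff capped at `cap`, with full jitter so
            # concurrent clients do not retry in lockstep.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```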

Retries have side effects as well. Consider calling a remote service and receiving a timeout: what if the remote service had already processed the request? A retry would then create a duplicate request. In this scenario, the remote service should be idempotent to handle such cases.


3.2 Bulkhead Pattern

3.2.1 Problem

In a microservice-based architecture you have several services, each calling others. If one service experiences heavy load, it impacts the consumers of that service, and the failure cascades.

Similarly, a consumer may call multiple services; if one of those remote services is unresponsive, it can hold up client requests for a long time. As requests keep arriving, resources become exhausted, and eventually the consumer is unable to send requests to any of the other services.



3.2.2 Solution

Create multiple service instances and partition them into groups based on load and availability requirements. The pattern is named after the bulkheads of a ship’s hull: partitions that can be sealed to divide the ship into watertight compartments, so that damage in one compartment cannot sink the entire ship.
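A minimal sketch of a bulkhead using bounded pools of permits, one per downstream dependency (names and limits are hypothetical), so a slow dependency can exhaust only its own compartment:

```python
import threading

bulkheads = {
    "payments":  threading.BoundedSemaphore(5),
    "inventory": threading.BoundedSemaphore(10),
}

def call_with_bulkhead(dependency: str, call):
    sem = bulkheads[dependency]
    if not sem.acquire(blocking=False):
        # Compartment full: reject fast instead of queueing indefinitely.
        raise RuntimeError(f"{dependency} bulkhead full")
    try:
        return call()
    finally:
        sem.release()
```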

3.2.3 Use of this pattern

Use this pattern when you need resilient services and want to protect your system from cascading failure. It is handy for isolating the resources used to consume a backend service, and for isolating mission-critical consumers from the other consumers of a service.

3.2.4 Issues and Considerations

When applying the Bulkhead pattern, consider combining it with the Retry, Circuit Breaker, and Throttling patterns for a more resilient architecture.

Deploy your services on a platform that supports partitioning, such as a container-based platform like Kubernetes. You also need to monitor the performance and SLA of each partition.


3.3 Circuit Breaker Pattern

3.3.1 Problem

In a microservice-based architecture, calls to other services can fail due to transient faults in the remote service. Sometimes the remote service recovers quickly and the calling service can simply retry. But sometimes the fault takes a long time to fix; instead of retrying indefinitely, the calling service should fail fast to avoid cascading failure through the system.

3.3.2 Solution

A service client should invoke the remote service through a proxy that functions like an electrical circuit breaker. When the number of consecutive failures crosses a threshold, the circuit breaker trips, and for the duration of a timeout period all attempts to invoke the remote service fail immediately. After the timeout expires, the circuit breaker allows a limited number of test requests through. If those requests succeed, the circuit breaker resumes normal operation; otherwise, the timeout period begins again.

The circuit breaker is a way to degrade system functionality gracefully when the system is under stress. Your code should implement a fallback for when the circuit is open, for example returning a cached response or throwing an exception.
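A minimal sketch of a circuit breaker, assuming failures are signalled by exceptions; the states are closed (normal), open (fail fast), and half-open (a trial request after the timeout):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Timeout elapsed: half-open, let a trial request through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0
        self.opened_at = None  # success: close the circuit
        return result
```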



3.3.3 Use of this pattern

Use this pattern when you have retry logic in your service but know that a call to the remote service is likely to fail; the circuit breaker prevents the service from making that call at all.

3.3.4 Issues and Considerations

Calling services must handle circuit breaker exceptions in their code. Some libraries provide a fallback handler in which you can write this handling logic. Implement the logic according to business requirements and the type of exception returned by the circuit breaker; for example, you might degrade functionality or return a cached response to the user.

Log all failed responses from the circuit breaker so the health of the operation can be monitored.

Provide a mechanism to monitor the state of the circuit breaker, along with the ability to manually override it, so administrators can force the circuit open or closed, for example while the remote service is temporarily unavailable.

Implement a separate circuit breaker for each service or resource, and make sure a large number of concurrent requests cannot block the circuit breaker’s operation.

Implement logic to retry requests once the circuit breaker closes again. One technique is to queue requests while the circuit is open and replay them from the queue once it closes.


3.4 Cache Pattern

3.4.1 Problem

Applications often need data from underlying services or a backing data store. Reading from the data store on every request is impractical: it hurts performance, especially under a large number of concurrent requests, and it can create contention on the data store, potentially making the service unresponsive and leading to cascading failure.

3.4.2 Solution

Services should store frequently accessed data in a cache. This speeds up response times and reduces load on the underlying system. Services should use an external cache that stores data in a separate fleet, such as Memcached or Redis.

The application retrieves data from the cache first. If the data is not in the cache, the application reads it from the data store and adds it to the cache. Writes to the cache are written back to the data store.
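A minimal sketch of this read flow with a TTL, using an in-memory dict in place of an external cache such as Redis:

```python
import time

_cache = {}        # key -> (value, expiry timestamp)
TTL_SECONDS = 60   # hypothetical expiry

def get(key, load_from_store):
    """Return a fresh cached value, or read through and cache it."""
    entry = _cache.get(key)
    if entry is not None and entry[1] > time.monotonic():
        return entry[0]               # cache hit
    value = load_from_store(key)      # cache miss: hit the store
    _cache[key] = (value, time.monotonic() + TTL_SECONDS)
    return value
```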



3.4.3 Use of this pattern

This pattern is ideal when reads far outnumber writes. Use it to handle unanticipated load on your service. You can also combine it with the Circuit Breaker pattern, returning a cached response as a fallback when the circuit is open.

3.4.4 Issues and Considerations

To avoid depending on the availability of the cache, services should implement logic to cope with cache unavailability or failure. One approach is to fall back to calling the underlying service, but this can overload that service if the cache stays unavailable for a long period. This can be mitigated by combining the external cache with an in-memory cache and by using techniques such as load shedding and rate limiting.

Services must deal with stale data in the cache. One way is cache expiration in the form of a time to live (TTL), which determines how long data is retained in the cache. The TTL is chosen per application requirement, largely depending on how frequently the data changes. TTL is typically combined with an eviction policy that controls how items are removed from the cache; the most common eviction policy is Least Recently Used (LRU).

Another challenge is data version and format. A new version of a service may introduce new fields. If a service reads the previous version of the data, it should either discard it or handle it, by reading from the underlying system or applying a default value to the new field. Similarly, old versions of a service should handle new versions of the data. This matters when two versions of the same service run side by side.

Avoid mass refreshes of the cache, especially when you roll out a new service version, as they put significant load on the underlying system. This can happen when many clients request uncached data at the same time, producing a flood of requests to the underlying services and overloading them.

One approach to this problem is request coalescing: only one request is allowed to call the underlying system for uncached data, while the remaining requests wait for the data to be served from the cache. Some caching solutions provide request coalescing out of the box; if not, you will have to implement the logic in your code.

Caching solutions should be scalable and automatically increase cache capacity as load increases.

Also consider the security of sensitive data in the cache. Various caching solutions, such as Amazon ElastiCache for Redis, support both in-transit and at-rest encryption. Access to the cache should be restricted, since an unauthorised user could populate the cache with values under their control.


3.5 Compensating Transaction Pattern

3.5.1 Problem

The relationship between consistency, availability, and partition tolerance was defined by Eric Brewer in his well-known CAP theorem, which states that a distributed data store can only guarantee two of the three.

Distributed systems are often designed for availability and partition tolerance, sacrificing strong consistency. In a microservice-based architecture, data becomes consistent eventually, but if any step of a multi-step operation fails, the overall state becomes inconsistent. How do we undo the failed step and bring the overall system back to a consistent state?

3.5.2 Solution

Track the progress of each step and implement a compensating transaction for each of them. A compensating transaction must undo the effect of the original step, and of all the preceding steps, bringing the overall system back to a consistent state.

Compensating transaction logic is application-specific and may involve business logic to reverse the original step. For example, if a payment step fails, the compensating transaction would reverse the payment.

You typically use the Saga pattern, implementing the eventually consistent operation as a workflow. The system keeps track of each step performed in the workflow, and if any step fails, it invokes the compensating step of all previously completed steps.
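A minimal sketch of a saga runner: each step is paired with an undo action, and a failure triggers the compensations of all completed steps in reverse order (all step names are hypothetical):

```python
def run_saga(steps):
    """steps: list of (do, undo) callables."""
    completed = []
    try:
        for do, undo in steps:
            do()
            completed.append(undo)
    except Exception:
        for undo in reversed(completed):  # compensate in reverse order
            undo()
        raise

run_saga([
    (lambda: print("reserve stock"),   lambda: print("release stock")),
    (lambda: print("charge payment"),  lambda: print("refund payment")),
    (lambda: print("create shipment"), lambda: print("cancel shipment")),
])
```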



3.5.3 Use of this pattern

This pattern is useful when an operation consists of multiple steps and each step requires application-specific business logic to undo. It helps bring the system back into a consistent state.

3.5.4 Issues and Considerations

The compensating steps must bring the system into a consistent state; they do not need to be executed in parallel.

You should have a mechanism to retry failed compensating steps. All compensating steps must be idempotent so that they can be retried safely. Use resilient infrastructure, such as messaging, to track and retry them.

If the system cannot be recovered by the compensating steps, it should raise an alert for manual intervention.

Compensating transactions add complexity, since you must implement an undo for every operation. The undo may not be a simple state change; you may need business logic that reverses the main operation.

The business logic does not have to be the exact opposite, but it must return the system to its original state. In the payment example, reversing the original transaction requires creating another transaction in the system.


3.6 Throttling Pattern

3.6.1 Problem

Your services may experience spikes in load, for example when selling tickets to a popular event during a particular time window. A sudden burst of activity can push processing requirements beyond the capacity of the available resources, causing poor performance and eventually failing the service.

You could also see sudden bursts of activity from malicious sources, such as DDoS attacks.

The system should implement a mitigation for such situations. One option is to autoscale the services, but autoscaling takes time to spin up additional resources; an EC2 instance, for example, takes time to start. In the window before the new instances are up, your services could be unresponsive due to scarce resources.

3.6.2 Solution

Allow services to operate up to a certain limit, and throttle requests once the limit is reached. Have a monitoring mechanism in place that watches the resources of each service and automatically applies the throttling strategy.

You could implement throttling strategies such as (a rate-limiting sketch follows the list):

  • Rate limiting – reject requests once a user reaches N requests over a given period
  • Degrading functionality, for example switching to a lower resolution for streaming video
  • Queueing requests and processing them with another worker process
  • Rejecting requests while the system is busy or under stress
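A minimal sketch of rate limiting with a token bucket, assuming a single process; instances sharing a limit would need a shared store:

```python
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: int):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, up to the capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over the limit: throttle this request

bucket = TokenBucket(rate_per_sec=10, capacity=20)
print(bucket.allow())  # True until the bucket drains
```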

3.6.3 Use of this pattern

This pattern is useful for dealing with unanticipated load. It provides an alternative to autoscaling, and can also be combined with it to handle the requests that arrive while additional resources are spinning up. It helps meet service SLAs and limit the cost of additional resources.

3.6.4 Issues and Considerations

The throttling strategy is a design choice that should be made early in the design. A poorly chosen strategy can significantly impact the performance of the system.

You need good monitoring of resource utilisation and available capacity for each service, so that the designed strategy can be applied automatically under load. You also need a mechanism to quickly revert to normal operation once the load drops.

The throttling decision also depends on the platform your services run on. Platforms such as Kubernetes, or serverless platforms like AWS Lambda, can scale your services quickly, which may make throttling unnecessary while resources are being added.

If you anticipate high demand during a certain period, schedule resource scaling to cater for it, and scale back when demand drops.


4. Messaging Patterns

In this section we will discuss messaging patterns. We will start with the Choreography pattern, a common pattern for service communication.

4.1 Choreography Pattern

4.1.1 Problem

In a distributed architecture, microservices need to communicate with each other to complete a request. Developers often use a centralized service that acts as an orchestrator: one service owns all routing, transformation, policy, security, and so on, and performs the interaction and aggregation between services.

The enterprise service bus (ESB) used in Service-Oriented Architecture (SOA) is a typical example.

This design introduces tight coupling between services. The single orchestration service becomes a single point of failure, and it also poses challenges for scaling, reliability, and availability.

4.1.2 Solution

Microservices should avoid central orchestration and instead use choreography.

Choreography is analogous to ballet dancers: when circumstances on stage differ from the original plan, there is no conductor present to tell the dancers what to do; they simply adapt.

In choreography, the decision logic is embedded in the respective services. Each service knows when to execute its operation, what data it needs, and which service it needs to call to perform its operation.



4.1.3 Use of this pattern

Use this pattern for general microservice communication. It helps remove tight coupling between services and prevents one service from becoming a bottleneck and a single point of failure.

4.1.4 Issues and Considerations

Implementing a workflow is often difficult in this pattern because there is no centralized coordination logic. You can combine it with the Saga pattern to implement workflows.

Adding a new service into the workflow is also challenging, and there is a risk of cyclic dependencies between services.

Distributed tracing is also challenging, since at any moment some services may not yet have processed their messages from the queue.

Integration testing is difficult, as you must simulate all the services in the request chain running and coordinating in a timely manner.


4.2 Publisher-Subscriber Pattern

4.2.1 Problem

In a microservice-based architecture, services often need to inform other services when an event occurs. The sender should not need to know the identity and location of every service interested in the event. Synchronous calls create tight coupling between services and do not work well when you want to add more listeners. Services need a way to announce events as they occur, without blocking and without knowing the identity of the listeners.

4.2.2 Solution

Use a messaging system, such as a message broker or service bus, to send events asynchronously from sender to listeners. The sender, called the publisher, publishes the event in a known message format to the broker. A listener, called a subscriber, subscribes to a named channel, called a topic, on the broker. When a message arrives on a topic, each of its subscribers receives a copy of the message for processing.
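A minimal in-process sketch of the mechanism; a real system would use a broker such as RabbitMQ, Kafka, AWS SNS, or Azure Service Bus:

```python
from collections import defaultdict

class Broker:
    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Every subscriber of the topic receives its own copy.
        for callback in self._subscribers[topic]:
            callback(message)

broker = Broker()
broker.subscribe("order.placed", lambda m: print("email service:", m))
broker.subscribe("order.placed", lambda m: print("billing service:", m))
broker.publish("order.placed", {"order_id": 42})
```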



4.2.3 Use of this pattern

Use this pattern when a service needs to broadcast information to multiple services. You can add as many listeners as you need without any impact on the sender.

This pattern is commonly used to implement service choreography and the Saga pattern of service communication.

It is also used to implement eventual consistency on the availability-and-partition-tolerance side of the CAP theorem: subscribers have their own availability and SLAs, and process messages when they are available.

4.2.4 Issues and Considerations

Central to this pattern is the message broker or service bus. Rather than building your own, use one of the available products: open-source options like RabbitMQ or Kafka, or cloud provider offerings like AWS SNS or Azure Service Bus.

This pattern may not be worthwhile for a small number of services, as it adds complexity; simple asynchronous communication between the services may suffice.

Pay special attention to security: only authorised services should be able to publish and subscribe on the broker.

This pattern is not suitable when the sender expects a response from the subscriber; in that case, use a request/reply pattern.

With this pattern you depend on the capabilities of the underlying message broker. When choosing a product, consider the messaging challenges relevant to your use case, including message ordering, priority, expiration, duplication, and scheduling.


4.3 Queue-Based Load Leveling Pattern

4.3.1 Problem

Your service might receive a large number of requests during certain periods. If the service is not designed to sustain that load, it may become unresponsive and fail, which can also cascade failure to other services in your system.

Certain services need to be highly responsive, returning a response to the consumer immediately while deferring the processing of the request. How would you design such services?

4.3.2 Solution

Add a message queue to the solution. When the service receives a request, it performs validation and then puts a message on the queue. A consumer process dequeues the requests and processes them.

Under heavy load, requests accumulate in the queue, which acts as a buffer, until the consumer process retrieves and processes them.
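A minimal single-process sketch of the flow; a real system would use a durable queue such as AWS SQS or RabbitMQ:

```python
import queue
import threading

requests = queue.Queue()

def handle_request(payload):
    # Validate, buffer the work, and respond immediately.
    requests.put(payload)
    return {"status": "accepted"}

def worker():
    while True:
        payload = requests.get()
        print("processing", payload)  # the actual (slow) work
        requests.task_done()

threading.Thread(target=worker, daemon=True).start()
print(handle_request({"order_id": 42}))
requests.join()  # demo only: wait for the buffered work to drain
```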



4.3.3 Use of this pattern

This pattern is useful when your service is subject to high load. It makes services more available and resilient and prevents failures from cascading to other services. It is also useful for latency-sensitive applications that require an immediate response from the service.

4.3.4 Issues and Considerations

This pattern relies on message queue technology. Instead of building your own queue, use one of the available products: open-source options like RabbitMQ or Kafka, or cloud provider offerings like AWS SQS or Azure Queue Storage.

In this pattern the service responds to the caller immediately, before the request has actually been processed, so you need a mechanism to confirm successful processing to the caller afterwards. For example, an order service might respond "order received" as soon as the request arrives, and send a confirmation email to the customer once the order has been processed.

Your design will depend on the capabilities of the underlying queue product. Consider concerns such as message ordering, priority, expiration, duplication, and scheduling in your design.

A flood of requests could exceed the queue's limits if consumer processes cannot dequeue and process them in time; autoscale the consumer processes based on queue depth.

Secure access to the queue: only authorised services should send and receive messages.


4.4 Priority Queue Pattern

4.4.1 Problem

Services often use message queues to delegate tasks to backend consumer processes. Messages are typically processed first-in, first-out (FIFO). In some cases you need certain requests to be processed at a higher priority than others, so that the consumer processes handle them first.

4.4.2 Solution

Add a high-priority queue to the design. Requests with high priority go into the high-priority queue; requests with normal or low priority go into the normal queue. Each queue can have a separate pool of consumers, and the high-priority queue can have a larger pool, on more powerful compute, than the low-priority one.
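A minimal sketch with two in-process queues and differently sized consumer pools (the pool sizes are hypothetical):

```python
import queue
import threading

high, normal = queue.Queue(), queue.Queue()

def consume(q, name):
    while True:
        print(name, "processing", q.get())
        q.task_done()

# Two workers for high priority, one for normal priority.
for i in range(2):
    threading.Thread(target=consume, args=(high, f"high-{i}"),
                     daemon=True).start()
threading.Thread(target=consume, args=(normal, "normal-0"),
                 daemon=True).start()

def enqueue(message, priority="normal"):
    (high if priority == "high" else normal).put(message)

enqueue({"order": 1}, priority="high")
enqueue({"order": 2})
high.join(); normal.join()  # demo only: wait for processing
```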



4.4.3 Use of this pattern

Use this pattern when you need to handle requests with different priorities, or when services must meet strict SLAs for certain high-priority requests.

4.4.4 Issues and Considerations

This pattern adds complexity to your architecture, as you must manage a queue for each priority. It adds even more complexity when each priority has its own processing SLA, which requires monitoring the processing rate of each class of message.

Sometimes there are few high-priority messages; you may then end up running many idle consumer processes on expensive compute nodes, increasing infrastructure cost. Consider starting with one or two consumers and autoscaling on queue depth to add more when large numbers of high-priority messages arrive.


5. Data Management Patterns

In this section we will discuss data management patterns. We will start with the CQRS pattern, Command and Query Responsibility Segregation.

5.1 CQRS Pattern

5.1.1 Problem

In traditional application design, the same data model is used to both query and update the database. This works well for small applications, but as reads and writes grow, it creates database performance and scalability issues. The database must lock data during writes, during which reads must wait. With many reads waiting while the database is busy updating a large amount of data, resource contention builds up, the resource pool is exhausted, and the database and your services eventually become unresponsive.

5.1.2 Solution

CQRS stands for Command and Query Responsibility Segregation. Segregate reads and writes into separate read and write data stores, and keep the stores synchronized. Data is read from the dedicated read store via queries and written to the dedicated write store via commands.

There are certain rules for reads and writes (a sketch follows the list):

  • Queries read from the read data store only and never modify data
  • Writes are command-based, executing a task against the write store; a command is named after the action to be performed, e.g., place order
  • To distribute load, commands can be placed on a queue and processed asynchronously
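A minimal sketch of the split, with a direct projection call standing in for the asynchronous synchronization a real system would use:

```python
write_store = {}  # normalised source of truth
read_store = {}   # denormalised view optimised for queries

def place_order_command(order_id, item, qty):  # write side
    write_store[order_id] = {"item": item, "qty": qty}
    project(order_id)  # keep the read model in sync

def project(order_id):
    order = write_store[order_id]
    read_store[order_id] = f'{order["qty"]} x {order["item"]}'

def get_order_query(order_id):  # read side: never modifies data
    return read_store.get(order_id)

place_order_command(42, "widget", 3)
print(get_order_query(42))
```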



5.1.3 Use of this pattern

Use this pattern when you have many reads and writes happening in parallel; segregating them avoids resource contention and performance issues.

It is also useful when writes require complex business logic, which services can delegate to the respective commands. This also helps orchestrate multiple writes as commands executed within a workflow.

You can use this pattern for latency-sensitive applications that need quick responses to every read request.

You can also tune the performance and availability of the read and write stores independently. For read-intensive applications, scale out the read store and keep it synchronized with the write store.

5.1.4 Issues and Considerations

This pattern leads to a more complex application design, with business logic segregated into commands. You also need to synchronize the read store with the write store, which introduces some delay: the data is eventually consistent across both stores, so you may read stale data.

You need to keep the schemas of the read and write stores consistent, otherwise synchronization may fail.

You can use messaging to process commands asynchronously and distribute load. In that case you need a mechanism to retry failed commands and to detect duplicate messages. Messaging also brings challenges such as message ordering, priority, expiration, duplication, and scheduling, which you should consider in your design.


5.2 Materialized View Pattern

5.2.1 Problem

Traditional data stores are often written in a normalized form for storage efficiency. These data formats are not designed around future read requirements.

The complex and growing needs of data analytics require queries that read from many entities with complex joins. Such queries not only read unnecessary data, they are also resource-intensive on the database.

5.2.2 Solution

Create a view that materializes the data in the format in which it will be queried. Materialized views can include calculated columns or transformed values, saving the cost of calculation or transformation at query time.

Every write to the database refreshes the materialized views, including the calculated columns. Materialized views are indexed to increase query performance.

Because materialized views are populated in advance and indexed, they save most of the processing time that complex joins, aggregations, and calculations would otherwise require at query time.
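As an illustration of the idea in application code (in practice the database itself maintains the view, e.g. via triggers or scheduled refresh), a write-time refresh of a precomputed aggregate might look like:

```python
orders = []            # normalised write-side data
sales_by_product = {}  # the "materialized view": precomputed aggregate

def record_order(product, amount):
    orders.append({"product": product, "amount": amount})
    # Refresh the view on every write so reads stay a simple lookup.
    sales_by_product[product] = sales_by_product.get(product, 0) + amount

record_order("widget", 9.99)
record_order("widget", 19.98)
print(sales_by_product["widget"])  # no joins or aggregation at read time
```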



5.2.3 Use of this pattern

Use this pattern when you have complex queries that read from multiple entities and perform complex joins, calculations, or transformations at query time, especially frequent, resource-intensive, long-running queries such as business analytics reports.

This pattern improves query performance and delivers faster results for analytics reports.

The views also simplify analytics queries: you only need simple queries against the materialized views, and developers do not need to know all the entities, aggregation requirements, and computation or transformation logic behind them.

For security and compliance, you can grant access to the materialized view alone and restrict access to the underlying entities.

5.2.4 Issues and Considerations

The materialized view should be updated or refreshed after every write, otherwise queries may read stale data. You should have a mechanism that automatically updates the materialized view’s data after every write, for example database triggers or scheduled tasks.

Consider where to store the materialized view’s data: it can live on the same data store or be partitioned onto a different one.

Design materialized views with their transient nature in mind: they can be rebuilt if lost, and their data can be moved to a different store. The only purpose of a materialized view is faster reads and better query performance.

To maximise performance, apply the computation or transformation when you populate the materialized view, store the results in calculated columns, and index the view’s columns.


6. Final Words

Designing a distributed system is a complex process. Developers often make false assumptions about distributed computing; you should fully understand the fallacies of distributed computing and design your system to avoid them.

In distributed computing you have many fine-grained microservices that must work together to provide the overall system functionality.

Adopt design and implementation patterns when you design your services, and implement resiliency patterns so your architecture is resilient and ready for day-2 operations.

Adopt messaging patterns for microservice communication and to cater for future load and stress on your microservices.

Data management patterns improve the performance of both reads and writes on the data store and provide highly optimised solutions for resource-intensive reads.


