Building Resilient Cloud Systems: Practical Insights on Decoupling and Scaling

Summary

Decoupling techniques play a central role in designing scalable and resilient systems. They allow components to operate independently, reducing dependencies and enabling flexibility. In this article, we’ll look at two communication patterns used to decouple systems: Pub/Sub and message queues, as well as the AWS services that support these models. Understanding the strengths and limitations of each approach, as well as the mindset required, will help you make better decisions that lead to a better architecture.

From Monoliths to Modular Systems

A monolithic application is designed and deployed as a single unit of code. In this approach, all components are tightly integrated into a single codebase and deployed as a single artifact. These architectures have long been a staple of software development because they offer simplicity in development, testing, and deployment and are sometimes the right choice for smaller, less complex applications.

Because they are tightly coupled, monolithic applications usually face challenges and limitations that make the case for distributed systems:

  • Difficult to scale; often vertical scaling is the only option
  • Any change requires redeploying the entire application
  • Risk of cascading failures due to tight coupling, and reduced fault tolerance overall
  • Tied to a single technology stack

As applications grow in scale and complexity, these drawbacks become increasingly apparent, leading many organizations to explore more modular approaches like microservices, event-driven architectures, and others. By separating components so they function independently, each part of the system can be developed, tested, deployed, and maintained without impacting the others.

Messaging Patterns and AWS Tools for System Decoupling

Exploring Pub/Sub Patterns

Pub/Sub is a messaging pattern that enables decoupled communication between services. At its core, pub/sub separates message producers, called publishers, from message consumers, called subscribers, by having them communicate through a central broker or channel. This reduces component dependencies and allows multiple services to simultaneously respond to the same events.

Subscribers can filter messages so they receive only what is relevant to them. This is typically supported by attributes or tags and is especially valuable in complex architectures where multiple services consume a range of events but only need a subset of messages for their functions.
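
To make the pattern concrete, here is a minimal in-memory sketch of a topic with per-subscriber filter policies. All names here are hypothetical; a real broker such as SNS also handles durability, retries, and delivery for you.

```python
class Topic:
    """Minimal in-memory sketch of a pub/sub topic with attribute filtering."""

    def __init__(self):
        self._subscribers = []  # list of (callback, filter_policy) pairs

    def subscribe(self, callback, filter_policy=None):
        # filter_policy maps an attribute name to the set of accepted values
        self._subscribers.append((callback, filter_policy or {}))

    def publish(self, message, attributes=None):
        attributes = attributes or {}
        for callback, policy in self._subscribers:
            # Deliver only if every policy attribute matches the message
            if all(attributes.get(k) in allowed for k, allowed in policy.items()):
                callback(message)

# Usage: two services subscribe; only one filters on event type
received = {"inventory": [], "audit": []}
topic = Topic()
topic.subscribe(received["inventory"].append,
                filter_policy={"event": {"order_created"}})
topic.subscribe(received["audit"].append)  # no filter: receives everything

topic.publish("order #1", attributes={"event": "order_created"})
topic.publish("refund #7", attributes={"event": "refund_issued"})

print(received["inventory"])  # ['order #1']
print(received["audit"])      # ['order #1', 'refund #7']
```

Note how the publisher never learns who is listening: adding a third subscriber requires no change to the publishing side.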

SNS

Amazon Simple Notification Service (SNS) is AWS’s managed pub/sub service. It can distribute messages over various protocols, making it ideal for cloud applications. In SNS, the central message broker or channel is called a topic. Subscribers can include AWS Lambda functions, HTTP endpoints, SQS queues, email, and SMS, among others. When a publisher sends a message to an SNS topic, the message is immediately pushed to all subscribers, allowing for real-time event handling. As a managed AWS service, SNS is designed to be highly available and resilient, and it is capable of handling heavy loads.

In SNS, each subscriber can define a filter policy, a set of key-value pairs that specify the attributes a message must contain to be delivered to that subscriber. For example, a subscriber could specify a filter policy to only receive messages that have the event type “order_created” and the priority set to high.

Use Cases

The advantages of using pub/sub when building a service become clear when looking at real-world applications:

  • Microservices Communication: In a microservice architecture, each service needs to communicate with others and react to their events. For example, an e-commerce application might have separate services for inventory management, order processing, and user notifications. Using SNS, each service can subscribe to the topics relevant to its function and respond without directly coupling to the publisher.
  • Real-Time Notifications: SNS is often used to deliver messages instantly to users via SMS, email, or push notifications.
  • Data Processing Pipelines: In a large-scale platform, data is often collected from multiple sources. SNS can facilitate real-time data streaming and transformation by acting as the central point where data is collected. One pattern frequently seen in the wild is SNS fanning out messages to a separate SQS queue for each downstream service. Using the same e-commerce example, the orders service could publish to an SNS topic, which then delivers the message to separate queues for inventory and shipping processing.
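
The fan-out pattern in the last bullet can be sketched in a few lines. This is a hypothetical in-memory stand-in for SNS delivering an independent copy to per-service SQS queues, not real AWS API calls:

```python
from collections import deque

# One queue per downstream service (stand-ins for SQS queues)
queues = {"inventory": deque(), "shipping": deque()}

def fan_out(message):
    """Deliver an independent copy of the message to every downstream queue."""
    for q in queues.values():
        q.append(dict(message))  # copy, so consumers can't affect each other

fan_out({"order_id": 42, "items": ["book"]})

# Each service consumes from its own queue, at its own pace
inventory_msg = queues["inventory"].popleft()
shipping_msg = queues["shipping"].popleft()
print(inventory_msg["order_id"])  # 42
```

The key property is that each service gets its own durable copy, so a slow shipping consumer never blocks inventory processing.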

The Role of Message Queues in Decoupling

Like pub/sub, message queues allow applications to exchange messages. Unlike pub/sub, where messages are broadcast to all subscribers, a message queue usually follows a point-to-point model in which each message is delivered to only one consumer. The queue stores messages safely until a consumer is ready to process them, making the communication asynchronous and reliable. Typically, when a message is picked up from the queue, it is also removed, allowing the next message to become available for processing. This ensures each message is handled once and promotes efficient load distribution among consumers.
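
The point-to-point model can be illustrated with a small sketch in which two hypothetical workers compete for messages on a single queue; each pop removes the message for everyone:

```python
from collections import deque

# One shared queue; each message will be processed by exactly one worker
queue = deque(f"task-{i}" for i in range(6))
processed = {"worker-a": [], "worker-b": []}

def consume(worker):
    """Pop one message if available; the pop removes it from the queue."""
    if queue:
        processed[worker].append(queue.popleft())

# The workers take turns polling; in practice they would poll concurrently
while queue:
    consume("worker-a")
    consume("worker-b")

total = sorted(processed["worker-a"] + processed["worker-b"])
print(total)  # every task appears exactly once, split across both workers
```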

Based on ordering and processing requirements, there are several types of queues; here are some of the most common:

  • FIFO (First-in, first-out processing) - Commonly used in transactional workflows like order processing.
  • LIFO (Last-in, first-out processing) - Used for undo operations or “latest update” scenarios.
  • Priority (processes based on message priority) - Alerts, critical service requests.
  • Dead-letter (stores failed messages) - Error handling.
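
As a quick illustration of the priority variant, here is a sketch using Python’s heapq, where a lower number means higher priority and a counter keeps equal-priority messages in arrival order (the message names are made up):

```python
import heapq

heap, counter = [], 0

def send(priority, message):
    """Enqueue a message; the counter breaks ties in arrival order."""
    global counter
    heapq.heappush(heap, (priority, counter, message))
    counter += 1

send(2, "routine report")
send(0, "critical alert")    # arrives later but jumps the line
send(1, "service request")

order = [heapq.heappop(heap)[2] for _ in range(len(heap))]
print(order)  # ['critical alert', 'service request', 'routine report']
```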

Because a queue distributes messages among multiple consumers, it can act as a load balancer, allowing for parallel processing and dynamic scalability. For example, if incoming messages spike, additional consumers can be added for faster processing. This makes message queues an effective way to handle variable workloads and maintain system responsiveness during high-demand periods.

SQS

Amazon Simple Queue Service (SQS) is AWS’s managed message queuing service. It’s designed to handle message storage, queueing, and delivery reliably and at scale. SQS offers Standard and FIFO queues, and either can be paired with a dead-letter queue. Let’s see which use cases they fit.

The Standard queue offers high throughput with “at least once” delivery, meaning messages might occasionally be delivered more than once and possibly out of order. This is ideal for applications where high throughput is essential and exact ordering isn’t crucial. Consider the vendors of an e-commerce platform uploading images for their products. These images go through multiple processing steps; ordering is irrelevant in this scenario, high throughput is needed, and duplicate processing is acceptable.
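
Because duplicates are possible with at-least-once delivery, consumers are usually made idempotent. Here is a minimal sketch, assuming a hypothetical message shape with an id field:

```python
# Simulated redelivery: the same message arrives twice
incoming = [
    {"id": "m-1", "body": "resize image A"},
    {"id": "m-2", "body": "resize image B"},
    {"id": "m-1", "body": "resize image A"},  # duplicate redelivery
]

seen, results = set(), []

for msg in incoming:
    if msg["id"] in seen:
        continue  # already processed: safe to drop the duplicate
    seen.add(msg["id"])
    results.append(msg["body"].upper())  # stand-in for the real work

print(results)  # ['RESIZE IMAGE A', 'RESIZE IMAGE B']
```

In a real system the seen set would live in a durable store (a database or cache), since consumers can restart between deliveries.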

FIFO queues, on the other hand, guarantee “exactly once” processing and maintain message order. This queue is ideal for applications that require strict message ordering. In order fulfillment, events for each customer’s order must be processed sequentially to ensure consistency. A FIFO queue ensures that each step in the order fulfillment, such as payment processing, inventory updates, and shipping, is handled in the correct sequence.

A dead-letter queue (DLQ) is used to store messages that fail to be processed after a specified number of attempts. Instead of being reprocessed indefinitely, these messages are sent to the DLQ for further analysis and handling, which ensures that failed messages don’t pile up in the main queue. For example, payments might occasionally fail due to insufficient funds or invalid payment methods; these messages can be routed to a DLQ for later inspection and resolution. Setting up a DLQ is good practice when working with any type of queue.
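
The redrive behavior can be sketched as a simple retry loop; the attempt limit and message shape here are illustrative, not SQS’s actual API:

```python
from collections import deque

MAX_RECEIVES = 3  # after this many failed attempts, move to the DLQ

main_queue = deque([{"id": "pay-1", "attempts": 0}])
dead_letter_queue = []

def process(message):
    """Stand-in handler that always fails, e.g. an invalid payment method."""
    raise ValueError("payment declined")

while main_queue:
    msg = main_queue.popleft()
    try:
        process(msg)
    except ValueError:
        msg["attempts"] += 1
        if msg["attempts"] >= MAX_RECEIVES:
            dead_letter_queue.append(msg)  # park it for later inspection
        else:
            main_queue.append(msg)  # redeliver for another try

print(len(dead_letter_queue))  # the failing message ended up in the DLQ
```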

Use Cases

We’ve already covered some examples of using message queues; let’s recap them:

  • Task processing - any heavy workload that can happen in the background, like image processing on a platform.
  • Transaction processing - scenarios where critical requests have to be processed sequentially and independently of user interactions, like order fulfillment.

Support for Multi-Region Strategies

Both SNS and SQS support multi-region setups. This is ideal for reduced latency and increased resilience.

SNS supports cross-region delivery, allowing messages published to a topic in one region to reach subscribers, such as SQS queues or Lambda functions, in other regions, making it well suited for global applications that require real-time updates.

SQS, on the other hand, doesn’t natively replicate queues across regions but can be configured with custom cross-region replication solutions to synchronize messages, ensuring availability even during regional outages. Together, these options enable a robust, multi-region architecture that balances availability and fault tolerance across regions.

Trade-Offs

Availability, reliability, scalability, extensibility, and security are just some of the characteristics of a system that need to be considered. When analyzing them closely, however, we see that more often than not, a positive change in one of these characteristics leads to a negative change in another. Hence, every decision you make is a trade-off, and benefits and drawbacks need to be balanced so that you end up with the least-bad architecture.

Consider the example we’ve been using so far, the e-commerce platform. The platform has, among others, the following services: OrderService, PaymentService, InventoryService, and ShippingService. Once an order is placed, payment must be processed, inventory must be updated, and the warehouses must be notified that they need to ship the order.

[Diagram: Platform Services]

Based on what the article has covered, our options are, of course, either a topic or message queues. To be more precise, the choice is between a single topic and a separate message queue for each downstream service.

[Diagram: Pub/Sub vs Message Queues]

Looking at the diagrams above, one option seems simpler and more extensible: the topic version. The orders service is completely decoupled from the subscribers, only one connection to the topic is needed, and we can easily add more downstream services. In the message queue version, each new service requires a new queue and a new connection.

Looking at the advantages of the topic approach, it seems like the better option. However, we also need to consider the trade-offs. With a topic, there is no backlog of messages to monitor, which makes it harder for downstream services to scale automatically based on load. Moreover, in an ever-changing system, message formats evolve. All subscribers must share the same contract and will be impacted by any change to it.

Since services can subscribe on the fly, a bad actor could do the same, creating a security issue. With message queues, a rogue consumer would remove messages from a queue, which would eventually raise a red flag. With a topic, each message is broadcast to all subscribers, so an unauthorized subscriber receives its own copy and there’s no way of knowing.


So which is the better option? The best choice is the one that aligns most closely with your platform's needs. Deciding what matters more, extensibility or scalability, consistency or availability, depends on factors specific to your organization.

Next Steps

By exploring key communication patterns for decoupling systems and understanding the tradeoff mindset necessary for architectural decisions, you now have the building blocks to create scalable and resilient architectures. I encourage you to keep exploring these techniques further. The more you delve into these strategies, the better equipped you'll be to design efficient and adaptable systems that meet the evolving demands of your organization.

