Cost-Centric Architectural Decision Process

Today, I’d like to share the architecture decision-making process I’ve relied on for many years. It starts at the moment a new requirement emerges and ends at the point where we’re ready to design an architectural change.

Making proactive architectural decisions requires a clear understanding of the current business landscape and the technical environment. We’ll begin by establishing this foundation in the background section. Next, we’ll examine how to assess the impact of a new non-functional requirement. To do this, we’ll develop a Quality Attribute Scenario (QAS) to clarify how the requirement impacts the architecture and identify areas that require attention.

Once we fully understand the requirement and QAS, we’ll move to an Architecture Decision Record and compare two possible solutions. Each option will be evaluated for how well it meets the new requirement and for its practical feasibility. By weighing the reasoning behind each choice, we’ll reach a sound decision that sets a robust future state architecture.

While this article focuses on rationale-based decision-making, it will not cover the design part of the selected option. Here, we focus on preparing for change by making architectural decisions that are context-rich, thoroughly analyzed, and aligned with a clear vision of future needs.

Background

ModaMosaic, a renowned fashion house, is known for its high-end and exclusive product lines. One of its premium items, the Midnight Cascade dress, is in high demand and requires meticulous order management and data handling. The organization relies on an interconnected system called Midnight Cascade, which consists of a mobile application, API, back-office UI, database, and image storage to manage orders, update product data, and process inventory. The business expects a surge in sales of the Midnight Cascade dress soon, driven by an aggressive marketing campaign.

High-Level Architecture View

The current logical architecture without technology mapping for the scenario looks like the following:

Figure 1. Midnight Cascade Logical Architecture

Order Placement by Store Managers

Every day, between 1,000 and 1,500 store managers at various ModaMosaic stores use the mobile application to submit customer orders for the Midnight Cascade dress.

When a customer requests an order, the store manager logs in to the mobile application, navigates to the product catalog (populated by data fetched from the API), and places the order. The mobile application sends this order data to the API, which records it in the database. Any product images generated in the mobile application related to the dress are also uploaded to the API and stored in image storage for easy retrieval.

The real challenge is that order updates from the manufacturing side can take hours or even days, often requiring phone calls with the order fulfillment team to confirm whether the dress is in production. The cause is not human error: the maintenance team has identified that order status updates are getting stuck on the API side. Because of extensive legacy code, resolving the issue isn't straightforward and would require significant refactoring of the component.

Order Management by the Order Fulfillment Team

The order fulfillment team at ModaMosaic includes 10 people and operates on the manufacturing side. The team's primary function is to process and update orders in case of any changes caused by the manufacturing side. The team would greatly benefit from real-time analytics dashboards to monitor orders.

Once store managers submit orders through the mobile application, the back-office UI fetches this order data from the API, allowing the team to review and manage incoming requests. The team accesses the back-office UI to fetch these orders, processes them to check inventory availability and manufacturing schedules, and updates the order status accordingly. This interaction triggers the API to save updated order statuses to the database.

Product Data Management by the Product Team

The product team manages and updates product data for items like the Midnight Cascade dress. The team size is around 10 people, but only a few (2-3) work with the product. When new product details need to be upserted (updated or inserted), the product team accesses the back-office UI. They log in and navigate to the product management section to make changes. The back-office UI communicates with the API to push these updates, storing the latest product data in the database. If any new images or revisions are needed, the UI pulls the images from image storage for review or updates.

Scenario Walkthrough

A store manager at a ModaMosaic location receives an in-store request for the Midnight Cascade dress. They use the mobile application to place an order, which the API saves in the database.

Once the order is received, the order fulfillment team fetches the new order details from the back-office UI, checks for manufacturing feasibility, and updates the order status. The update is then sent back to the API and stored in the database, allowing the store manager to view the status through the mobile application.

The product team, noticing a recent fabric update to the dress, logs into the back-office UI to update the product description and upload a revised image, which the API processes and saves to the image storage.

Core Infrastructure

The Midnight Cascade solution is hosted on AWS in the US East (Ohio) region and leverages on-demand pricing.

Below is a mapping table outlining the relationship between the system components and the corresponding cloud services.

Table 1. Mapping of System Components to Cloud Services

New Business Requirement

To speed up product delivery and enhance customer satisfaction, ModaMosaic wants to implement real-time order status updates for every order placed for high-demand products, including the Midnight Cascade dress. This capability should allow store managers and the order fulfillment team to track orders through each stage in real-time (from placement to manufacturing to shipment and completion). This enhancement must reduce the need for manual follow-up.

Quality Attribute Scenario

The quality attribute scenario is one of the most powerful tools in a solution architect's arsenal. A quality attribute scenario is a structured way to represent a system's response to a particular stimulus, clarifying non-functional requirements. These scenarios define how a system should respond under certain conditions. Let’s transform the new business requirement into a quality attribute scenario.

Table 2. QAS for Real-Time Order Status Updates
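A quality attribute scenario is commonly written with six parts: source, stimulus, artifact, environment, response, and response measure. As a rough sketch, the scenario for this requirement could be captured in a small Python structure. The class name, field names, and wording of the values are illustrative assumptions paraphrased from the requirement and the two-second target; they are not a copy of the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityAttributeScenario:
    """Six-part quality attribute scenario (illustrative structure)."""
    source: str            # who or what triggers the stimulus
    stimulus: str          # the event arriving at the system
    artifact: str          # the part of the system that is stimulated
    environment: str       # conditions under which the stimulus occurs
    response: str          # what the system should do
    response_measure: str  # how the response is judged

    def summary(self) -> str:
        return (f"When {self.source} causes '{self.stimulus}' under "
                f"{self.environment}, {self.artifact} must "
                f"{self.response} ({self.response_measure}).")


# Hypothetical wording, paraphrased from the requirement in this article.
order_status_qas = QualityAttributeScenario(
    source="the order fulfillment team",
    stimulus="an order status change",
    artifact="the mobile application and back-office UI",
    environment="normal and peak load",
    response="show the new status without manual follow-up",
    response_measure="visible within two seconds",
)

print(order_status_qas.summary())
```

Writing the response measure as its own field keeps the testable part of the requirement separate from the narrative, which makes it easier to check each option against it later.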

Architecture Decision Record

Context

ModaMosaic needs to provide real-time updates on order statuses for the Midnight Cascade dress across its mobile application and back-office UI. The idea is to improve the responsiveness of order tracking for store managers and the order fulfillment team. The target latency for reflecting any change in order status should be under two seconds.

Decision Drivers

  1. The system must ensure that order updates are visible on the mobile application and the back-office UI within two seconds of a status change.
  2. The solution should minimize API requests and compute costs while maintaining the desired level of responsiveness.
  3. The organization prefers to leverage the existing database and API service to avoid introducing additional services or complexity.
  4. The solution should require minimal ongoing maintenance.

Considered Options

Option 1: Polling-Based Updates with Reduced Frequency

This approach leverages periodic polling from both interfaces to retrieve updated order information from the API service.

Instead of implementing continuous polling, which would increase API request costs and load on the backend, the polling interval is fixed at two seconds to match the latency target.

Behind the Scenes

  1. The mobile application requests the API service every two seconds to check for recent order status changes.
  2. Simultaneously, the back-office UI polls the API at the same interval to retrieve updated order statuses.
  3. The API service handles requests from the mobile application and the back-office UI, querying the database and returning updated statuses as needed.
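The steps above can be sketched as a small client-side loop. This is a minimal illustration, not the actual client code: `fetch` stands in for the HTTP call to the API service, and the function name, the `max_polls` limit, and the injectable `sleep` are assumptions added so the example can run without a network.

```python
import time
from typing import Callable, Dict, Iterator

Statuses = Dict[str, str]  # order id -> status


def poll_order_statuses(
    fetch: Callable[[], Statuses],
    interval_seconds: float = 2.0,
    max_polls: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Statuses]:
    """Yield only the statuses that changed since the previous poll."""
    known: Statuses = {}
    for attempt in range(max_polls):
        current = fetch()  # one HTTP round trip in a real client
        changed = {oid: s for oid, s in current.items() if known.get(oid) != s}
        known = dict(current)
        if changed:
            yield changed
        if attempt < max_polls - 1:
            sleep(interval_seconds)


# Stubbed API responses standing in for three consecutive polls.
responses = iter([
    {"A-1": "placed"},
    {"A-1": "placed"},          # no change, but the request was still sent
    {"A-1": "in production"},
])
updates = list(poll_order_statuses(lambda: next(responses), sleep=lambda _: None))
print(updates)
```

Note that the second poll returns nothing new yet still costs a full request, which is the inefficiency discussed in the cons below.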

Pros

  1. The mobile application and back-office UI can use simple HTTP requests to the API to retrieve updated order status data.
  2. Polling is a reliable mechanism for fetching data, as it doesn’t rely on real-time event delivery, which could fail under certain conditions (e.g., network issues or service downtime).
  3. This method uses existing components, eliminating the need for significant infrastructure changes or additions like message queues.

Cons

  1. Polling sends regular requests even when there are no updates in the data, leading to inefficiency in terms of API and database resources and consuming bandwidth and processing power.
  2. As the number of stores or orders grows, the volume of polling requests increases. This increase requires scaling the backend infrastructure to handle the additional load. If not properly managed, it can lead to performance degradation.
  3. Although polling at a 2-second interval can provide near-real-time updates, the system still has an inherent delay. For example, even with the reduced polling frequency, order status updates are only reflected after the next polling cycle, which is not truly instant.
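To put the scaling concern in numbers, the sketch below estimates the request rate generated by polling. It assumes, as an upper bound, that every store manager keeps the application open at the same time and that clients poll for eight hours a day; both assumptions are hypothetical and should be replaced with measured usage.

```python
def polling_requests_per_second(active_clients: int, interval_seconds: float) -> float:
    """Steady-state request rate when every client polls on a fixed interval."""
    return active_clients / interval_seconds


def polling_requests_per_day(active_clients: int, interval_seconds: float,
                             active_hours: float) -> int:
    """Total requests per day for clients that poll during `active_hours`."""
    per_second = polling_requests_per_second(active_clients, interval_seconds)
    return round(per_second * active_hours * 3600)


INTERVAL = 2.0       # seconds, from Option 1
ACTIVE_HOURS = 8.0   # assumption: one working shift per day

for managers in (1000, 1500):  # daily store managers from the Background section
    print(managers,
          polling_requests_per_second(managers, INTERVAL),
          polling_requests_per_day(managers, INTERVAL, ACTIVE_HOURS))
```

Under these assumptions the API would serve 500 to 750 requests per second from the mobile application alone, regardless of how many order statuses actually changed.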

Required Efforts

Table 3. Required Efforts for Option 1

Incremental Infrastructure Costs

Infrastructure costs could rise depending on how often polling occurs and the volume of orders.

Compute for the API Service

We have to consider the increased load caused by frequent API polling and the potential need for auto-scaling. Additional details:

  • The cost per t3.medium instance is approximately $0.0416 per hour for the used region.
  • There are currently two instances serving requests.
  • One additional instance is needed to handle the polling load, bringing the normal count to three.
  • Assume 24/7 uptime due to the polling load, which fully utilizes the instance.
  • Auto-scaling might trigger an additional instance to accommodate peak polling periods. This means we will have up to 4 instances running during peak times.
  • A month has about 730 hours (365 days × 24 hours ÷ 12 months).

Using the details above, let's calculate the monthly cost for running these instances in the following two scenarios:

  • Normal load periods, running 3 instances 24/7: 3 instances × 730 hours × $0.0416 per hour = $91.14
  • Peak load periods, running 4 instances 24/7 due to polling load: 4 instances × 730 hours × $0.0416 per hour = $121.47

If we assume that peak load occurs approximately 40% of the time and normal load occurs 60% of the time, we can calculate blended monthly costs:

  • (60% × $91.14) + (40% × $121.47) = $54.68 + $48.59 = $103.27

Table 4. API Service Blended Compute Costs for Option 1

To find the incremental monthly cost, we subtract the cost of the two API instances that are already running (2 instances × 730 hours × $0.0416 per hour = $60.74) from the blended cost.

The estimated incremental monthly cost for API compute:

  • $103.27 - $60.74 = $42.53
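The blended-cost arithmetic above can be reproduced with a short script, which also makes it easy to try other peak-load shares. This is a sketch under the article's own assumptions (730 hours per month, $0.0416 per hour for t3.medium); the function names are illustrative. Because it works from unrounded intermediate values, its output can differ from the figures above by a few cents.

```python
HOURS_PER_MONTH = 730
T3_MEDIUM_HOURLY = 0.0416  # USD, on-demand rate quoted above


def monthly_cost(instances: int, hourly_rate: float = T3_MEDIUM_HOURLY) -> float:
    """Cost of running `instances` around the clock for one month."""
    return instances * HOURS_PER_MONTH * hourly_rate


def blended_cost(normal_instances: int, peak_instances: int, peak_share: float) -> float:
    """Weight the normal and peak monthly costs by the share of time spent in each."""
    normal = monthly_cost(normal_instances)
    peak = monthly_cost(peak_instances)
    return (1 - peak_share) * normal + peak_share * peak


baseline = monthly_cost(2)  # the two instances that already run today
blended = blended_cost(normal_instances=3, peak_instances=4, peak_share=0.40)
incremental = blended - baseline
print(f"baseline={baseline:.2f} blended={blended:.2f} incremental={incremental:.2f}")
```

Changing `peak_share` shows how sensitive the estimate is to the 40% peak assumption, which is the least certain input in this calculation.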

Load Balancer for API Service Instances

We must consider the load balancer costs of distributing traffic between the API Service instances.

  • AWS Application Load Balancer costs $0.0225 per hour plus $0.008 per GB of data processed.

Cost per month:

  • $0.0225 per hour × 24 hours × 30 days = $16.20

Currently, the load balancer processes ~75 GB of data per month:

  • $0.08 per GB * 75 = $6

Assuming ~100 GB of new data will be processed by the load balancer per month after implementing the change, this adds:

  • $0.08 per GB * 100 = $8

Therefore, the total load balancer cost is $30.20 per month ($16.20 + $6 + $8), of which $8 is the incremental cost.

Database Instance Upgrade

The polling approach doesn’t significantly impact database write operations but does require vertical scaling of the existing database instance to handle frequent read requests.

The existing db.t3.medium instance incurs a monthly cost of:

  • 1 instance × $0.136 per hour × 730 hours per month (100% utilization) = $99.28

After upgrading to db.t3.large, the monthly cost equals:

  • 1 instance × $0.272 per hour × 730 hours per month (100% utilization) = $198.56

Incremental monthly cost for the database instance:

  • $198.56 - $99.28 = $99.28

Summary

Table 5. Summary of Incremental Infrastructure Costs for Option 1

The total estimated incremental monthly cost for option 1 is $149.81 ($42.53 for API compute + $8 for the load balancer + $99.28 for the database).

Option 2: WebSocket-based Real-Time Notifications

The system establishes a persistent WebSocket connection between the client (back-office UI or mobile application) and the server. Thus, the server can instantly push updates to the client whenever changes occur in the system.

In this approach, we will separate the WebSocket implementation from the existing API service, transitioning gradually to the new setup. Consequently, initial costs will be higher since the WebSocket service will be deployed on a separate set of cloud instances, potentially conflicting with decision drivers #3 and #4. The development team expects to move more features to the WebSocket server over time. However, the API service will remain in place, as using WebSockets for infrequent requests is impractical. Over time, API costs are expected to decrease as the workload shifts.

Behind the Scenes

  1. When a client launches, it establishes a WebSocket connection to the server. This connection remains open for the duration of the client session, forming a persistent link between the client and server.
  2. For each event that requires notifying clients, the backend generates a WebSocket message containing relevant update data, such as the new order details or updated product information.
  3. The server uses the open WebSocket connection to push this update immediately to all connected clients.
  4. Upon receiving the WebSocket message, the client processes the data and updates the UI in real-time.
  5. If a client loses connectivity (e.g., due to a network issue), it attempts to reconnect automatically.
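The push model in the steps above can be illustrated with an in-process sketch. It does not open real WebSocket connections: each connected client is modeled as an `asyncio.Queue`, and the `OrderStatusHub` class and its method names are hypothetical. A real service would replace the queues with WebSocket sessions managed by a server library.

```python
import asyncio
from typing import Dict, List


class OrderStatusHub:
    """Keeps one outbound queue per connected client and fans updates out."""

    def __init__(self) -> None:
        self._clients: Dict[str, asyncio.Queue] = {}

    def connect(self, client_id: str) -> asyncio.Queue:
        # Step 1: the client registers and keeps its channel open.
        queue: asyncio.Queue = asyncio.Queue()
        self._clients[client_id] = queue
        return queue

    def disconnect(self, client_id: str) -> None:
        self._clients.pop(client_id, None)

    async def publish(self, order_id: str, status: str) -> int:
        # Steps 2-3: build one message and push it to every open channel.
        message = {"order_id": order_id, "status": status}
        for queue in self._clients.values():
            await queue.put(message)
        return len(self._clients)


async def demo() -> List[dict]:
    hub = OrderStatusHub()
    mobile = hub.connect("mobile-app")
    back_office = hub.connect("back-office-ui")
    await hub.publish("A-1", "in production")
    # Step 4: each client receives the update without having asked for it.
    return [await mobile.get(), await back_office.get()]


received = asyncio.run(demo())
print(received)
```

The key difference from polling is visible in `publish`: work happens only when a status actually changes, and its cost grows with the number of connected clients rather than with elapsed time.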

Pros

  1. WebSocket connections maintain a single, persistent channel, which reduces redundant traffic.
  2. The system sends updates to clients as soon as they occur, so users see real-time changes without delay.
  3. By pushing updates only when necessary, the system avoids the constant processing demands of polling.
  4. More efficient CPU and memory usage on both the server and client sides.

Cons

  1. Setting up and managing WebSocket connections is more complex than traditional HTTP APIs. Developers must account for connection lifecycle events (such as reconnecting if a connection drops) and client session management.
  2. Maintaining open WebSocket connections for multiple clients can consume server memory and CPU, especially if there are many concurrent connections.
  3. Implementing WebSocket-based notifications requires new infrastructure, leading to higher initial costs.
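Part of the connection lifecycle complexity is deciding how quickly a disconnected client should retry. One common approach, shown here as an assumption rather than the project's chosen design, is exponential backoff with optional jitter. The function name and default values are illustrative.

```python
import random
from typing import List, Optional


def reconnect_delays(attempts: int, base: float = 1.0, cap: float = 30.0,
                     jitter: float = 0.0,
                     rng: Optional[random.Random] = None) -> List[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped at `cap`.

    `jitter` is the maximum fraction by which each delay is randomly shortened
    so that clients do not all reconnect at the same instant.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))
        delay -= delay * jitter * rng.random()
        delays.append(delay)
    return delays


print(reconnect_delays(6))
```

With the defaults, the delays double from one second up to the 30-second cap. Enabling jitter spreads reconnection attempts out after an outage, which helps avoid a burst of simultaneous connections against the WebSocket service.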

Required Efforts

Table 6. Required Efforts for Option 2

Incremental Infrastructure Costs

Let's consider the primary components required and estimate the costs based on the increased demand from maintaining open connections for real-time updates.

Compute for WebSocket Service

Since WebSocket connections are persistent, they will demand more memory and CPU resources than typical HTTP requests.

Based on the resources currently allocated for the API workload, we anticipate that two t3.medium instances will offer adequate capacity to support the WebSocket approach, with the possibility of scaling out to three instances.

Let's calculate the monthly cost for running these instances in the following two scenarios:

  • Normal load periods, running 2 instances 24/7: 2 instances × 730 hours × $0.0416 per hour = $60.74
  • Peak load periods, running 3 instances 24/7: 3 instances × 730 hours × $0.0416 per hour = $91.10

If we assume that peak load occurs approximately 20% of the time and normal load occurs 80% of the time, we can calculate blended monthly costs:

  • (80% × $60.74) + (20% × $91.10) = $48.59 + $18.22 = $66.81

Table 7. WebSocket Service Blended Compute Costs for Option 2

The estimated blended monthly cost for the WebSocket compute instance is $66.81.

Load Balancer for WebSocket Connections

WebSocket connections typically require a load balancer configured to handle sticky sessions and WebSocket compatibility.

  • AWS Application Load Balancer costs $0.0225 per hour plus $0.008 per GB of data processed.

Cost per month:

  • $0.0225 per hour × 24 hours × 30 days = $16.20

Assuming ~100 GB of new data will be processed by the load balancer per month after implementing the change, this adds:

  • $0.08 per GB * 100 = $8

Therefore, the total load balancer cost is $24.20 per month ($16.20 + $8). Because this load balancer is new, the full amount is incremental.

Database Instance Upgrade

More frequent real-time updates may require a higher-capacity database instance to manage the increased read and write operations.

The existing db.t3.medium instance incurs a monthly cost of:

  • 1 instance × $0.136 per hour × 730 hours per month (100% utilization) = $99.28

After upgrading to db.t3.large, the monthly cost equals:

  • 1 instance × $0.272 per hour × 730 hours per month (100% utilization) = $198.56

The incremental monthly cost for the database instance:

  • $198.56 - $99.28 = $99.28

Additional Storage for WebSocket Session Management and Logs

Persistent connections will generate additional logging, including WebSocket connection states, session data, and reconnections.

  • The estimated storage need is 5 GB per month for logs and session data.
  • Storage cost for the S3 Standard class is $0.023 per GB

Total monthly cost for storage:

  • 5 GB × $0.023 = $0.12

Summary

Table 8. Summary of Incremental Infrastructure Costs for Option 2

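The total estimated incremental monthly cost for option 2 is $190.41 ($66.81 for WebSocket compute + $24.20 for the load balancer + $99.28 for the database + $0.12 for storage), which is about 27.1% higher than option 1 ($149.81).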
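As a cross-check, the two totals and the relative difference can be recomputed from the per-component figures in the sections above. The dictionary keys and function names are illustrative; the amounts are the article's own incremental estimates.

```python
# Incremental monthly costs in USD, taken from the per-component sections.
OPTION_1 = {"api_compute": 42.53, "load_balancer": 8.00, "database": 99.28}
OPTION_2 = {"websocket_compute": 66.81, "load_balancer": 24.20,
            "database": 99.28, "storage": 0.12}


def total(costs: dict) -> float:
    """Sum the components and round to cents."""
    return round(sum(costs.values()), 2)


def percent_higher(a: float, b: float) -> float:
    """How much higher `a` is than `b`, as a percentage of `b`."""
    return (a - b) / b * 100


t1, t2 = total(OPTION_1), total(OPTION_2)
print(f"option 1: ${t1:.2f}, option 2: ${t2:.2f}, "
      f"difference: {percent_higher(t2, t1):.1f}%")
```

Keeping the components in one place makes it straightforward to rerun the comparison when a price or a sizing assumption changes.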

Decision Outcome

Figure 2. Pros & Cons Comparison Quadrant

Chosen Option: Option 2 – WebSocket-based Real-Time Notifications

Why: Option 2 offers a clear advantage by instantly delivering status updates whenever changes occur. This approach meets the sub-two-second response goal and avoids the heavy resource use associated with constant polling in Option 1. WebSockets reduce unnecessary API and database load by sending updates only when needed, which helps control costs as demand grows. Although this option involves a more complex setup, its efficiency in managing high-frequency updates makes it an ideal solution as the business scales.

Decision Date: 14.01.24

Decision Makers: Architecture and Development Teams

Next Steps

  1. Update the software architecture documentation to reflect the chosen option.
  2. Define backlog for the feature implementation in the project management tool.

Related Documents

  1. System architecture guidelines for WebSocket integration.
  2. Real-time monitoring and performance optimization standards.

Conclusion

Figure 3. Complete Process

In this scenario, I omitted low-level details because my primary focus was on calculating the direct infrastructure costs, which cover the basic expenses tied directly to the resources in use, like compute and storage. However, to gain a fuller understanding of the total cost impact, also consider indirect expenses that arise from the time and effort invested by teams (mentioned in the corresponding Required Efforts sections). You can estimate effort costs by analyzing the hourly rates and resource allocation for each task or role involved.
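A simple way to turn effort estimates into a cost figure is to multiply hours by hourly rate for each task and add the results. The sketch below shows the calculation only; the task names, hours, and rates are placeholders, not values from the Required Efforts tables.

```python
from typing import Iterable, NamedTuple


class EffortItem(NamedTuple):
    task: str
    hours: float
    hourly_rate: float  # USD per hour


def effort_cost(items: Iterable[EffortItem]) -> float:
    """Total one-off cost of the listed tasks."""
    return sum(item.hours * item.hourly_rate for item in items)


# Placeholder figures only; substitute your own estimates per task or role.
example = [
    EffortItem("backend development", hours=40, hourly_rate=50.0),
    EffortItem("testing", hours=16, hourly_rate=40.0),
]
print(effort_cost(example))
```

Unlike the monthly infrastructure figures, this is a one-off cost, so it is usually compared by spreading it over the expected lifetime of the solution.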

As you may have noticed, decision-making in architecture often goes beyond simply weighing pros and cons. Sometimes, even when all information is available, it takes time to choose the optimal path. Instruments like SWOT Analysis, Weighted Criteria Matrix, Impact and Effort Matrix, Risk Assessment Matrix, and others can be highly effective in making the best decision in more complex scenarios.
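As one example of these instruments, a weighted criteria matrix scores each option against the decision drivers and multiplies each score by the driver's weight. The weights and scores below are purely hypothetical and chosen for illustration; in practice they would come from the stakeholders, and different values can change the ranking.

```python
from typing import Dict

# Hypothetical weights for the four decision drivers (must sum to 1).
WEIGHTS: Dict[str, float] = {
    "latency": 0.4,
    "cost": 0.3,
    "reuse_existing_services": 0.2,
    "low_maintenance": 0.1,
}

# Hypothetical scores on a 1-5 scale (5 is best).
SCORES: Dict[str, Dict[str, int]] = {
    "polling": {"latency": 3, "cost": 3,
                "reuse_existing_services": 5, "low_maintenance": 4},
    "websocket": {"latency": 5, "cost": 3,
                  "reuse_existing_services": 2, "low_maintenance": 3},
}


def weighted_score(scores: Dict[str, int], weights: Dict[str, float]) -> float:
    """Sum of score × weight over all criteria."""
    return sum(scores[criterion] * weight for criterion, weight in weights.items())


ranking = {option: round(weighted_score(s, WEIGHTS), 2) for option, s in SCORES.items()}
print(ranking)
```

The value of the matrix lies less in the final number than in forcing the team to state explicitly how much each driver matters.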

What approaches do you rely on for making decisions?

#ArchitectureDecisionMaking #FinOps #SolutionArchitecture #CloudDesign
