Cost-Centric Architectural Decision Process
Oleksandr Brazhnyk
Strategic Technology Leader // Designing High-Quality Solutions and Value-Focused Consulting Services
Today, I’d like to share the architecture decision-making process I’ve relied on for many years. It starts at the moment a new requirement emerges and runs to the point where we’re ready to design an architectural change.
Making proactive architectural decisions requires a clear understanding of the current business landscape and the technical environment. We’ll begin by establishing this foundation in the background section. Next, we’ll examine how to assess the impact of a new non-functional requirement. To do this, we’ll develop a Quality Attribute Scenario (QAS) to clarify how the requirement impacts the architecture and identify areas that require attention.
Once we fully understand the requirement and QAS, we’ll move to an Architecture Decision Record and compare two possible solutions. Each option will be evaluated for how well it meets the new requirement and for its practical feasibility. By weighing the reasoning behind each choice, we’ll reach a sound decision that sets a robust future state architecture.
While this article focuses on rationale-based decision-making, it will not cover the design part of the selected option. Here, we focus on preparing for change by making architectural decisions that are context-rich, thoroughly analyzed, and aligned with a clear vision of future needs.
Background
ModaMosaic, a renowned fashion house, is known for its high-end and exclusive product lines. One of its premium items, the Midnight Cascade dress, is in high demand and requires meticulous order management and data handling. The organization relies on an interconnected system called Midnight Cascade, which consists of a mobile application, API, back-office UI, database, and image storage to manage orders, update product data, and process inventory. The business expects a surge in sales of the Midnight Cascade dress soon, driven by an aggressive marketing campaign.
High-Level Architecture View
The current logical architecture without technology mapping for the scenario looks like the following:
Order Placement by Store Managers
Every day, between 1,000 and 1,500 store managers at various ModaMosaic stores use the mobile application to submit customer orders for the Midnight Cascade dress.
When a customer requests an order, the store manager logs in to the frontend system, navigates to the product catalog (populated by data fetched from the API), and places the order. The mobile application sends this order data to the API, which records it in the database. Any product images generated in the mobile application related to the dress are also uploaded to the API and stored in image storage for easy retrieval.
The real challenge is that order updates from the manufacturing side can take hours or even days, often requiring phone calls with the order fulfillment team to confirm whether the dress is in production. This is not a people problem. The maintenance team has identified that the order status updates are getting stuck on the API side. Because of extensive legacy code, resolving the issue isn't straightforward and would require significant refactoring of the component.
Order Management by the Order Fulfillment Team
The order fulfillment team at ModaMosaic includes 10 people and operates on the manufacturing side. The team's primary function is to process and update orders in case of any changes caused by the manufacturing side. The team would greatly benefit from real-time analytics dashboards to monitor orders.
Once store managers submit orders through the mobile application, the back-office UI fetches this order data from the API, allowing the team to review and manage incoming requests. The team accesses the back-office UI to fetch these orders, processes them to check inventory availability and manufacturing schedules, and updates the order status accordingly. This interaction triggers the API to save updated order statuses to the database.
Product Data Management by the Product Team
The product team manages and updates product data for items like the Midnight Cascade dress. The team size is around 10 people, but only a few (2-3) work with the product. When new product details need to be upserted (updated or inserted), the product team accesses the back-office UI. They log in and navigate to the product management section to make changes. The back-office UI communicates with the API to push these updates, storing the latest product data in the database. If any new images or revisions are needed, the UI pulls the images from image storage for review or updates.
Scenario Walkthrough
A store manager at a ModaMosaic location receives an in-store request for the Midnight Cascade dress. They use the mobile application to place an order, which the API saves in the database.
Once the order is received, the order fulfillment team fetches the new order details from the back-office UI, checks for manufacturing feasibility, and updates the order status. The update is then sent back to the API and stored in the database, allowing the store manager to view the status through the mobile application.
The product team, noticing a recent fabric update to the dress, logs into the back-office UI to update the product description and upload a revised image, which the API processes and saves to the image storage.
Core Infrastructure
The Midnight Cascade solution is hosted on AWS in the US East (Ohio) region and leverages on-demand pricing.
Below is a mapping table outlining the relationship between the system components and the corresponding cloud services.
New Business Requirement
To speed up product delivery and enhance customer satisfaction, ModaMosaic wants to implement real-time order status updates for every order placed for high-demand products, including the Midnight Cascade dress. This capability should allow store managers and the order fulfillment team to track orders through each stage in real time (from placement to manufacturing to shipment and completion). This enhancement must reduce the need for manual follow-up.
Quality Attribute Scenario
The quality attribute scenario is one of the most powerful tools in a solution architect's arsenal. A quality attribute scenario is a structured way to represent a system's response to a particular stimulus, clarifying non-functional requirements. These scenarios define how a system should respond under certain conditions. Let’s transform the new business requirement into a quality attribute scenario.
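To make the structure concrete, here is a minimal sketch of a quality attribute scenario as a small data type. The six fields follow the commonly used scenario template (source, stimulus, artifact, environment, response, response measure). The class name, the field values, and especially the environment are my illustrative assumptions drawn from the requirement text, not the article's official scenario.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityAttributeScenario:
    """The six commonly used parts of a quality attribute scenario."""
    source: str            # who or what triggers the stimulus
    stimulus: str          # the event arriving at the system
    artifact: str          # the part of the system that is stimulated
    environment: str       # conditions under which the stimulus occurs
    response: str          # what the system should do
    response_measure: str  # how the response is judged

    def as_sentence(self) -> str:
        return (
            f"When {self.source} causes '{self.stimulus}' on {self.artifact} "
            f"during {self.environment}, the system should {self.response}, "
            f"measured by: {self.response_measure}."
        )


# Illustrative values taken from the requirement text; the environment is an assumption.
order_status_qas = QualityAttributeScenario(
    source="the order fulfillment team",
    stimulus="an order status change",
    artifact="the mobile application and back-office UI",
    environment="normal and peak sales load (assumed)",
    response="show the new status to store managers and the fulfillment team",
    response_measure="status visible in under two seconds",
)

print(order_status_qas.as_sentence())
```

Writing the scenario down this way forces every part to be filled in, and the response measure becomes the yardstick each option is later judged against.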
Architecture Decision Record
Context
ModaMosaic needs to provide real-time updates on order statuses for the Midnight Cascade dress across its mobile application and back-office UI. The idea is to improve the responsiveness of order tracking for store managers and the order fulfillment team. The target latency for reflecting any change in order status should be under two seconds.
Decision Drivers
Considered Options
Option 1: Polling-Based Updates with Reduced Frequency
This approach leverages periodic polling from both interfaces to retrieve updated order information from the API service.
Instead of implementing continuous polling, which would increase API request costs and load on the backend, the polling interval is fixed at two seconds.
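A rough sketch of what a two-second interval means for the API, assuming the upper bound of 1,500 store managers plus the 10-person fulfillment team all polling at once. The function name and the 0.2-second round trip are hypothetical, and real concurrency will be lower than this ceiling.

```python
def polling_load(clients: int, interval_s: float, round_trip_s: float = 0.0) -> dict:
    """Estimate request rate and worst-case staleness for fixed-interval polling."""
    if clients < 0 or interval_s <= 0:
        raise ValueError("clients must be >= 0 and interval_s must be > 0")
    return {
        "requests_per_second": clients / interval_s,
        "requests_per_hour": clients / interval_s * 3600,
        # A change that lands right after a poll waits a full interval,
        # plus the time the next request takes to return.
        "worst_case_staleness_s": interval_s + round_trip_s,
    }


# Upper bound from the article: 1,500 store managers plus a 10-person fulfillment team.
# Assumes every client polls at the same time, which overstates real load.
# round_trip_s=0.2 is a hypothetical placeholder, not a measured value.
estimate = polling_load(clients=1510, interval_s=2.0, round_trip_s=0.2)
print(estimate)
```

With any non-zero round trip, the worst case lands slightly above the interval itself, which is worth keeping in mind against the sub-two-second target.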
Behind the Scenes
Pros
Cons
Required Efforts
Incremental Infrastructure Costs
Infrastructure costs could rise depending on how often polling occurs and the volume of orders.
Compute for the API Service
We have to consider the increased load caused by frequent API polling and the potential need for auto-scaling. Additional details:
Using the details above, let's calculate the monthly cost for running these instances in the following two scenarios:
If we assume that peak load occurs approximately 40% of the time and normal load occurs 60% of the time, we can calculate blended monthly costs:
To find the incremental monthly cost, we have to exclude the price of two already running API instances from the blended cost.
The estimated blended monthly cost for API compute:
Load Balancer for API Service Instances
We must consider the load balancer costs of distributing traffic between the API Service instances.
Cost per month:
Currently, the load balancer processes ~75 GB of data per month:
Assuming ~100 GB of new data will be processed by the load balancer per month after implementing the change, this adds:
Therefore, the total load balancer cost is $30.20 per month, of which $8 is the incremental cost.
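The load balancer arithmetic can be sketched as a fixed monthly charge plus a flat per-GB charge. Both constants below are back-solved from the totals quoted in this article ($8 for ~100 GB, and a $24.2 total for a load balancer handling 100 GB in Option 2). They are assumptions for illustration, not figures taken from the AWS price list.

```python
LB_FIXED_MONTHLY = 16.20  # inferred: $24.2 total minus $8 for 100 GB (Option 2 figures)
LB_PER_GB = 0.08          # inferred: $8 incremental for ~100 GB of new data


def load_balancer_monthly_cost(gb_processed: float) -> float:
    """Fixed monthly charge plus a flat per-GB data processing charge."""
    return round(LB_FIXED_MONTHLY + LB_PER_GB * gb_processed, 2)


current = load_balancer_monthly_cost(75)             # today's ~75 GB per month
after_change = load_balancer_monthly_cost(75 + 100)  # plus ~100 GB of polling traffic
incremental = round(after_change - current, 2)
print(current, after_change, incremental)
```

Under these assumptions the model reproduces the $30.2 total and the $8 increment, and it makes it easy to rerun the estimate if the traffic assumption changes.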
Database Instance Upgrade
The polling approach doesn’t significantly impact database write operations but does require vertical scaling of the existing database instance to handle frequent read requests.
The existing db.t3.medium instance incurs a monthly cost of:
After upgrading to db.t3.large, the monthly cost equals:
Incremental monthly cost for the database instance:
Summary
The total estimated monthly cost for option 1 is $149.81.
Option 2: WebSocket-based Real-Time Notifications
The system establishes a persistent WebSocket connection between the client (back-office UI or mobile application) and the server. Thus, the server can instantly push updates to the client whenever changes occur in the system.
In this approach, we will separate the WebSocket implementation from the existing API service, transitioning gradually to the new setup. Consequently, initial costs will be higher since the WebSocket service will be deployed on a separate set of cloud instances, potentially conflicting with decision drivers #3 and #4. The development team expects to move more features to the WebSocket server over time. However, the API service will remain in place, as using WebSockets for infrequent requests is impractical. Over time, API costs are expected to decrease as the workload shifts.
Behind the Scenes
Pros
Cons
Required Efforts
Incremental Infrastructure Costs
Let's consider the primary components required and estimate the costs based on the increased demand from maintaining open connections for real-time updates.
Compute for WebSocket Service
Since WebSocket connections are persistent, they will demand more memory and CPU resources than typical HTTP requests.
Based on the resources currently allocated for the API workload, we anticipate that two t3.medium instances will offer adequate capacity to support the WebSocket approach, with the possibility of scaling out to three instances.
Let's calculate the monthly cost for running these instances in the following two scenarios:
If we assume that peak load occurs approximately 20% of the time and normal load occurs 80% of the time, we can calculate blended monthly costs:
The estimated blended monthly cost for the WebSocket compute instance is $66.81.
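The blended figure can be reproduced with a short sketch. It assumes a 730-hour billing month and an on-demand rate of $0.0416 per hour for t3.medium; both are my assumptions and should be checked against current AWS pricing for the region.

```python
HOURS_PER_MONTH = 730          # assumed billing month
T3_MEDIUM_HOURLY_USD = 0.0416  # assumed on-demand rate; verify against current AWS pricing


def blended_monthly_cost(hourly_rate, normal_instances, peak_instances, peak_share):
    """Weight the normal and peak fleet costs by the share of time spent in each."""
    if not 0.0 <= peak_share <= 1.0:
        raise ValueError("peak_share must be between 0 and 1")
    per_instance = hourly_rate * HOURS_PER_MONTH
    normal_cost = normal_instances * per_instance
    peak_cost = peak_instances * per_instance
    return (1 - peak_share) * normal_cost + peak_share * peak_cost


# Two instances under normal load, three at peak, peak 20% of the time.
websocket_compute = blended_monthly_cost(
    T3_MEDIUM_HOURLY_USD, normal_instances=2, peak_instances=3, peak_share=0.20
)
print(f"${websocket_compute:.2f}")
```

Under these assumptions the result rounds to $66.81, matching the estimate above. The same function covers the API compute estimate in Option 1 by changing the instance counts and setting the peak share to 0.40.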
Load Balancer for WebSocket Connections
WebSocket connections typically require a load balancer configured to handle sticky sessions and WebSocket compatibility.
Cost per month:
Assuming ~100 GB of new data will be processed by the load balancer per month after implementing the change, this adds:
Therefore, the total load balancer cost is $24.2 per month.
Database Instance Upgrade
More frequent real-time updates may require a higher-capacity database instance to manage the increased read and write operations.
The existing db.t3.medium instance incurs a monthly cost of:
After upgrading to db.t3.large, the monthly cost equals:
The incremental monthly cost for the database instance:
Additional Storage for WebSocket Session Management and Logs
Persistent connections will generate additional logging, including WebSocket connection states, session data, and reconnections.
Total monthly cost for storage:
Summary
The total estimated monthly cost for option 2 is $190.41, which is $40.60 (about 27.1%) higher than option 1 ($149.81).
Decision Outcome
Chosen Option: Option 2 – WebSocket-based Real-Time Notifications
Why: Option 2 offers a clear advantage by instantly delivering status updates whenever changes occur. This approach meets the sub-two-second response goal and avoids the heavy resource use associated with constant polling in Option 1. WebSockets reduce unnecessary API and database load by sending updates only when needed, which helps control costs as demand grows. Although this option involves a more complex setup, its efficiency in managing high-frequency updates makes it an ideal solution as the business scales.
Decision Date: 14.01.24
Decision Makers: Architecture and Development Teams
Next Steps
Related Documents
Conclusion
In this scenario, I omitted low-level details because my primary focus was on calculating the direct infrastructure costs, which cover the basic expenses tied directly to the resources in use, like compute and storage. However, to gain a fuller understanding of the total cost impact, consider indirect expenses as well, such as the time and effort invested by teams (mentioned in the corresponding Required Efforts sections). You can estimate effort costs by analyzing the hourly rates and resource allocation for each task or role involved.
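As a minimal sketch of that effort estimate: multiply hours by hourly rate per task and sum. The task names, hours, and rates below are hypothetical placeholders, not figures from this scenario.

```python
def effort_cost(tasks):
    """Sum hours x hourly rate; each task is a (name, hours, hourly_rate) tuple."""
    return sum(hours * rate for _name, hours, rate in tasks)


# Hypothetical placeholders only; substitute your own tasks, hours, and rates.
example_tasks = [
    ("backend development", 80, 60.0),
    ("mobile client changes", 40, 55.0),
    ("testing and rollout", 24, 45.0),
]

total = effort_cost(example_tasks)
print(f"One-off effort cost: ${total:,.2f}")
```

Unlike the monthly infrastructure figures, this is a one-off cost, so it helps to compare it over the period you expect the solution to run.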
As you may have noticed, decision-making in architecture often goes beyond simply weighing pros and cons. Sometimes, even when all information is available, it takes time to choose the optimal path. Instruments like SWOT Analysis, Weighted Criteria Matrix, Impact and Effort Matrix, Risk Assessment Matrix, and others can be highly effective in making the best decision in more complex scenarios.
What approaches do you rely on for making decisions?
#ArchitectureDecisionMaking #FinOps #SolutionArchitecture #CloudDesign