How and when to introduce architectural changes amid urgent development and production issues?
[Cover image: tracing a call across different architectural components]

Most companies start with a minimum viable product to get the first clients, secure investments, and build the team and the necessary processes. But once this first phase is over, it turns out that:

  • The product does not perform well under load or is simply slow.
  • The product does not scale when new clients start using it simultaneously.
  • Parts of the product keep failing, causing downtimes, production issues, and clients' frustrations.
  • New, more prominent, and established clients demand high availability and performance SLAs.
  • More sophisticated architectures uncover knowledge and skills gaps that are hard to fill within the existing team.

This article focuses on workable approaches to finding the right time and resources to move the product forward without creating unnecessary risks.

Usual development and operational processes

Most companies I have worked in or with have some development plan focusing on functional features. Sometimes, it even includes adding a new architectural component, be it security-, networking-, database-, or DevOps-oriented.

The plan includes functional testing if the company has a QA team. Sometimes, the client also wants integration or user acceptance tests to follow. The missing parts are almost always the most critical architectural qualities:

  • Performance - speed of responses or processing within defined service level expectations and agreements.
  • High Availability - the ability to continue providing the services when critical application and infrastructure pieces fail.
  • Reliability - the ability to provide trustworthy service even when some internal components fail or produce errors.

Use Case - SaaS service for Order Processing

To illustrate, let's take a practical use case - a software-as-a-service product built to automate order processing.

Expected functionality:

  • authorized clients management
  • external logistics integrations
  • actual order processing
  • reporting

Development teams would start building the functionality. For the MVP, a decision might be made to host the service on AWS and build it with Java for the backend and Angular for the front end.

Architectural decisions would include a single relational database, such as self-managed PostgreSQL or the cloud-managed AWS RDS for PostgreSQL.

To move faster with the MVP, the backend might be developed as a monolithic REST API service, and the front end as a single monolithic Angular project.

Issues are handled with some logging, with each developer deciding what, when, and where to log, if at all. Perhaps some primitive form of system health monitoring is introduced to detect whether a service is down or still running.
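
For contrast, here is a minimal sketch of what a more consistent approach could look like, assuming SLF4J with MDC is available on the classpath; the class, method, and field names are hypothetical.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Hypothetical order handler showing consistent, contextual logging
// instead of ad-hoc, per-developer log statements.
public class OrderProcessingLogExample {

    private static final Logger log = LoggerFactory.getLogger(OrderProcessingLogExample.class);

    public void processOrder(String orderId, String clientId) {
        // Attach identifiers to the logging context so every line
        // written during this call can be correlated later.
        MDC.put("orderId", orderId);
        MDC.put("clientId", clientId);
        try {
            log.info("Order processing started");
            // ... actual processing steps would go here ...
            log.info("Order processing finished");
        } catch (RuntimeException e) {
            // Log failures with the same context and the stack trace.
            log.error("Order processing failed", e);
            throw e;
        } finally {
            MDC.clear(); // avoid leaking context to the next request on this thread
        }
    }
}
```

With identifiers in the logging context, every log line produced during a request can be correlated later, regardless of which developer wrote the statement.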

All seems to be fine until 1) production issues start coming in, taking a heavy toll on the dev team to identify the root cause, fix it, and release the fix back into the wild; 2) the team grows, and it becomes harder and harder to manage releases of the new software; and 3) the system becomes painfully slow, especially when a larger batch of orders comes in, external systems slow down, or heavy reports are run during order processing.

At this moment, everyone starts to understand that something must be done, and soon. But when do we start, how do we know we're doing the right thing, how do we prioritize the changes, and who can make all these decisions?

Does this sound all too familiar? I've encountered it in multiple projects and companies, and it is always challenging, complicated, and involved.

Let there be light - bringing order into the chaos

[Diagram: a modern complex system architecture]

Brainstorming session

Again, from experience, the number one action a company can take is to sit down with the relevant group of knowledge holders and stakeholders and discuss the current problems the product is facing. The three main categories are outlined below.

System slowness identified in development, testing, and production environments. This can include a) interactive actions, b) API invocations, c) event processing, d) background processes, and e) related monitoring and alerting on such conditions.

Potential sources of slowness:

  • Insufficient resources, especially CPU, allocated to the virtual machines or Kubernetes pods where application services run, as well as to platform components such as databases, distributed caches, messaging services, service registries, security-related services, and so on.
  • The application is built without scaling in mind. For instance, it runs in one or a few threads and cannot scale out to additional machines.
  • Kubernetes scaling and auto-scaling are not applied, resulting in the system being unable to withstand the load or bursts.
  • Database queries are not optimized, or there are not enough indices to speed up the queries.
  • Queries try to process large data volumes without sharding or other big-data processing strategies.
  • Network latency between multiple services or data stores slows down the processing.
  • Message queues in event-driven systems are not consumed fast enough because consumers are slow or do not scale well.
  • As part of processing, external services are called without taking network latency into account (see the sketch after this list).
  • While critical processes run, the same application and platform resources are consumed by heavy interactive actions, like large report generation. A typical example is a heavy report running on the same database that many users rely on for the interactive application.
  • The product could be built with too many small microservices talking to each other over secure JSON/HTTP interfaces without considering the overhead such conversations create when not done correctly.
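
To make a couple of the bullets above concrete (single-threaded processing and external calls without timeouts), here is a minimal sketch; the LogisticsClient interface and all names in it are hypothetical assumptions for illustration, not part of any specific product.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class OrderBatchProcessor {

    // Hypothetical client for an external logistics service.
    interface LogisticsClient {
        String reserveShipment(String orderId);
    }

    // Bounded pool: lets one instance use several cores without
    // spawning an unbounded number of threads.
    private final ExecutorService pool = Executors.newFixedThreadPool(8);
    private final LogisticsClient logistics;

    public OrderBatchProcessor(LogisticsClient logistics) {
        this.logistics = logistics;
    }

    public void processBatch(List<String> orderIds) {
        List<CompletableFuture<Void>> tasks = orderIds.stream()
                .map(orderId -> CompletableFuture
                        // Run each order on the pool instead of the caller thread.
                        .supplyAsync(() -> logistics.reserveShipment(orderId), pool)
                        // Do not let one slow external call stall the whole batch.
                        .orTimeout(2, TimeUnit.SECONDS)
                        .thenAccept(confirmation -> persistConfirmation(orderId, confirmation))
                        // Record the failure and keep processing the other orders.
                        .exceptionally(ex -> { recordFailure(orderId, ex); return null; }))
                .toList();

        // Wait for the whole batch to settle.
        CompletableFuture.allOf(tasks.toArray(new CompletableFuture[0])).join();
    }

    private void persistConfirmation(String orderId, String confirmation) { /* store the result */ }

    private void recordFailure(String orderId, Throwable cause) { /* log and mark for retry */ }
}
```

The pool size and timeout values here are arbitrary; the point is that concurrency is bounded and external latency is capped, which removes two common sources of slowness inside a single service.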

Availability (up/down) of the product and its components. Here, reported issues from all environments must be collected, counted, and prioritized to find out when the product as a whole, or its parts, was identified as being down, non-responsive, timing out, etc. This can also include platform and infrastructure components.

Here, we pay attention to application, platform, or infrastructure components becoming unavailable or going down (a minimal reachability probe is sketched after this list). These include:

  • application crashes due to bugs or missing validations
  • databases going down or becoming extremely slow due to irresponsible queries
  • not using platform high availability clusters
  • network loss or timeouts
  • security restrictions due to misconfiguration
  • external services becoming unavailable due to networking or reasons not controlled by the company
  • whole data centers going down in planned or unplanned fashion
  • etc.
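
As a starting point for attributing outages to a specific component, here is a minimal reachability probe; the host names and ports are hypothetical placeholders, and a real product would more likely rely on the health-check facilities of its platform or framework.

```java
import java.net.InetSocketAddress;
import java.net.Socket;
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal availability probe: checks whether critical dependencies are
// reachable and reports per-component status, so an outage can be attributed
// to a specific component rather than just "the product is down".
public class DependencyHealthCheck {

    // Hypothetical endpoints; in a real system these come from configuration.
    private static final String DB_HOST = "orders-db.internal";
    private static final int DB_PORT = 5432;
    private static final String LOGISTICS_HOST = "logistics-partner.example.com";
    private static final int LOGISTICS_PORT = 443;

    public Map<String, String> check() {
        Map<String, String> status = new LinkedHashMap<>();
        status.put("checkedAt", Instant.now().toString());
        status.put("database", probe(DB_HOST, DB_PORT));
        status.put("logistics", probe(LOGISTICS_HOST, LOGISTICS_PORT));
        return status;
    }

    // TCP-level reachability only; it will not catch "up but failing" cases,
    // which is exactly why reliability is analyzed separately below.
    private String probe(String host, int port) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 1000);
            return "UP";
        } catch (Exception e) {
            return "DOWN: " + e.getMessage();
        }
    }
}
```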

Reliability of the system (it is up but keeps failing), which differs from general unavailability in that nothing seems to be down, yet the service keeps failing, returning errors, being very slow, timing out, and so on.

Here, it is essential to analyze which flows can fail and which must succeed, which can be safely retried, and which must be attempted only once. This is followed by thinking about possible policies for automatic idempotent retries at different levels, clear logging, intelligent state management, etc. Notably, completing this analysis may require additional knowledge of what can be done to manage the chaos.
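
To illustrate the retry and idempotency idea, here is a minimal sketch; the PaymentGateway interface, the idempotency-key scheme, and the in-memory store are hypothetical assumptions for illustration only, not a production-ready implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of an idempotent, retryable call: the same business operation may be
// attempted several times, but its side effect happens at most once, because
// the external call carries a stable idempotency key.
public class IdempotentRetryExample {

    // Hypothetical external dependency that accepts an idempotency key.
    interface PaymentGateway {
        String charge(String idempotencyKey, String orderId, long amountCents);
    }

    private final PaymentGateway gateway;
    // In-memory store for the sketch; a real system would persist this state.
    private final Map<String, String> completed = new ConcurrentHashMap<>();

    public IdempotentRetryExample(PaymentGateway gateway) {
        this.gateway = gateway;
    }

    public String chargeWithRetry(String orderId, long amountCents, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        String key = "charge-" + orderId;      // stable key: same order => same key
        String existing = completed.get(key);
        if (existing != null) {
            return existing;                   // already done, do not charge twice
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                String receipt = gateway.charge(key, orderId, amountCents);
                completed.put(key, receipt);   // remember success for future calls
                return receipt;
            } catch (RuntimeException e) {
                last = e;                      // retry only what is safe to retry
                sleepBeforeRetry(attempt);
            }
        }
        throw last;                            // all attempts failed; surface the error
    }

    private void sleepBeforeRetry(int attempt) {
        try {
            Thread.sleep(200L * attempt);      // simple linear backoff for the sketch
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Which calls get this treatment, and which must be attempted only once, is exactly what the flow-by-flow analysis above is meant to decide.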

As a crucial part of this analysis, each bullet above can be examined and brainstormed to see whether it was easy to identify the issue, perform root cause analysis, outline how to reproduce it, etc., together with proposing and documenting potential improvements.

Observability tooling - the missing ingredient

I must warn that only if the company is already well armed with a good set of observability tools and well-tracked bugs and production issues will it have objective facts and numbers to support the brainstorming sessions above.

If not, the discussion will rely mostly on opinions and largely unsubstantiated claims ("in my experience," "I think that," "it is apparent that," "everyone knows that," etc.), resulting in a biased plan of action that does not necessarily focus on the most pressing and solvable issues.

What tools are needed? At a minimum, tooling that provides objective metrics, logs, and traces.

In general, the concept of observability is quite broad and critical to understand.

Only with such tools, providing objective metrics, logs, and traces, can the team identify the exact sources of a problem and create an actionable plan to improve or eliminate them.
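
As one concrete example of the tracing part, the sketch below instruments a single order-processing step with the OpenTelemetry Java API; it assumes the OpenTelemetry SDK and an exporter are configured elsewhere, and the span and attribute names are illustrative.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

// Wraps one business step in a trace span so the call can be followed
// across components and correlated with logs and metrics.
public class TracedOrderStep {

    private final Tracer tracer = GlobalOpenTelemetry.getTracer("order-processing");

    public void reserveShipment(String orderId) {
        Span span = tracer.spanBuilder("reserveShipment").startSpan();
        try (Scope ignored = span.makeCurrent()) {
            span.setAttribute("order.id", orderId);    // illustrative attribute name
            callLogisticsProvider(orderId);            // the actual work being traced
        } catch (RuntimeException e) {
            span.recordException(e);                   // keep the error visible in the trace
            span.setStatus(StatusCode.ERROR, "reserveShipment failed");
            throw e;
        } finally {
            span.end();                                // always close the span
        }
    }

    private void callLogisticsProvider(String orderId) { /* external call goes here */ }
}
```

With such spans exported from every component, tracing a call across the different architectural components becomes possible end to end.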

Putting together a plan

At this point, and I can't stress it enough, it is super critical to put the necessary tasks around the prioritized architectural improvements into the plan, with their assigned owners, resources, and timeline.

The task list can also include more fluid items, like:

  • learning and research;
  • consultations with internal or external knowledge holders;
  • proofs of concepts;
  • installation of the necessary tooling;
  • provisioning of environments;
  • running tests before and after;
  • and so on.

Critical warning

A bit more advice and a warning: often, such architectural improvement efforts are undertaken sporadically; some work is done, and then everyone is moved back to feature development or dealing with production issues, forgetting or postponing everything until better times.

This leads to lost effort, frustration, the product not evolving, production issues piling up, good people deciding to leave, and more...

It is vital to realize that architectural improvement is an iterative process, where research, proofs of concepts, developments, testing, and delivery are done time after time, cycle after cycle. And they should be planned this way. Iteratively, from start to finish, in small doses, but constantly.

This is the only viable way to keep evolving the product, reduce technical (specifically architectural) debt, take on bigger and better clients, attract better talent, minimize frustration, and build pride in the software the company delivers.

That's why I suggest appointing a person who is responsible and has enough authority to drive architectural tasks, making room for them in the company plans, involving the relevant internal and external specialists, and periodically presenting results to the stakeholders.

This can be done by:

  • the CTO himself (responsible for the company technology)
  • Enterprise Architect (connecting software and the business)
  • Solution Architect (dealing with software serving clients needs)
  • Software Architect (architecting a software product) or
  • Tech Lead (responsible for the service delivered by his or her team).


About the Author

Alexander Stern has more than 25 years of software engineering experience, working closely with C-level executives to ensure adherence to business needs, vision, and strategy.

He is available for short-term Architect-as-a-Service consultations to help businesses make their next evolutionary leap, avoiding pitfalls and taking deliberate, precise steps.

He can be contacted via email at [email protected] or by text message (WhatsApp/Telegram) at +372 56815512

