How and when to introduce architectural changes amid urgent development and production issues?
Most companies start with a minimum viable product to get the first clients, secure investments, and build the team and the necessary processes. But once this first phase is over, it turns out that:
This article focuses on workable approaches to finding the right time and resources to move the product forward without creating unnecessary risks.
Usual development and operational processes
Most companies I have worked in or with have a development plan focused on functional features. Sometimes it even includes adding a new architectural component, be it security-, networking-, database-, or DevOps-oriented.
The plan includes functional testing if the company has a QA team. Sometimes, the client also wants integration or user acceptance tests to follow. The missing parts are almost always the most critical architectural qualities:
Use Case - SaaS service for Order Processing
To illustrate, let's take a practical use case - a software-as-a-service product built to automate order processing.
Expected functionality:
Development teams would start building the functionality. For the MVP, a decision might be made to host the service on AWS and build it with Java on the backend and Angular on the front end.
Architectural decisions would include a single relational database, either self-managed PostgreSQL or the cloud-managed AWS RDS for PostgreSQL.
To move faster with the MVP, a monolithic REST API service might be developed. Similarly, the front end becomes a single monolithic Angular project.
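To make this shape concrete, here is a minimal sketch of what such a "do everything in one place" endpoint might look like. Spring Boot, the class names, and the table layout are illustrative assumptions; the article only states Java, a REST API, and a single PostgreSQL database.

```java
import org.springframework.http.ResponseEntity;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.web.bind.annotation.*;

// Hypothetical monolithic MVP controller: validation, persistence, and calls to
// external systems all end up in one service -- fast to build, hard to scale later.
@RestController
@RequestMapping("/orders")
public class OrderController {

    private final JdbcTemplate jdbc; // single shared PostgreSQL database

    public OrderController(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    @PostMapping
    public ResponseEntity<String> createOrder(@RequestBody OrderRequest request) {
        // Everything happens synchronously inside the request: insert the order,
        // then (in a real MVP) notify external systems, send emails, and so on.
        jdbc.update("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                request.customerId(), request.total());
        return ResponseEntity.ok("created");
    }

    // Illustrative request payload.
    public record OrderRequest(long customerId, java.math.BigDecimal total) {}
}
```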
Issues are handled with some logging, with each developer deciding on their own what, when, and where to write, if at all. Perhaps some primitive form of system health monitoring is introduced to detect whether a service is down or still running.
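For contrast, here is a minimal sketch of the kind of structured, correlated logging that is usually missing at this stage. SLF4J and MDC are assumptions (the article does not name a logging library), and the class and field names are illustrative.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class OrderProcessor {

    private static final Logger log = LoggerFactory.getLogger(OrderProcessor.class);

    public void process(String orderId) {
        // Attach a correlation id so every log line for this order can be found later.
        MDC.put("orderId", orderId);
        try {
            log.info("Order processing started");
            // ... business logic ...
            log.info("Order processing finished");
        } catch (Exception e) {
            // Log the failure with context instead of swallowing it or printing to stdout.
            log.error("Order processing failed", e);
            throw e;
        } finally {
            MDC.remove("orderId");
        }
    }
}
```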
All seems to be fine until 1) production issues start coming in, taking a heavy toll on the dev team to identify the root cause, fix it, and release the fix back into the wild; 2) the team grows, and it becomes harder and harder to manage releases of the new software; and 3) the system becomes painfully slow, especially when a large batch of orders arrives, external systems slow down, or heavy reports are run during order processing.
At this moment, everyone starts to understand that something must be done, and soon. But when do we start, how do we know we are doing the right thing, how do we prioritize the changes, and who can make all these decisions?
Sound all too familiar? I have encountered this in multiple projects and companies. And it is always challenging, complicated, and involved.
Let there be light - bringing order to the chaos
Brainstorming session
Again, from experience, the number one action a company can take is to sit down with the relevant group of knowledge holders and stakeholders and ask questions about the current problems the product is facing. The three main categories are outlined below.
System slowness identified in development, testing, and production environments. This can include a) interactive actions, b) API invocations, c) event processing, d) background processes, and e) related monitoring and alerting on such conditions.
Potential sources of slowness:
Availability (up/down) of the product and its components. Here, reported issues from all environments must be collected, counted, and prioritized to find out when the product as a whole, or its parts, was identified as being down, non-responsive, timing out, etc. This can also include platform and infrastructure components.
Here, we pay attention to application, platform, or infrastructure components becoming unavailable or going down. These include
Reliability of the system (it is up but keeps failing), which differs from general unavailability in that nothing seems to be down, yet the service keeps failing, returning errors, responding very slowly, timing out, and so on.
Here, it is essential to analyze which flows can fail and which must succeed, which can be safely retried, and which must be attempted only once. This is followed by thinking about possible policies of automatic idempotent retries at different levels, clear logging, intelligent state management, etc. Notably, completing this analysis may require additional knowledge of what can be done to manage the chaos.
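As a minimal sketch of what such a retry policy can look like, the wrapper below re-executes a failed operation with exponential backoff and records an idempotency key so a repeated attempt does not duplicate work. The class and method names are illustrative assumptions, and a real system would persist the keys alongside the order rather than in memory.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class RetryingExecutor {

    // Hypothetical in-memory store of already-completed idempotency keys;
    // in practice this would live in the database next to the business data.
    private final Set<String> completedKeys = ConcurrentHashMap.newKeySet();

    public <T> T executeWithRetry(String idempotencyKey, int maxAttempts, Supplier<T> operation) {
        if (completedKeys.contains(idempotencyKey)) {
            // The operation already succeeded once; do not run it again.
            // A real implementation would return the previously stored result.
            return null;
        }
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                T result = operation.get();
                completedKeys.add(idempotencyKey);
                return result;
            } catch (RuntimeException e) {
                lastFailure = e;
                // Exponential backoff between attempts: 200ms, 400ms, 800ms, ...
                try {
                    Thread.sleep(200L << (attempt - 1));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Retry interrupted", ie);
                }
            }
        }
        throw lastFailure;
    }
}
```

Flows that must be attempted only once would simply bypass this wrapper, or use it with a single attempt and rely on the recorded key to block repeats.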
As a crucial part of this work, each bullet above can be analyzed and brainstormed to determine whether it was easy to identify the issue, perform root cause analysis, outline how to reproduce it, etc., together with proposing and documenting potential improvements.
Observability tooling - the missing ingredient
I must warn that only if the company is already well armed with a good set of observability tools and well-tracked bugs and production issues will it have objective facts and numbers to support the brainstorming sessions above.
If not, then mostly opinions and largely unsubstantiated claims ("in my experience", "I think that", "it is apparent that", "everyone knows that", etc.) would be widely used, resulting in a biased plan of action that does not necessarily focus on the most pressing and solvable issues.
What tools are needed:
In general, the concept of observability is quite large and critical to understand. Below are two videos I found helpful for this purpose.
Only with these tools, providing objective metrics, logs, and traces, will the team be able to identify the exact sources of the problems and create an actionable plan to improve or eliminate them.
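As a small illustration of what "objective metrics" means in code, here is a sketch that times order processing and counts processed orders using Micrometer. Micrometer itself, the metric names, and the class structure are assumptions for the example; any comparable metrics library and backend would serve the same purpose.

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class InstrumentedOrderService {

    private final MeterRegistry registry;
    private final Timer processingTimer;

    public InstrumentedOrderService(MeterRegistry registry) {
        this.registry = registry;
        this.processingTimer = Timer.builder("orders.processing.time")
                .description("Time spent processing a single order")
                .register(registry);
    }

    public void process(String orderId) {
        // Record how long each order takes, so slow batches show up as a metric, not a guess.
        processingTimer.record(() -> {
            // ... actual order processing ...
        });
        registry.counter("orders.processed.total").increment();
    }

    public static void main(String[] args) {
        // SimpleMeterRegistry is enough for a local demo; production would export
        // the metrics to Prometheus, CloudWatch, or another monitoring backend.
        InstrumentedOrderService service = new InstrumentedOrderService(new SimpleMeterRegistry());
        service.process("order-42");
    }
}
```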
Putting together a plan
At this point, and I cannot stress it enough, it is critical to put the tasks for the prioritized architectural improvements into the plan, with assigned owners, resources, and timelines.
The task list can also include more fluid items, like:
Critical warning
A bit more advice and warning: often, such architectural improvement efforts are undertaken sporadically; some work is done, and then everyone is moved back to feature development or dealing with production issues, forgetting or postponing everything until better times.
This leads to lost effort, frustration, the product not evolving, production issues piling up, good people deciding to leave, and more...
It is vital to realize that architectural improvement is an iterative process, where research, proofs of concept, development, testing, and delivery are done time after time, cycle after cycle. And they should be planned this way: iteratively, from start to finish, in small doses, but constantly.
This is the only viable way to keep evolving the product, reduce technical (specifically architectural) debt, take on bigger and better clients, attract better talent, minimize frustration, and raise pride in the software the company delivers.
That's why I suggest appointing a person who is responsible and has enough authority to drive architectural tasks, making room for them in the company plans, involving the relevant internal and external specialists, and periodically presenting results to the stakeholders.
This can be done by:
About the Author
Alexander Stern has more than 25 years of software engineering experience, working closely with C-level executives to ensure adherence to business needs, vision, and strategy.
He is available for short-term Architect-as-a-Service consultations to help businesses make their next evolutionary leap, avoiding pitfalls and taking deliberate, precise steps.
He can be contacted via email at [email protected] or by text message (WhatsApp/Telegram) at +372 56815512