Distributed System - Lessons Learned
A distributed system is a group of computers/nodes working together to achieve a common goal (e.g., solving a computation problem, storing data at scale, etc.).
Cloud-native distributed systems are a driving force in digital transformation. They help organizations achieve the agility they need to accelerate, innovate, and scale. However, the complexity of managing these ecosystems increases dramatically as they grow and become more dynamic. Over the last year our team built one such distributed system, which gave us some good learning and insights. I have gathered the following common but practical points, which I hope will help anyone who is planning or already developing such a project.
1. Change Your Mindset Before Diving Into Distributed Systems
Software design is more of a philosophy and strategy than the code itself. People who are used to working in a simple centralised architecture often find it difficult to accept the facts surrounding distributed architecture. What they end up doing is building small monoliths wrapped in web APIs, which gives the illusion of a distributed system (aka distributed monoliths). To understand the philosophy and culture of distributed systems, we need to go back and look at the problems that gave rise to such systems, and then walk through the evolution into the distributed systems of today. In the last decade we have seen the rise of what is known as microservices architecture; it is one type of distributed system. The problem we face today is that a lot of people have just left monolithic systems and jumped directly onto the "microservices" wagon, and given the ambiguity around what microservices are, they are left highly confused. I would highly encourage you to follow a few pioneers in this space and listen to or read them, such as Martin Kleppmann, Joe Armstrong, and Leslie Lamport.
2. Data is the Most Critical and Core Component
The system we worked on needed a decent level of data calculation and manipulation. We realised very early on that our entire system's performance, accuracy and scalability would depend on a robust yet adaptive data structure that can be denormalised and scaled out. We read many papers and designs on how different people have approached data structures in our business domain. The lesson we learned is to spend as much time as possible thinking about data structures, because a bad data structure is very expensive to live with and much more expensive to replace. Data structures matter not only when you persist data but also for your message broker, event streams, the code that processes your data, and so on. Secondly, the choice of database and an understanding of distributed transactions and replication consistency are extremely important. As soon as you distribute your data, you are in the world of partitioning and scaling, which spreads your data and data processing across multiple nodes. You should read about how consensus algorithms work, at least at a high level; it really helps when you work with highly scalable databases. I would highly recommend reading Designing Data-Intensive Applications by Martin Kleppmann and Principles of Distributed Database Systems by Valduriez & Özsu.
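As a small aside on replication consistency (this example is mine, not from the project), here is a minimal Python sketch of the classic quorum rule used by many replicated data stores: with N replicas, a write acknowledged by W replicas and a read from R replicas are guaranteed to overlap when R + W > N, so the read sees at least one copy of the latest write.

```python
# Toy illustration (not from the project) of quorum-based replication:
# read and write quorums always intersect when R + W > N.

from dataclasses import dataclass


@dataclass
class QuorumConfig:
    n: int  # total number of replicas
    w: int  # replicas that must acknowledge a write
    r: int  # replicas consulted on a read

    def read_sees_latest_write(self) -> bool:
        """True if every read quorum intersects every write quorum."""
        return self.r + self.w > self.n

    def write_failure_tolerance(self) -> int:
        """Number of replica failures a write can tolerate."""
        return self.n - self.w


if __name__ == "__main__":
    for cfg in (QuorumConfig(n=3, w=2, r=2), QuorumConfig(n=5, w=1, r=1)):
        print(cfg,
              "| consistent reads:", cfg.read_sees_latest_write(),
              "| write tolerates", cfg.write_failure_tolerance(), "failed replicas")
```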
3. Ensure End-to-End Parallelism
One of the major benefits of a distributed system is that it can scale horizontally, and with that comes the power of parallel processing. While that is the promise, many system designs are infected with some level of bottleneck. A bottleneck is a resource that limits the maximum performance of the entire service. Bottlenecks include both software resources, such as threads, locks, and channels, and hardware resources, such as processors, memory, and disks. When we design distributed systems, we need to think in terms of a "unit of scalability". This unit serves a given business domain and provides end-to-end business functionality, so that when we need more parallelism we simply add another unit and the system scales out (see the sketch below). You should also read the paper "Exploring Efficient Microservice Level Parallelism" by X. Wang et al., which covers the topic of Microservice Level Parallelism (MLP) very thoroughly.
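A minimal sketch of the "unit of scalability" idea, with a hypothetical handle_order function and a local process pool standing in for real service instances; the only point is that each unit does end-to-end work and throughput grows by adding units:

```python
# Illustrative only: each "unit" handles a business request end to end,
# and we scale by running more units in parallel.

from concurrent.futures import ProcessPoolExecutor
import time


def handle_order(order_id: int) -> str:
    """One unit of end-to-end work: validate, price, and persist an order."""
    time.sleep(0.1)  # stand-in for real validation/pricing/persistence work
    return f"order {order_id} processed"


def process_all(orders: list[int], units: int) -> float:
    """Process all orders with a given number of parallel units; return seconds taken."""
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=units) as pool:
        list(pool.map(handle_order, orders))
    return time.perf_counter() - start


if __name__ == "__main__":
    orders = list(range(20))
    for units in (1, 2, 4):
        print(f"{units} unit(s): {process_all(orders, units):.2f}s")
```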
4. Be Strict with Service Boundaries
While designing the system, always keep service boundaries in mind. It's very easy to get this wrong and end up with either a highly decoupled system that is inefficient or highly inter-dependent systems that are hard to scale out. It is therefore extremely important to understand where a service boundary starts and where it ends, and moreover where these service boundaries fit in the grand scheme of business domains and sub-domains. While you define these service boundaries, also look at the related databases, event streams, etc., and ensure that the clear separation runs throughout the system, not only at the API layer. A last point here: be very thoughtful about whether each service is stateful or stateless.
5. Draw Every Box and Arrow in Detail
One thing we never shy away from is drawing, rather than relying on plain verbal discussions or long texts. When you draw, you give yourself a report on how things will interact: the domain, the context, the boundaries, and so on. We found that the better we got at our drawings, the more thorough and detailed our analysis could be. Try to draw every box and arrow in your system, and then add details to those boxes and arrows. Most of the time our diagrams went through multiple iterations before we could have any meaningful analysis, and we often uncovered our own misunderstandings, or rather the absence of understanding, while drawing. In my opinion, diagrams must be self-descriptive, consistent, accurate enough, and connected to the code. Let the diagram drive the code, and what you will have is a visual aid that defines all your distributed test cases. "Design for visibility to make inspection and debugging easier" (Basics of the Unix Philosophy).
6. Promote Code-Agnostic, Tech-Stack-Independent Thinking
Analysis is limited for those who have mostly worked on implementing solutions with a specific tech stack and do not understand the technical theory behind it. The problem they get into, and this is my personal view, is that they think only in terms of the SDKs or APIs they know, rather than questioning the basics of the technology itself. I often use generic terms such as "message broker" or "event stream" just to ensure that a discussion does not become biased towards a certain tech stack. Having said that, there are some specific capabilities that are only available within a certain tech stack. Try to steer the discussion using generic technical terms, and it will widen the scope of possible solutions. Always remember: choose technology around a solution; do not build a solution around a particular technology.
7. Measure Everything and Make No Assumptions
As W. Edwards Deming put it, "Without data, you're just another person with an opinion." Your data and processing are scattered across different nodes, and these nodes communicate over a network that may or may not be reliable. This makes it a massive challenge to get reliable metrics out and to pin down where the problems are when these systems are put under load or spikes. Our team put benchmarks throughout the codebase to ensure we knew the numbers before any network latency was taken into account. We also made sure the team understood each individual service's resource utilisation when run in isolation. Finally, put numbers on every network touch point, so that you know, by the numbers, how each service is performing. If you are using cloud services, do not just trust the numbers on their performance page; they may surprise you when you put them through your own load. There are many factors that can affect the performance of cloud services that you may not know about, so measure those for your profile too. As Rob Pike puts it, "Don't try to second guess and put in a speed hack until you've proven that's where the bottleneck is."
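A minimal micro-benchmark harness along these lines (illustrative only; the work function is hypothetical, and for serious measurement you would reach for a proper benchmarking or load-testing tool):

```python
# Time a function in isolation so you know its cost before network latency
# enters the picture; report median and p95 over several runs.

import statistics
import time
from typing import Callable


def benchmark(fn: Callable[[], None], runs: int = 50) -> tuple[float, float]:
    """Return (median, p95) wall-clock seconds over `runs` executions."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return statistics.median(samples), samples[int(0.95 * (len(samples) - 1))]


if __name__ == "__main__":
    # Hypothetical unit of work standing in for a real service operation.
    def work() -> None:
        sum(i * i for i in range(100_000))

    median, p95 = benchmark(work)
    print(f"median={median * 1000:.2f}ms  p95={p95 * 1000:.2f}ms")
```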
8. Think IDEALS, rather than SOLID
Although some of the SOLID principles apply to microservices, object orientation is a design paradigm that deals with elements (classes, interfaces, hierarchies, etc.) that are fundamentally different from the elements of distributed systems in general, and microservices in particular. A service in a distributed architecture is typically isolated in terms of its responsibility, process, and data, and communicates over the network. There isn't a single massive codebase that needs to be loaded with the interfaces, dependency injection, DTOs and DAOs required by enterprise-level monoliths. Some of the SOLID principles appear in IDEALS too, but IDEALS (Interface segregation, Deployability, Event-driven, Availability over consistency, Loose coupling, Single responsibility) also incorporates principles that are specific to microservice-based architecture.
9. Always Account for Failures
You must not only know the "8 fallacies of distributed computing", but also the limitations and mitigations for those scenarios. Resiliency is the ability of a system to gracefully handle and recover from failures, both inadvertent and malicious. It isn't about avoiding failures but about accepting that failures will happen and responding to them in a way that avoids downtime or data loss. The system should be designed to cope with partial failures, such as network outages or nodes and VMs crashing in the cloud. What we noticed is that although many people talk about this, few put real effort into testing failure scenarios and verifying the system's resiliency. Also, if your failures are not triggering alerts from the right zone of your distributed architecture, you are in big trouble. There are well-defined patterns you should learn about, such as Bulkhead, Circuit Breaker, Retry, and Queue-Based Load Levelling.
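As a small illustration of the Retry pattern mentioned above (a sketch only; TransientError and the flaky call are hypothetical, and in production you would normally use a resilience library or service mesh and pair retries with a circuit breaker):

```python
# Retry with exponential backoff and full jitter: retry transient failures
# a bounded number of times, spreading retries out to avoid thundering herds.

import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for a retryable failure such as a timeout or a 503."""


def retry(call: Callable[[], T], attempts: int = 5, base_delay: float = 0.1) -> T:
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == attempts:
                raise  # retry budget exhausted: surface the failure to the caller
            # Exponential backoff with full jitter.
            time.sleep(random.uniform(0, base_delay * (2 ** (attempt - 1))))
    raise RuntimeError("unreachable")


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky() -> str:
        calls["n"] += 1
        if calls["n"] < 3:
            raise TransientError("downstream unavailable")
        return "ok"

    print(retry(flaky), "after", calls["n"], "attempts")
```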
10. Observability in a Distributed World
When services are running in their own clusters, and each service is running on multiple nodes, it is a nightmare if you do not understand the concepts of distributed tracing and logging, and what binds them together. Often, we learn about the absence of distributed logging and tracing only when something has gone wrong and we cannot find out why; by then it's too late. Thinking it through end to end, mocking failure scenarios, and testing logs and traces is a good exercise. Logging and tracing remain undefined and untested in many parts of distributed systems. Another area that needs consideration is differentiating between application and audit logs, which are used interchangeably in many places, hence the confusion about their importance. Application logs typically record implementation-level events that happen as the program runs (methods get called, objects are created, etc.); they focus on things that interest programmers. Audit logs record domain-level events: a transaction is created, a user performs an action, etc. In certain types of application there is a legal obligation to record such events.
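A minimal sketch of the thing that binds distributed logs and traces together: a correlation/trace ID generated at the edge, stamped on every log line, and forwarded with every downstream call (illustrative only; a real system would typically follow the W3C Trace Context standard via an OpenTelemetry-style library):

```python
# Propagate a correlation ID through a request and inject it into every log line.

import logging
import uuid
from contextvars import ContextVar
from typing import Optional

correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


logging.basicConfig(format="%(asctime)s %(correlation_id)s %(name)s %(message)s")
log = logging.getLogger("orders")
log.addFilter(CorrelationFilter())
log.setLevel(logging.INFO)


def handle_request(payload: dict, incoming_id: Optional[str] = None) -> None:
    # Reuse the caller's ID if present (e.g. from an HTTP header), else mint one.
    correlation_id.set(incoming_id or uuid.uuid4().hex)
    log.info("received order %s", payload["id"])
    call_downstream(payload)


def call_downstream(payload: dict) -> None:
    # The same ID would be forwarded as a header to the next service.
    log.info("reserving stock for order %s", payload["id"])


if __name__ == "__main__":
    handle_request({"id": 42})
```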
11. Dependency Coefficient
Regardless of the number of services you have in a distributed architecture, it's always important to keep a close eye on the dependency coefficient among services. The moment we introduce unmanageable, strong dependencies between our services, we lose the potential advantages of a distributed (microservices) architecture: we can no longer deploy services independently, our teams spend more time in sync meetings than in the code, and so on. There was not much interest in this area until recently, but I suppose the pain of microservices has pushed many towards some sort of dependency graph. We should monitor the key architectural properties regularly so that we can act on dependencies early, before they become a tangled problem.
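One lightweight way to watch this (a sketch with hypothetical service names, not a real tool): keep the service dependency graph as data and compute fan-in/fan-out per service, flagging anything whose coupling crosses a threshold.

```python
# Flag services whose combined fan-in + fan-out suggests growing coupling.

from collections import defaultdict

# Directed edges: service -> services it calls (hypothetical example data).
DEPENDENCIES = {
    "orders":    ["payments", "inventory", "notifications"],
    "payments":  ["notifications"],
    "inventory": ["notifications"],
    "catalog":   [],
}


def coupling_report(deps: dict[str, list[str]], threshold: int = 3) -> None:
    fan_in: dict[str, int] = defaultdict(int)
    for source, targets in deps.items():
        for target in targets:
            fan_in[target] += 1
    for service in sorted(set(deps) | set(fan_in)):
        out = len(deps.get(service, []))
        inc = fan_in[service]
        flag = "  <-- review coupling" if out + inc >= threshold else ""
        print(f"{service:<14} fan-out={out} fan-in={inc}{flag}")


if __name__ == "__main__":
    coupling_report(DEPENDENCIES)
```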
12. Incremental Architecture Approach
If you cannot design a good modular monolith, the chances of you designing a good distributed system are quite slim. Within our team, we built everything as a modular monolith first, and decided over time where it made sense to distribute it over microservices. We knew there would be multiple iterations of our design, but the fundamentals of data structure and domain boundaries remained solid even in our modular monolith. It might look like a time-consuming approach, but on the contrary, it gives you pace of evolution. eBay started in 1995 as a three-day weekend project, not as a proof of concept (PoC) for a business, but to see if it was possible to do something interesting on the web. Today, it is on its fifth complete rewrite of its infrastructure and is a polyglot set of microservices. Twitter, Amazon and many other big companies have gone through a similar evolution. So the lesson here is to create a good, fundamental, core modular monolith, prove it works, and scale it to the point where a distributed architecture starts to add value. Do not start distributing everything from day one.
13. Something Better is Coming Soon
Technology is constantly evolving, and much faster than before. Whatever tech stack you are using today will be old, most likely sooner than you think. Hence there is a very important job at hand: decouple domain knowledge completely from technology. Some system designs cannot distinguish between technology and domain knowledge, and end up as systems that need a complete rewrite as the technology evolves. Distributed system concepts are also evolving faster than before, helped by more reliable networks and the ease of scaling out infrastructure. So we should find a way to keep our domain knowledge safe and robust while we switch and replace the surrounding tech stacks.
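One common way to achieve this is the ports-and-adapters (hexagonal) style: the domain logic depends only on abstract ports, and technology-specific adapters implement them, so the stack can be swapped without touching domain knowledge. A minimal sketch with hypothetical names:

```python
# The domain use case depends on an abstract port, not on any database/tech.

from abc import ABC, abstractmethod


class OrderRepository(ABC):
    """Port: what the domain needs, expressed without technology detail."""

    @abstractmethod
    def save(self, order_id: str, total: float) -> None: ...


class PlaceOrder:
    """Domain use case: knows the business rule, not the database."""

    def __init__(self, repository: OrderRepository) -> None:
        self._repository = repository

    def execute(self, order_id: str, total: float) -> None:
        if total <= 0:
            raise ValueError("order total must be positive")
        self._repository.save(order_id, total)


class InMemoryOrderRepository(OrderRepository):
    """Adapter: could later be replaced by a Postgres, DynamoDB, etc. adapter."""

    def __init__(self) -> None:
        self.rows: dict[str, float] = {}

    def save(self, order_id: str, total: float) -> None:
        self.rows[order_id] = total


if __name__ == "__main__":
    use_case = PlaceOrder(InMemoryOrderRepository())
    use_case.execute("ord-1", 49.99)
    print("stored orders:", use_case._repository.rows)
```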
14. Your Journey Ends with Your Customer
It's extremely important to realise that everyone who consumes your services, be it an external user or an internal team, is your customer. You are eventually generating value for someone. Your journey does not end with implementing a distributed system; it ends when the consumers of your distributed system or microservice credit you with the value you added for them. In the world of distributed systems, backend-heavy teams in particular rarely see first-hand appreciation of the work they do, even though it's the distributed system that is the future of most businesses that want to digitally transform and scale out to ever-increasing consumer markets on the web. That is why it's very important to bring those teams forward and include them in meetings where they can see the result of their work.
The team that contributed most to this work involved Simform's developers Vishvas Prajapati, Vishit Shah, Terrence Shebuel A., Parth Doshi, Pankaj Thakur, Avinash Sutariya and, last but not least (everyone's favourite), Chudasama Brijrajsinh.