Microservices Architectures: 3 Overlooked Considerations
Published originally on Netsil blog
Microservices architectures have become a ubiquitous industry trend because of their promise of speed, agility and scale. Like any major paradigm shift, microservices adoption introduces changes at the architectural, technical and organizational levels, and it is not uncommon to overlook some considerations in the pursuit of microservices-based applications. While containers, orchestration, automation, service definitions, etc. take up the majority of the mindshare, here we have identified 3 often overlooked but critical considerations based on our customer conversations.
An overarching takeaway is that in a microservices architecture, complexity shifts from the code to the network. Service interactions therefore become a much richer source of information about the health, performance and security of microservices applications. Netsil has leveraged this key insight and built the Application Operations Center (AOC) using service interactions as the source of truth. As we explore the considerations below, we will highlight how the AOC helps SREs and operations teams improve reliability and deliver on service-level objectives (SLOs) for their microservices-based applications.
Addressing the Common Requirements of Service-Interactions
In microservices architectures, services interact heavily with each other to fulfill transactions. Each of these service interactions requires some subset of the following common functionality:
- Connection pooling, to avoid the expensive overhead of creating a new connection for every request
- Automatic timeouts and retries for idempotent requests
- Ability to dynamically shift an interaction to another replica of a service when the initial target is experiencing latency. Unlike simple pulse-based load balancing, this goes further by leveraging richer metrics to identify the right instance of a service
- Ability to flexibly route traffic among multiple instances of services. This is useful for blue-green and canary release management
- Support for modern microservices glue protocols such as gRPC and Thrift, which are more efficient on the wire than HTTP and offer request multiplexing and pipelining
- Automatic or manual dependency modeling/mapping for documentation and quick incident response purposes
- Building, exposing and maintaining good APIs and contracts, while constantly addressing forward and backward compatibility
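To make the timeouts-and-retries item concrete, here is a minimal sketch of a retry wrapper with exponential backoff and jitter. The function name and defaults are illustrative, not any particular library's API, and the caveat from the list applies: only retry requests that are idempotent, i.e. safe to repeat.

```python
import random
import time


def with_retries(fn, attempts=3, base_delay=0.1, retriable=(TimeoutError,)):
    """Call fn, retrying on retriable errors with exponential backoff.

    Safe only for idempotent requests: a retried call must not change
    the outcome if an earlier attempt actually succeeded server-side.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retriable:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the error to the caller
            # Exponential backoff with jitter to avoid thundering herds.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

In practice this logic (and the connection pooling above it) is exactly what you would rather get from a proxy layer than reimplement in every service.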
One option is to build all of this functionality into each service. A more efficient option is to employ lightweight proxy services such as linkerd, Envoy or Traefik. From the perspective of operations teams, these additional components are critical dependencies that they should be able to visualize and monitor. The Netsil AOC has the strong advantage of automatically discovering all service interactions involving such intermediate components, without requiring any code change! Operations teams see these components and their dependencies on a real-time map, along with key performance indicators such as latency, throughput and error rates. The picture below shows the Kubernetes L7 proxy as an example of the AOC's complete visibility into hard-to-instrument intermediate services.
Fig 1: Kubernetes Pods Shown on an Application Topology
Think Shared Services Not Shared Libraries
In any microservices application, many services will need some common functions. Caches, key-value stores, distributed lock managers and service discovery are good examples of common functions used by multiple services. In the monolithic world, multiple components could simply use shared libraries for such common functions, and with handy packaging tools such as Docker or Packer, it might be tempting to bundle the shared libraries with every service that uses them. But continuing down the path of shared libraries becomes detrimental for at least two reasons:
- Managing multiple copies of shared libraries and their dependencies soon becomes a nightmare. It also runs afoul of a core principle of microservices: the bounded context of each service
- The situation becomes unsustainable when dev teams start using multiple programming languages and frameworks; it would be an absolute waste to build and maintain copies of the shared libraries in each language
The paradigm of building and using shared services can be seen in Netflix's EVCache, in the popularity of Redis and etcd as key-value stores, and in the growing use of products such as Consul for service discovery.
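As a minimal sketch of the shared-service pattern, the snippet below hides a cache behind a thin client interface that any service, in any language, could reimplement in a few lines. The endpoint and method names are illustrative (not a specific product's API), and the in-memory dict stands in for the network round trip to a real store such as Redis or Memcached; the point is that the caching logic lives in one shared, independently operated service, not in a library bundled into every consumer.

```python
class SharedCacheClient:
    """Thin client for a shared cache service.

    Each polyglot service carries only this small client; the cache
    itself is one shared service that a provider team operates. The
    dict below is a stand-in for the remote store so the sketch runs
    standalone.
    """

    def __init__(self, host="cache.internal", port=11211):
        self.endpoint = (host, port)   # illustrative endpoint, not real
        self._store = {}               # stand-in for the networked store

    def get(self, key):
        # A real client would issue a network request to self.endpoint.
        return self._store.get(key)

    def set(self, key, value, ttl_seconds=60):
        # A real client would send the TTL along with the value.
        self._store[key] = value
```

The consumer team's footprint is a few lines of client code per language, instead of a full library build in each one.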
While it is important to recognize the need for and build shared services, it is equally important to ensure their health, availability and performance. At Netsil, we've come across both providers and consumers of such shared services, who may be on the same or different teams. If you are a consumer team, quite often you don't know the internals of the service itself but are concerned about its latency, throughput and error rates. If you are a provider of shared services, you have the additional concerns of saturation and capacity.
All of these “golden signals” are available to Netsil users irrespective of whether they own the services or are merely consuming them. Since Netsil leverages service interactions as the source of truth, it can monitor and present latency, throughput, error rates and saturation without requiring any code instrumentation of these services! As an example, the picture below captures the golden signals for a shared Memcached service.
Fig 2: Golden Signals for a Memcached Service
Don’t Forget The Circuit Breakers
Failure is inevitable in a distributed system. The role of circuit breakers is to contain the failure and avoid propagating isolated failures to the entire application.
Let's say service A calls service B, which in turn calls service C, to fulfill a particular transaction. If the downstream service C starts experiencing errors or timing out, then service B, and eventually service A, will start failing. In real-world scenarios, when such failures are left unchecked they can have disastrous effects: compromising transaction integrity, causing data inconsistency and resulting in widespread outages across multiple services. If instead service B implements the circuit breaker pattern, the circuit breaker will monitor for failures of service C and, when failures exceed specified thresholds, invoke graceful error handling.
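The B-to-C scenario above can be sketched as a minimal circuit breaker. This is an illustrative toy, not a production library: after a run of consecutive failures the circuit opens and calls fail fast to a fallback, and after a reset timeout one trial call is let through (the "half-open" state). The threshold and timeout values are arbitrary defaults.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: opens after `threshold` consecutive
    failures, fails fast until `reset_timeout` elapses, then allows
    one trial call through (half-open)."""

    def __init__(self, threshold=3, reset_timeout=30.0):
        self.threshold = threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()      # open: fail fast, don't call C
            self.opened_at = None      # half-open: allow a trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0              # success closes the circuit
        return result
```

With this in place, service B degrades gracefully (serving a cached or default response from `fallback`) instead of piling timed-out requests onto a struggling service C.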
While it is important for development teams to implement circuit breakers for their services, it is equally important for the operations team to get alerted on “circuit break” incidents. Netsil provides multiple important features for handling such incidents:
- A real-time auto-discovered map of the entire microservices application. Using this map, operations teams can quickly visualize and understand service dependencies.
- Operations teams can define and monitor KPIs at the service level rather than worrying about individual instances. They get alerted when services start experiencing issues, potentially before a circuit breaker even trips.
- Circuit breakers are commonly implemented using wrapper libraries such as Netflix's Hystrix. Netsil supports gathering metrics from circuit breaker functions via the standard statsd protocol: circuit breaker functions simply send incident metrics using statsd, and Netsil enables analytics and alerting workflows on those metrics.
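As an illustration of the last point, the statsd line protocol is simple enough to emit directly from a circuit breaker's failure handler. The sketch below sends a counter increment over UDP; `8125` is the conventional statsd port, but the collector address is an assumption about your deployment.

```python
import socket


def statsd_increment(metric, host="127.0.0.1", port=8125):
    """Emit a statsd counter increment, e.g. on each circuit-break event.

    The statsd line protocol is plain text: "<metric>:<value>|<type>",
    where type "c" means counter. Fire-and-forget over UDP, so a down
    metrics collector never blocks the application.
    """
    payload = "%s:1|c" % metric
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode("ascii"), (host, port))
    finally:
        sock.close()
    return payload  # returned here only so the sketch is easy to inspect
```

A circuit breaker would call something like `statsd_increment("checkout.circuit_breaker.open")` (a hypothetical metric name) at the moment it trips, which is exactly the signal to alert on.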
Fig 3: Netsil Enables Service-level Monitoring to Alert on Circuit Break Conditions
Conclusion
Microservices bring a lot of value and a lot of change. We hope this blog has put the spotlight on considerations you should take into account. If your experience has surfaced other important considerations, do share them in the comments section below.
We have also highlighted the value of the Netsil Application Operations Center (AOC) for operations teams responsible for the health and performance of microservices applications. You can check out the Netsil AOC here; we look forward to engaging with you on your microservices efforts.