Building Reliable, Scalable and Maintainable Applications
If this creature can appear on the cover of DDIA, why shouldn't I use it as the title image of my summary? :-)

A data-intensive application is typically built from standard building blocks. It usually needs to:

  • Store data (databases)
  • Speed up reads (caches)
  • Search data (search indexes)
  • Send a message to another process asynchronously (stream processing)
  • Periodically crunch data (batch processing)
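
To make two of these building blocks concrete, here is a minimal sketch (not from the book) that puts a read-through cache in front of a toy key-value "database" to speed up repeated reads; the Database and ReadThroughCache classes are invented for this illustration.

```python
# Minimal sketch: a read-through cache in front of a toy database.
# Database and ReadThroughCache are illustrative names, not from DDIA.

class Database:
    """Stand-in for a real datastore; assume reads are slow."""
    def __init__(self):
        self._rows = {"user:1": {"name": "Alice"}, "user:2": {"name": "Bob"}}

    def get(self, key):
        return self._rows.get(key)


class ReadThroughCache:
    """On a cache miss, fetch from the database and remember the result."""
    def __init__(self, db):
        self._db = db
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        value = self._db.get(key)
        self._cache[key] = value
        return value


cache = ReadThroughCache(Database())
cache.get("user:1")              # miss: goes to the database
cache.get("user:1")              # hit: served from the cache
print(cache.hits, cache.misses)  # 1 1
```

A real cache would also bound its size and expire entries; the point here is only how the building blocks compose.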

Three important concerns in most software systems are reliability, scalability, and maintainability:

  • Reliability: The system should work correctly (performing the correct function at the desired level of performance) even in the face of adversity.
  • Scalability: As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.
  • Maintainability: People should be able to work on the system productively in the future.

Reliability

Typical expectations for software to be considered reliable are that:

  • The application performs as expected.
  • It can tolerate user mistakes or unexpected usage of the product.
  • Performance is good enough for the required use case, under the expected load and data volume.
  • The system prevents any unauthorized access and abuse.

In a nutshell: the system should continue to work correctly, even in the face of faults and human errors.

Basically, a reliable system is fault-tolerant or resilient. A fault is different from a failure: a fault is one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.

It's impossible to reduce the probability of faults to zero; it is therefore useful to design fault-tolerance mechanisms that prevent faults from causing failures.

Some approaches for building reliable systems, in spite of unreliable components and human actions, include:

  • Build redundant systems: This approach involves using multiple redundant components to handle failures. In the event of a failure, one of the redundant components can take over to ensure continuity of service.
  • Fault Tolerance: This approach involves designing systems to withstand failures without disruption to the service. This can be achieved through the use of techniques such as failover, replication, and self-healing.
  • Resilience: This approach involves designing systems to quickly recover from failures. This can be achieved through the use of techniques such as rapid recovery, self-healing, and rolling updates.
  • Monitoring and Alerting: This approach involves monitoring the system for failures and triggering alerts when failures occur. This can be used to quickly detect and respond to failures and to improve the overall reliability of the system.
  • Load Testing and Performance Tuning: Load testing and performance tuning are used to ensure that a system can handle the expected load and to identify and resolve performance bottlenecks.
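
As one small, hedged illustration of the fault-tolerance idea above, the sketch below retries a transiently failing operation with exponential backoff so that a single fault does not immediately become a failure for the caller; call_with_retries and flaky_operation are made-up names for this example.

```python
import random
import time

random.seed(42)  # make the demo deterministic


def call_with_retries(operation, attempts=3, base_delay=0.1):
    """Retry a transiently failing operation with exponential backoff.

    A fault (one failed call) is tolerated; only after all attempts are
    exhausted does it become a failure for the caller.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: the fault becomes a failure
            time.sleep(base_delay * (2 ** attempt))  # back off before retrying


def flaky_operation():
    """Stand-in for an unreliable dependency that fails about half the time."""
    if random.random() < 0.5:
        raise ConnectionError("transient network fault")
    return "ok"


print(call_with_retries(flaky_operation))  # "ok", possibly after a few retries
```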


Scalability

As a system grows, there should be reasonable ways of dealing with that growth.

Describing Load

Load can be described by load parameters. The choice of parameters depends on the system architecture. They may be:

  • Requests per second to a web server
  • Ratio of reads to writes in a database
  • Number of simultaneously active users in a chat room
  • Hit rate on a cache.
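
As a rough sketch of how such load parameters might be computed, the snippet below derives requests per second, the read/write ratio, and the cache hit rate from invented counters collected over a one-minute window.

```python
# Hypothetical counters collected over a 60-second measurement window.
window_seconds = 60
http_requests = 12_000
db_reads = 9_000
db_writes = 1_000
cache_hits = 7_500
cache_lookups = 9_000

requests_per_second = http_requests / window_seconds  # web server load
read_write_ratio = db_reads / db_writes               # database load mix
cache_hit_rate = cache_hits / cache_lookups           # cache effectiveness

print(f"{requests_per_second:.0f} req/s, "
      f"read:write = {read_write_ratio:.1f}:1, "
      f"hit rate = {cache_hit_rate:.0%}")
```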

Describing performance

What happens when the load increases:

  • How is the performance affected?
  • How much do you need to increase your resources?

In a batch processing system such as Hadoop, we usually care about throughput: the number of records we can process per second.

Latency and response time: The response time is what the client sees; besides the actual time to process the request, it includes network delays and queueing delays. Latency is the duration that a request is waiting to be handled.

It's common to see the average response time of a service reported. However, the mean is not a very good metric if you want to know your "typical" response time, because it does not tell you how many users actually experienced that delay.

It's better to use percentiles.

  • Median (50th percentile or p50): half of user requests are served in less than the median response time, and the other half take longer than the median.
  • Higher percentiles such as the 95th, 99th, and 99.9th (p95, p99, and p999) are good for figuring out how bad your outliers are.

Percentiles in practice: even when backend calls are made in parallel, the end-user request still needs to wait for the slowest of the parallel calls to complete. The more backend calls an end-user request requires, the higher the chance that at least one of them is slow (see the sketch below).
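
A minimal sketch of both points, using only the standard library and simulated response times: percentiles describe the distribution much better than the mean, and the chance that at least one of several parallel backend calls is slow grows quickly with the number of calls.

```python
import random
import statistics

random.seed(0)
# Simulated response times in milliseconds: mostly fast, with a slow tail.
response_times = [random.expovariate(1 / 50) for _ in range(10_000)]


def percentile(samples, p):
    """p-th percentile (0-100) by nearest rank on the sorted samples."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(len(ordered) * p / 100))
    return ordered[index]


print(f"mean = {statistics.mean(response_times):.0f} ms")
print(f"p50  = {percentile(response_times, 50):.0f} ms")
print(f"p95  = {percentile(response_times, 95):.0f} ms")
print(f"p99  = {percentile(response_times, 99):.0f} ms")

# Tail latency amplification: if each backend call is slow with
# probability 1%, a request that fans out to n parallel calls is slow
# with probability 1 - 0.99**n.
for n in (1, 10, 100):
    print(f"{n:3d} backend calls -> {1 - 0.99 ** n:.0%} chance of a slow request")
```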

Approaches for coping with load

  • Scaling up or vertical scaling: increasing the power of a single machine.
  • Scaling out or horizontal scaling: distributing the load across multiple smaller machines.
  • Elastic systems: automatically adding computing resources when a load increase is detected. Quite useful if the load is unpredictable.


Maintainability

Design software in a way that minimizes pain during maintenance, so that we do not create legacy software ourselves. Three design principles for software systems are:

Operability: Making Life Easy For Operations

Operations teams are vital to keeping a software system running smoothly. A system is said to have good operability if it makes routine tasks easy, allowing the operations team to focus its efforts on high-value activities. Data systems can make routine tasks easy by, for example:

  • Providing visibility into the runtime behavior and internals of the system, with good monitoring.
  • Providing good support for automation and integration with standard tools.
  • Providing good documentation and an easy-to-understand operational model ("If I do X, Y will happen").
  • Self-healing where appropriate, but also giving administrators manual control over the system state when needed.
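
As one hedged illustration of the first point above (visibility into runtime behaviour), the sketch below wraps a request handler and emits one structured log line per request that a monitoring pipeline could ingest; the handle_request wrapper and the field names are invented for this example.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("request_metrics")


def handle_request(path, handler):
    """Run a request handler and log a structured metrics line for it."""
    start = time.monotonic()
    status = "ok"
    try:
        return handler()
    except Exception:
        status = "error"
        raise
    finally:
        logger.info(json.dumps({
            "path": path,
            "status": status,
            "duration_ms": round((time.monotonic() - start) * 1000, 2),
        }))


handle_request("/users/1", lambda: {"name": "Alice"})
# Logs something like: {"path": "/users/1", "status": "ok", "duration_ms": 0.01}
```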

Simplicity: Managing Complexity

Reducing complexity improves software maintainability, which is why simplicity should be a key goal for the systems we build.

This does not necessarily mean reducing the functionality of a system; it can also mean reducing accidental complexity. Complexity is accidental if it is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation.

Abstraction is one of the best tools that we have for dealing with accidental complexity. A good abstraction can hide a great deal of implementation detail behind a clean, simple-to-understand façade.
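
As a small, hypothetical illustration, the sketch below hides two very different storage implementations behind the same minimal put/get façade, so callers never see whether data lives in memory or in a file; all class and function names are invented for this example.

```python
import json
import os
import tempfile


class InMemoryStore:
    """One implementation: keeps everything in a dict."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class FileStore:
    """Another implementation: persists to a JSON file."""
    def __init__(self, path):
        self._path = path

    def put(self, key, value):
        data = self._load()
        data[key] = value
        with open(self._path, "w") as f:
            json.dump(data, f)

    def get(self, key):
        return self._load().get(key)

    def _load(self):
        if not os.path.exists(self._path):
            return {}
        with open(self._path) as f:
            return json.load(f)


def register_user(store, user_id, name):
    """Callers only see put/get; the storage details stay hidden."""
    store.put(f"user:{user_id}", name)


for store in (InMemoryStore(),
              FileStore(os.path.join(tempfile.gettempdir(), "kv_demo.json"))):
    register_user(store, 1, "Alice")
    print(store.get("user:1"))
```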

Evolvability: Making Change Easy

Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change.


References and credits:

  1. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, by Martin Kleppmann
  2. GIF Images : GIPHY - Be Animated
  3. Title Image: Video by Magda Ehlers from Pexels

