I am often asked what the best way to design a system is. My point of view is below:
- Executive sponsor: Know who the exec sponsor is and get their blessing for the initiative. The sponsor can be a client exec, your line-of-business exec, an internal exec from a consuming organization, etc. Without this sponsorship, everything else will fail.
- Collect your use cases. At IBM we have a use case library across industries to prevent reinventing the wheel. The use cases must come from users at all strata of the organization. Be exhaustive and collect the full universe of possibilities.
- Separate the wheat from the chaff by winnowing down the use cases to those that really align with the organization's position and aspirations in the marketplace. Come up with a smaller, distilled set of use cases and put them in the product backlog.
- Almost all solutions are distributed these days. Re-learn the CAP theorem and understand what the performance goals are for the selected use cases. Since network partitions will occur in any given environment, when one happens we can have consistency OR availability, but not both.
- List the non-functional requirements:
- Performance
- Peak load scalability
- Rate limiting on APIs
- Third party API integration
- Security - Zero trust or otherwise, real time monitoring, security operations center
- Resilience
- Authentication
- Logging
- Monitoring
- Tracing
- Elasticity
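As an illustration of the "Rate limiting on APIs" requirement in the list above, here is a minimal sketch of a token-bucket limiter. It is one common approach, not the only one; the class name, the parameters, and the injectable clock are assumptions made for this example.

```python
import time


class TokenBucket:
    """Token-bucket rate limiter (illustrative sketch).

    Allows bursts of up to `capacity` requests and refills
    at `rate` tokens per second.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock            # injectable so tests can fake time
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True if a request costing `cost` tokens may proceed."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill for the time that has passed, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In a real API gateway one bucket would typically be kept per client or per API key, and a rejected call would be answered with HTTP 429.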
6. Define major system constraints and boundaries: define the traffic and peak load that your packaged or custom software will be handling. Are these constraints on performance, usability, scalability, etc.?
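A back-of-the-envelope calculation can make the traffic and peak load figures concrete. The sketch below assumes you know the peak requests per second, the measured throughput of one server, and the average latency. The function names, the 70% utilization target, and the numbers in the usage lines are hypothetical.

```python
import math


def servers_needed(peak_rps, per_server_rps, target_utilization=0.7):
    """Servers required to absorb peak load while leaving headroom."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_rps = per_server_rps * target_utilization
    return math.ceil(peak_rps / usable_rps)


def concurrent_requests(peak_rps, avg_latency_s):
    """Little's law: requests in flight = arrival rate x time in system."""
    return peak_rps * avg_latency_s


# Hypothetical figures: 5,000 requests/s at peak, 400 requests/s per server,
# and 200 ms average latency.
print(servers_needed(5000, 400))
print(concurrent_requests(5000, 0.2))
```

The first figure feeds the sizing of the web and application tiers; the second is a rough guide for connection pools and thread counts.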
7. Define the POC: Define the high-level abstract design, the system layers such as UI, Application, Data, and Backups, and the high-level communication protocols between these layers.
8. Identify performance and data bottlenecks and dependencies to arrive at a list of scalable design components, such as:
- Number and type of web servers that must be connected to the internet
- Load balancer/proxy between the UI and application layers for even load distribution across application servers. AWS has both a Layer 4 Network Load Balancer and a Layer 7 Application Load Balancer.
- Horizontal Scaling out based on the defined performance and usability baseline
- Defining backup layer for databases supporting high availability.
- Distributed databases / Database partitioning and sharding
- Internal load balancer between app layer and DB layer
- Cache at the UI layer, cache for common queries, read-ahead for common queries, cache for common data, and cache eviction based on least frequently used (LFU) or least recently used (LRU) policies
- DB selection. For example, the data-structure-friendly Redis for in-memory caching and MongoDB for persistence
- List of third party integration with APIs.
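The least recently used (LRU) policy mentioned in the caching bullet above can be sketched in a few lines. This is a minimal in-process illustration built on Python's `collections.OrderedDict`, not how Redis or any particular product implements eviction; the class name is an assumption for this example.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-size cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()   # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # "a" is now the most recently used entry
cache.put("c", 3)     # evicts "b", the least recently used entry
```

For caching the results of a pure function, the standard library's `functools.lru_cache` decorator gives the same policy without a hand-written class.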
9. Select the platform given the IT policy, budget constraints, and cloud preference
- On-prem vs. Cloud vs. Hybrid cloud, vs. in-situ processing of data
- Use of IBM Cloud Paks for rapid deployment of sets of functionality (such as MQ, API Connect, and App Connect) to create an integration platform with just one install of the Cloud Pak
- Wireframes for UI with IBM Carbon design system, Vue and Figma. [1]
- Global Load Balancing (GSLB) - Geography-based load balancing allows the client to be directed to the optimal datacenter location. [2]
- Failover load balancing will send all requests to the first host listed until the load balancer determines that particular host is no longer available. It will then direct traffic to the next node in the list in the order specified. [2]
- Create an internal load balancer and register the database servers, and app servers with this. Database servers receive requests from this internal load balancer.
- IBM Cloud offers classic application and network load balancers. For VPC infrastructure, there are two varieties of load balancers: application load balancers for VPC and network load balancers for VPC. For classic infrastructure, IBM Cloud offers several options including IBM Cloud Load Balancer and Citrix NetScaler appliances. [1]
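The failover behaviour described above (send everything to the first listed host until it becomes unavailable, then move to the next host in order) can be sketched as follows. The host names and the health-check callback are hypothetical; a real load balancer would run its own periodic health probes.

```python
def pick_host(hosts, is_healthy):
    """Return the first healthy host in the configured order.

    `hosts` is an ordered list of host names and `is_healthy` is a
    caller-supplied function; both are assumptions for this sketch.
    """
    for host in hosts:
        if is_healthy(host):
            return host
    raise RuntimeError("no healthy hosts available")


# Hypothetical pool in priority order.
hosts = ["app-1.example.internal", "app-2.example.internal", "app-3.example.internal"]
down = {"app-1.example.internal"}

# With app-1 marked down, traffic moves to the next host in the list.
target = pick_host(hosts, lambda h: h not in down)
```

Unlike round-robin, this scheme keeps the secondary hosts idle until they are needed, which suits active/passive setups.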
11. Caching: There are four areas where Caching helps
- Performance. The primary requirement for any caching solution is to improve performance, even under high loads. Ideally, it should increase throughput and reduce latency. [3]
- Scalability. A system must respond to load changes promptly. In your fictional shoe company, sudden increases in demand might occur when you run sales promotions or at specific times of the year. Scaling should be automatic and occur without downtime.[3]
- Availability. Any caching solution must be highly available. This helps ensure that your apps can deliver at peak performance, even if component failures occur.[3]
- Support for geographic distribution. It's essential that a caching solution provides the same performance and scaling benefits everywhere in the world. This can be challenging if your data is geographically dispersed.[3]
- RDB, or “Redis Database Backup”, creates point-in-time snapshots of Redis data
- RediSearch. Provides a powerful indexing and querying engine with a full-text search engine.
- RedisBloom. Provides support for probabilistic data structures.
- RedisTimeSeries. Enables you to ingest and query large quantities of data with very high performance.
- Many open source tools have caching built in, for example Solr's solr.search.LRUCache, solr.search.FastLRUCache, and solr.search.LFUCache.
- Use a distributed cache to manage spikes in traffic, cache and serve commonly accessed data to users, help reduce compute load on your databases, locate content geographically closer to users, and provide output caching.
An example of cache management is in the Sterling OMS tool: "The Sterling Order Management reference data caching is implemented by a local, simple, lazy-loading, asynchronous-refresh cache manager.
The cache manager is a lazy-loader in the sense that it does not read in the cacheable reference tables at start up but would instead only cache records as they are being read. The benefit of the lazy-loading strategy is that data is only cached where they are needed.
The cache manager implements a simple cache management policy. Data that is cached remains in the cache until the cache manager is instructed to flush the cache. This could happen because the cache has reached a certain size limit or a reference data record was changed from a standard Sterling Order Management API. The cache manager does not implement cache management policies, such as record flushing using a least recently used algorithm, in order to avoid cache management overheads. In our controlled test, this simple cache manager provides significant performance benefits with little management overhead.
In keeping with the simple cache strategy, when a reference data record is changed by a Sterling Order Management API, the local cache manager notifies all the other cache managers to flush the reference data table. There is a small time-lag between when the reference data is changed to when the last cache manager is notified.
When the cache managers receive the change notification, the cache managers flushes all the cached entries for the affected table. As a result, you should cache tables that are infrequently changed."
- Use session store to help facilitate eCommerce shopping carts, store user cookies, maintain user login and session state data, and enable IoT telemetry.
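The lazy-loading, flush-on-change policy described in the Sterling passage quoted above can be illustrated with a short sketch. This is not the Sterling implementation: it leaves out asynchronous refresh and the notification of other cache managers, and the class, method, and table names are hypothetical.

```python
class ReferenceDataCache:
    """Lazy-loading cache with a simple flush policy (illustrative only).

    Records are loaded on first read and stay cached until the table is
    flushed or the size limit is reached. No LRU bookkeeping is kept.
    """

    def __init__(self, loader, max_entries=1000):
        self._loader = loader          # loader(table, key) -> record
        self._max_entries = max_entries
        self._tables = {}              # table name -> {key: record}
        self._size = 0

    def get(self, table, key):
        records = self._tables.setdefault(table, {})
        if key not in records:
            if self._size >= self._max_entries:
                # Simple policy: flush everything rather than pick victims.
                self.flush_all()
                records = self._tables.setdefault(table, {})
            records[key] = self._loader(table, key)   # lazy load on a miss
            self._size += 1
        return records[key]

    def flush_table(self, table):
        """Call when any record in `table` changes."""
        self._size -= len(self._tables.pop(table, {}))

    def flush_all(self):
        self._tables.clear()
        self._size = 0
```

Because one change flushes every cached entry for the affected table, this design only pays off for tables that change rarely, which is the point the quoted passage closes on.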
12. Design Alternatives Considered with Pros/Cons and Costing: A table of alternatives with their pros, cons, and costs is required to make an unbiased decision.
13. If you build it, you run it.
In the new paradigm of cloud-based development, the approach shifts to design, build, run, and maintain. Here are the key areas to be addressed, following the twelve-factor app methodology:
- Codebase - use version control: one codebase tracked in revision control, many deployments. [5]
- Dependencies - use a package manager and don't commit dependencies to the codebase repository.
- Config - store the config in environment variables. If you have to repackage your application to change its configuration, you're doing it wrong.
- Backing Services - a deploy of the twelve-factor app should be able to swap out a local MySQL database with one managed by a third party (such as Amazon RDS) without any changes to the app’s code.
- Build, Release, Run - the twelve-factor app uses strict separation between the build, release, and run stages. Every release should always have a unique release ID, and releases should allow rollback.
- Processes - execute the app as one or more stateless processes. Twelve-factor processes are stateless and share-nothing.
- Port Binding - export services via port binding. The twelve-factor app is completely self-contained.
- Concurrency - scale out via the process model. Each process should be individually scalable. With Factor 6 (stateless processes) in place, it is easy to scale the services.
- Disposability - maximize robustness with fast startup and graceful shutdown. We can achieve this with containers.
- Dev/Prod Parity - keep development, staging, and production as similar as possible. The twelve-factor app is designed for continuous deployment by keeping the gap between development and production small.
- Logs - treat logs as event streams. A twelve-factor app never concerns itself with routing or storage of its output stream.
- Admin Processes - run admin/management tasks as one-off processes.
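As an illustration of the Config and Backing Services factors in the list above, the sketch below reads all deployment-specific settings from environment variables, so the same build can point at a local database or a managed one without code changes. The variable names (`DATABASE_URL`, `CACHE_URL`, `LOG_LEVEL`) and the defaults are assumptions for this example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    cache_url: str
    log_level: str


def load_settings(env=None):
    """Build Settings from the environment (or a dict passed in for tests)."""
    env = os.environ if env is None else env
    return Settings(
        # Required: fail fast with a KeyError if the deploy forgot to set it.
        database_url=env["DATABASE_URL"],
        # Optional values fall back to local-development defaults.
        cache_url=env.get("CACHE_URL", "redis://localhost:6379/0"),
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

Swapping a local MySQL instance for a managed service then becomes a change to `DATABASE_URL` in the deployment, not a rebuild of the application.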
Conclusion:
Systems design for modern applications has more moving parts than ever before. A diligent approach will yield positive results: a system that scales, is fault tolerant, and performs to users' expectations despite failures of individual nodes and peak loads such as the holiday season or a catastrophe.
References:
1. https://pages.github.ibm.com/w3ds/w3ds/?path=/story/vue_navigation-top--standard
2. https://www.ibm.com/cloud/load-balancer
3. https://learn.microsoft.com/en-us/training/modules/intro-to-azure-cache-for-redis/2-what-is-azure-cache-for-redis
4. https://developers.redhat.com/blog/2018/06/28/why-kubernetes-is-the-new-application-server#empowering_your_application
5. https://developers.redhat.com/blog/2017/06/22/12-factors-to-cloud-success
6. https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#what-is-the-horizontal-pod-autoscaler