Multi-tenant SaaS Architecture - the Right Way
Introduction
A few years ago I founded a start-up and created a SaaS product. It was a multi-tenant architecture with multiple user pools. When I re-evaluated the system architecture we used, I realized that it was not as robust as it could have been. If I were to repeat the same exercise, I would use the emerging architectural patterns, tools, best practices, and solutions. I would also take advantage of the PaaS and IaaS providers that we have today. Here is another take on the architecture of the multi-tenant SaaS solution that we built a while ago.
Problem statement: How to build a safe and scalable multi-tenant SaaS solution?
Intro, again: One can deploy a SaaS app in a siloed mode, where the app, the access to the app, and the storage are dedicated to a single tenant. One can also deploy the app in a pooled mode, where the app, the storage, and access control are shared across multiple tenants. There are obvious pros and cons: sharing resources across tenants lowers costs in a pooled implementation, while data isolation is better in a siloed deployment. In other situations, a hybrid method is used, where certain user pools (say, Free or Trial users) are in a pooled service and Paid or Special user pools are part of a siloed implementation.
Before we get into the conversation about the best architecture, I would like to talk about the historical tension between Minimum Viable Product (MVP) and Minimum Viable Architecture (MVA). While there is immense discussion in the industry about MVP, not much is written about MVA. In my opinion, MVA reinforces MVP. MVA also enhances customer outcomes and experiences. One cannot sacrifice MVA in the name of MVP. Both need to be balanced.
In this article, I would like to discuss how I would modify the architectural solution in the context of Tenant onboarding, Tenant management, User management and experience, Operations and metering, User authentication and authorization, Security, Application, Microservices decomposition and strategies, Data partitioning, Tenant isolation, Tenant provisioning, Provisioned environment, and Tech Stack.
Tenant Onboarding
The first step is onboarding a tenant. What is a basic tenant onboarding experience? Let's say you are a new tenant. You go to the website, enter your details, choose the tier you belong to, and sign up. You get an email stating that the registration succeeded, along with a temporary password. Once you click the link in the email, it takes you to the app site, lets you log in, and prompts you, as a first step, to change the password. This is a standard and typical experience.
Many PaaS providers like Amazon provide these basic functionalities in their standard tools. In AWS, for example, Cognito provides them.
But there is heavy lifting happening here. The moment a tenant tries to register, a service, let's call it the Tenant Registration Service, kicks in. This Tenant Registration Service orchestrates all the other services that register the tenant and creates the essential security, authentication, identification, and relevant policies.
User Management
Though the tenant may look like a single person, in SaaS parlance a tenant is an org that has a bunch of users, i.e. a user pool that comes in a variety of flavors. So, in the first step, you create the user and the relevant identity profile for the user by calling the User Management Service. In this step, you create (i) a user pool, (ii) an admin pool, (iii) maybe an identity pool, and (iv) any special claims or provisions. While the users are created using a single tool like Cognito in AWS, an OpenID Connect provider is used to provide seamless integration and binding of user identity with a tenant. In a SaaS environment, a user is always bound to a tenant. In other words, a user identity is always in the context of a Tenant ID. A User Pool is a set of users with a set of predefined policies associated with them. User Pool policies include, for example, MFA (multi-factor authentication). Identity pools allow, for example, a federated identity policy.
We are not done with user management yet. How do we get tenant isolation? We could get it by provisioning IAM (Identity and Access Management). IAM allows us to create all the possible user roles (and related policies) and admin roles (and related policies).
User Experience
When building a multi-tenant SaaS app, it is important to architect the system with the user experience in view. The tech stack and implementation should be such that the user interface gives a great experience. At the same time, the design of the UI should cater to the experience of (Tenant) Users, System Admins, and Tenant Admins. Authentication for each of these types of users is important because each has different access limits, and accidental misuse by each has a different impact on operations.
Tenant Management
Once we are done with the user registration, we get to the tenant registration by calling Tenant Management Service. This service creates a tenant with a Tenant ID, Plan, and relevant status (active, inactive, suspended, etc.).
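As a rough illustration of the tenant record described above, here is a minimal in-memory sketch. The class names, the plan strings, and the use of a random hex string as the Tenant ID are assumptions made for the example, not details of the system discussed in this article.

```python
from dataclasses import dataclass, field
from enum import Enum
import uuid


class TenantStatus(Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"
    SUSPENDED = "suspended"


@dataclass
class Tenant:
    name: str
    plan: str  # hypothetical plan names, e.g. "free" or "paid"
    status: TenantStatus = TenantStatus.ACTIVE
    # Assumption: a random 32-character hex string serves as the Tenant ID.
    tenant_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class TenantManagementService:
    """In-memory stand-in for a real tenant store."""

    def __init__(self):
        self._tenants = {}

    def create_tenant(self, name, plan):
        tenant = Tenant(name=name, plan=plan)
        self._tenants[tenant.tenant_id] = tenant
        return tenant

    def set_status(self, tenant_id, status):
        self._tenants[tenant_id].status = status
        return self._tenants[tenant_id]

    def get(self, tenant_id):
        return self._tenants[tenant_id]
```

A real service would persist the record and validate the plan, but the shape (Tenant ID, plan, status) is the same.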
Operations and Metering for Metrics/Analytics, Management, and Billing
Every SaaS application needs to keep access permissions under control and tracked. Metrics and analytics help SaaS providers and tenants plan, track, and evaluate additional plans based on usage by users.
Tracking user access to various services and systems in the SaaS app is important. Let’s say we are building a retail SaaS system where multiple tenants are selling their wares and multiple users visit the site to place orders. In metering, users access systems like Order Management, Product Catalogues, Delivery Time, and Price Comparison through API Gateways, which in turn talk with a Tenant Manager. Metering differs for each tenant according to its specific SLAs, so a Tenant Manager with different policies needs to trigger the right metrics.
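The per-tenant metering idea can be sketched as follows. This is only an illustration: the tier names, the call allowances, and the `Meter` class are hypothetical, and a production system would record usage at the API Gateway and store it durably.

```python
from collections import Counter

# Hypothetical per-tier API call allowances derived from each tier's SLA.
TIER_LIMITS = {"free": 1000, "paid": 100000}


class Meter:
    """Counts calls per (tenant, service) and applies the tenant's tier policy."""

    def __init__(self, tenant_tiers):
        self._tiers = tenant_tiers  # tenant_id -> tier name
        self._usage = Counter()     # (tenant_id, service) -> number of calls

    def record(self, tenant_id, service, calls=1):
        self._usage[(tenant_id, service)] += calls

    def total(self, tenant_id):
        return sum(n for (tid, _), n in self._usage.items() if tid == tenant_id)

    def over_limit(self, tenant_id):
        return self.total(tenant_id) > TIER_LIMITS[self._tiers[tenant_id]]
```

Keying usage by both tenant and service lets the same counters feed billing (totals per tenant) and analytics (which services a tenant actually uses).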
Once a tenant and its user pools are created, it is only logical for the SaaS provider to create a billing account for the tenant. There is one drawback to creating the billing account in the same step as the tenant: if, for any reason, the billing account is not created, then user pool creation and tenant creation may get stuck and give an unpleasant experience to the users and tenants. They cannot create an account, cannot log in, and cannot start using the system. We found a best practice here. To detach billing account creation from tenant and user creation, it is a good idea to drop the billing account request into a queue for a Billing Account Manager to process. Even if the billing account is delayed by a few minutes or hours, the customer experience is not unpleasant. Triggered by a cron job or by its own availability, the Billing Account Manager service kicks in, creates a billing account for the tenant, and binds it to the tenant. Additionally, billing systems are often third-party systems, and it is a good idea to have some fault tolerance built in.
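The decoupling described above might look like the following sketch. It uses Python's in-process `queue.Queue` as a stand-in for a real message queue, and `onboard_tenant`, `run_billing_account_manager`, and `BillingUnavailable` are hypothetical names introduced for the example.

```python
import queue


class BillingUnavailable(Exception):
    """Raised by the (hypothetical) billing provider client when it is down."""


def onboard_tenant(tenant_id, billing_queue):
    # Tenant and user pool creation would happen here (omitted).
    # Billing is only enqueued, so a billing outage cannot block sign-up.
    billing_queue.put(tenant_id)
    return {"tenant_id": tenant_id, "status": "active", "billing": "pending"}


def run_billing_account_manager(billing_queue, create_account):
    """Drain the queue once; failed tenants are re-enqueued for the next run."""
    created = {}
    failed = []
    while True:
        try:
            tenant_id = billing_queue.get_nowait()
        except queue.Empty:
            break
        try:
            created[tenant_id] = create_account(tenant_id)
        except BillingUnavailable:
            failed.append(tenant_id)
    for tenant_id in failed:
        billing_queue.put(tenant_id)  # retried on the next scheduled run
    return created
```

Each scheduled run drains whatever is waiting, and a tenant whose billing call failed simply stays in the queue, so the tenant can keep using the system while billing catches up.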
Authentication and Authorization
Once you create the users and admins for the various tenants, it is time to log into the system, and we need to define the system that facilitates log-in. Identification is such a fundamental part of SaaS that an organization needs to spend enough time on it. Also, identification for a SaaS application cannot rely solely on the standard identification solution offered by the provider. Identification weaves itself through the entire experience of the SaaS application for the user. Identification allows access, influences how the partitioning is implemented, and defines the app services.
Let's say that you go to the web app and enter the credentials. The web app redirects the info to an Auth Manager for authentication. The Auth Manager first checks with the User Manager to understand the auth policy, including (i) which user pool, (ii) which tenant, and (iii) applicable restrictions, like the number of times the user or tenant may access this system. After a successful return from the User Manager, the authentication request is sent to the authentication provider. As an example, Cognito offers this service in AWS. Similar tools exist with other service providers. If the credentials are right, the Auth Manager returns a JWT ID token (JSON Web Token). This single token carries both the user identity and the tenant identity, and it lets you access all the services in the SaaS environment.
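To make the idea of one token carrying both identities concrete, here is a minimal sketch that issues and verifies an HS256-signed JWT using only the Python standard library. It is a simplification under stated assumptions: the `tenant_id` claim name and the shared secret are made up for the example, and managed providers such as Cognito sign tokens with asymmetric keys instead, so in practice you would verify with the provider's public keys through a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: bytes) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_token(user_id, tenant_id, secret, ttl_seconds=3600, now=None):
    now = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    # Assumption: the tenant is carried in a custom "tenant_id" claim.
    claims = {"sub": user_id, "tenant_id": tenant_id, "exp": now + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return signing_input + "." + _b64url(_sign(signing_input, secret))


def verify_token(token, secret, now=None):
    now = int(time.time()) if now is None else now
    try:
        header_b64, claims_b64, sig_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    expected = _sign(header_b64 + "." + claims_b64, secret)
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(claims_b64))
    if claims["exp"] <= now:
        raise ValueError("token expired")
    return claims
```

Every downstream service can then read the user (`sub`) and the tenant from the verified claims instead of trusting a tenant identifier sent separately by the client.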
Security
Two levels of security could be built into the SaaS system. At the app level, after the user is authenticated, the next thing to pay attention to is the tenant details that are used to allow user access. To get the precise access permission, once the user logs into the system and tries to access any service, say Get The List Of Products, the Product Management Service that manages access to the list of products will ask a Token Manager for the Tenant ID. Once it has the Tenant ID, the Product Management Service will ask the Token Manager to share the credentials of the user. With the credentials, the user will be allowed to access the right data. This security process is within-app security. You can get another layer of security at the API Gateway. After the user logs in with appropriate credentials, a custom authorizer further validates the credentials to identify which microservices the user can access, which could be defined in a policy. As an example, in a retail SaaS application such services could be Cart Service, Catalog Service, Billing Service, Credit Card Service, Order Splitting Service, etc. One can implement a Lambda function to get this additional security at the API Gateway.
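A custom authorizer of this kind could be sketched as below. The returned dictionary follows the general shape of an API Gateway Lambda authorizer response (a principal, an IAM policy document, and an optional context), but the tier-to-service mapping, the `tier` and `tenant_id` claims, and the resource path layout are assumptions for the example. The claims are assumed to have been verified already.

```python
# Hypothetical mapping from tenant tier to the microservices it may call.
TIER_SERVICES = {
    "free": {"catalog", "cart"},
    "paid": {"catalog", "cart", "billing", "orders"},
}


def build_policy(principal_id, effect, resources):
    """Build a response in the shape a Lambda authorizer returns."""
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": effect,
                    "Resource": resources,
                }
            ],
        },
    }


def authorize(claims, api_arn_prefix):
    """Allow only the services permitted for the caller's tier; deny otherwise."""
    allowed = TIER_SERVICES.get(claims.get("tier"), set())
    if not allowed:
        return build_policy(claims.get("sub", "anonymous"), "Deny",
                            [api_arn_prefix + "/*"])
    # Assumption: each service is exposed under /<service>/... for any method.
    resources = sorted(f"{api_arn_prefix}/*/{service}/*" for service in allowed)
    policy = build_policy(claims["sub"], "Allow", resources)
    # Pass the tenant on so downstream services do not have to re-derive it.
    policy["context"] = {"tenant_id": claims["tenant_id"]}
    return policy
```

In a real Lambda handler, `api_arn_prefix` would be derived from the incoming event and the claims from the validated token.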
Application
How does one develop a multi-tenant SaaS app that (i) stands the test of time and scale, (ii) is multi-tenant aware, (iii) allows addition/deletion of tenants, (iv) is in continuous evolution to cater to different and changing needs of all the tenants, and (v) addresses diverse needs of multiple types of user pools? While many architectures are available to design a SaaS application, microservices are the most popular among them. We will discuss how to design an SOA (Service Oriented Architecture) for a SaaS app and define microservices.
Microservices Decomposition and Strategies
One of the most difficult tasks is decomposing the app appropriately into microservices. Microservices need to be small, help in the agile development and deployment of solutions, reduce downtime, align with the dev-ops processes, and optimize the consumption of compute and storage resources. The right decomposition strategy helps organizations focus on new features and solutions for a better customer experience, rather than on the upkeep and repair of the system. Microservice decomposition in a single-tenant model is relatively easy compared to a multi-tenant environment. In siloed or single-tenant operations, customer behavior is predictable, there is no accidental access by a nosy neighbor, and the policies to scale the solution are simpler.
App decomposition is complex in a multi-tenant environment. Here, microservices are shared by all the tenants. Also, tenants come in different shapes and sizes (in their user pools, services, access policies, size of the database, and types of users like paid vs. free vs. enterprise, etc.). Tenant isolation affects the decomposition strategy. Also, in a multi-tenant environment, the activities of the tenants are not necessarily predictable. Rapid scaling of some tenants may pose problems for the entire ecosystem. New tenants are constantly being onboarded, and they may have newer expectations and usage styles. How do we minimize the possibility of a service going down? How do we prevent that service from taking down other services or the entire system? Below are some strategies for decomposing a monolithic app into microservices.
Strategy 1: Domain: Decompose based on your standard business domain components. For example, in a retail system, you have domains like Customers, Products, Vendors, Purchase process, Purchase Cart, Reports, Alerts, etc.
Strategy 2: Sub-domain: Look at sub-domains in your business and replicate them into microservices. Extending the above example of a retail system, the sub-domains in the Product domain would be, say, a List of Products, Product Availability, Product Delivery Time, Product Price, etc.
Strategy 3: Event storming: Identify real-life events that could happen and check whether the system could respond to such events. For example, in the above retail scenario, if the customer is placing an order for ten different items, does the system gracefully (i) check the cart contents, (ii) check the product availability, (iii) calculate cost, tax, and delivery time, (iv) process the order, (v) send a confirmation, (vi) track the order, and (vii) close the order?
The above strategies do not consider certain expectations that customers do not explicitly demand: (i) fault isolation and tolerance, (ii) data and tenant isolation, (iii) real-time and batch operations, (iv) bulk operations (a tenant intentionally or accidentally saturates an API endpoint by calling the service an unusually high number of times), (v) resource optimization, (vi) SLAs and metrics, (vii) types and tiers of tenants (Free, Paid, Premium, Legacy, Platinum, Basic), (viii) compute model (if some services are triggered far more frequently than others), and (ix) data partitioning. Each of the above expectations needs a separate decomposition strategy.
When asked “When is painting a classic considered done?”, Leonardo da Vinci said, “It is never done. You just abandon it”. Similarly, when is a decomposition exercise considered done? It is never done. There are always newer situations, newer design patterns, and new tools available that make you reconsider your current decomposition. Designers are always on the lookout for opportunities to split or merge microservices.
One of the pioneers of microservices is Netflix, and I would like to discuss what its architecture looked like around the year 2000 and how it moved to a microservices-based architecture. Although I share Netflix’s journey from monolith to microservices, many companies that are starting now begin with a microservices architecture. Nevertheless, there are many monolithic systems that are successful even today and are waiting to transition to a microservices architecture.
By 2016, Netflix was a leader in subscription internet TV service, licensing or producing Hollywood, independent, and local content, and it had slowly grown to create original content. It had approximately 86 million subscribers, with services accessible in around 190 countries and in multiple languages. The service was accessed from thousands of device platforms, which was indeed the most complex piece of the puzzle.
In the architecture of around 2000, a subscriber's request reached the system through a load balancer. The Netflix system was hosted on a Linux host that ran (i) an Apache web server, (ii) Tomcat, acting as a bridge between the web servers and the application, and (iii) one single massive application called Javaweb. This was connected to a massive Oracle database, which was connected to other Oracle databases. This was a true monolithic architecture: the system was monolithic and the database was monolithic. Any code change took time to upload and test. A memory leak could take a couple of days to detect. When you found one, you needed to undo the changes and start testing again. Meanwhile, other changes were being pushed by developers. Thus, there was a constant modify-test-undo-test cycle.
The database was a bigger issue at Netflix. If the monolithic database went down, then everything went down. If a subscriber could not watch a movie, the entire Netflix experience was in jeopardy. This would affect business. Imagine the impact on the customer experience when families sat down to watch movies together. As more content was made available and as more people watched movies, Netflix would need a bigger database which, in turn, would make the system more vulnerable to a single point of failure.
The monolithic system was difficult to manage because everything was interconnected. There were direct calls to the databases, and many applications referred directly to table schemas. Adding a column to a table would be a massive and delicate project.
Data Partitioning
Storing the data of multiple tenants calls for data isolation, so that there is absolute data privacy and safety and no accidental or intentional access of data by the wrong user.
Data partitioning is important mainly for data privacy and security, and it is comparatively uncomplicated to implement. It is less of an actual partitioning issue and more of a data security issue.
When we store the data of various tenants in a multi-tenant SaaS system, there are a few styles in which one can store it. In the most basic style, the data of all the tenants' users is stored in a single database and schema, and data separation is based on the User ID and Tenant ID. One level up, different tenants have different tables in the same database, with a similar schema. In another variation, each tenant's data is stored under a different schema but in the same database. Using a single database lowers costs but increases the risk of business impact from a database failure. Sharding the pools of users of different tenants can also be used for scaling, for example when the unpaid customer pool grows faster and its data therefore needs to be sharded more often than that of paid customers. Data storage in a multi-tenant system is most secure and most durable when every tenant has its own database. Such an arrangement handles spikes in access and unexpected data growth, though the costs are higher due to separate database licenses and management.
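The most basic style, a single shared table where rows are separated by Tenant ID, can be sketched with SQLite from the Python standard library. The `products` table and the helper functions are hypothetical; the point is that every read and write is scoped by `tenant_id`.

```python
import sqlite3


def open_pooled_store():
    """Create an in-memory database with one table shared by all tenants."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products ("
        " tenant_id TEXT NOT NULL,"
        " product_id INTEGER NOT NULL,"
        " name TEXT NOT NULL,"
        " PRIMARY KEY (tenant_id, product_id))"
    )
    return conn


def add_product(conn, tenant_id, product_id, name):
    conn.execute(
        "INSERT INTO products (tenant_id, product_id, name) VALUES (?, ?, ?)",
        (tenant_id, product_id, name),
    )


def list_products(conn, tenant_id):
    # The tenant filter is applied inside the data access function,
    # so callers cannot forget it.
    rows = conn.execute(
        "SELECT name FROM products WHERE tenant_id = ? ORDER BY product_id",
        (tenant_id,),
    )
    return [name for (name,) in rows]
```

Making the Tenant ID part of the primary key lets two tenants reuse the same product IDs without colliding. The table-per-tenant, schema-per-tenant, and database-per-tenant styles move the same separation from the query into the storage layout.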
Tenant Isolation
Consider this. Could we accomplish tenant isolation by implementing identification techniques alone? Many think that identification techniques are enough to prevent users from accessing the data of a different tenant, that it would suffice to correctly identify the user and figure out the right access permission. That is not true. Additional isolation techniques, beyond authentication, need to be implemented to safeguard tenants' data from accidental access by the wrong users. For example, we can map the User ID to a Tenant ID and an Account ID. User roles could also be bound by federated identity policies that allow user pools to access multiple services using a single login.
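One way to picture isolation beyond authentication is a guard that checks both the user-to-tenant binding and the ownership of the requested resource. The lookup tables and the `fetch_order` function below are hypothetical stand-ins for real stores and services.

```python
class CrossTenantAccess(Exception):
    """Raised when a request would cross a tenant boundary."""


# Hypothetical bindings: user -> tenant, and resource -> owning tenant.
USER_TENANT = {"alice": "t1", "bob": "t2"}
ORDER_TENANT = {"order-100": "t1", "order-200": "t2"}


def fetch_order(user_id, claimed_tenant_id, order_id):
    # Check 1: the tenant in the token must match the stored user binding.
    if USER_TENANT.get(user_id) != claimed_tenant_id:
        raise CrossTenantAccess("user is not bound to this tenant")
    # Check 2: the resource itself must belong to the caller's tenant.
    if ORDER_TENANT.get(order_id) != claimed_tenant_id:
        raise CrossTenantAccess("order belongs to another tenant")
    return {"order_id": order_id, "tenant_id": claimed_tenant_id}
```

An authenticated user such as `bob` is still refused `order-100`, because being correctly identified does not by itself place the order inside his tenant.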
Tenant Provisioning
Another important architectural component that designers pay attention to is Tenant Provisioning. In the case of a multi-tenant system with multiple user pools, provisioning addresses access to (i) apps and services, (ii) databases, (iii) policies regarding sharding, (iv) policies regarding the movement of data across databases, etc.
Provisioned Environment
Let's say that you have developed the SaaS app. What does the provisioned environment look like? Let's take the example of an AWS environment. One can pick a couple of availability zones for high durability, availability, stability, and fail-over. For the application hosting, we used Amazon S3 with the API Gateway, which allows streamlined access with a Lambda function as the custom authorizer. Next, we have a public and a private subnet, a typical best practice. For example, in AWS, the NAT gateway sits in the public subnet, and from there the load balancer forwards traffic to the private subnet where the main application services are hosted. These app services are hosted in ECS services that can auto-scale. This is the typical app architecture one sees when HA (high availability) app services are a must.
Tech Stack
If I were to revamp the architecture of my old app, I would use the following tech stack:
Conclusion
I want to leave you with these thoughts. When you are about to start the journey of developing a multi-tenant system, start with a broad vision of the future of the app and the customers. When you are working on the architecture, make sure that you have clear thoughts on Data Isolation, Tenant Isolation, Metrics, Authentication, and Authorization policies. I believe that architecture is a journey and it never ends. Remember, today’s microservice is tomorrow’s monolith.