Don’t make a mesh (unless you have to…)
Chris Pedder
Chief Data Officer @ OBRIZUM | Board advisor | Data transformation leader | Posting in a personal capacity.
Apologies for the punny title, it’s a bit clickbaity, but I want to talk a bit about one of the current hypes in software and data: meshes. In software engineering, meshes are everywhere. The first iteration of this new approach to design came as a logical extension of the software architecture world moving towards microservices. In case you don’t know, microservices are a way of splitting the components of your software system into atomised pieces, each responsible for a single piece of the puzzle. Essentially, this is object-orientation at the service level.
As an example, imagine we want to write a piece of software to do bank transfers, broken down into a handful of steps: validate both accounts, check the sender has enough money, move the funds, and notify both parties.
In a monolithic application, each of these steps might be done inside a different object in the code, but ultimately the code runs as one piece on a single physical or virtual machine. If one part breaks, everything breaks. In a microservices architecture, each of these jobs would be handled not simply by a different object, but by a different “service” which lives independently from the rest of the routines.
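To make that concrete, here is a toy sketch of the same transfer written monolith-style and microservice-style. The service names, endpoints and helper functions are all invented for illustration, and the HTTP calls assume the third-party requests library.

```python
# Toy, purely illustrative sketch; service names and endpoints are invented.
import requests  # third-party HTTP client, assumed installed

# --- Monolith: every step is a function call inside one process ---
def validate_account(account): ...          # stubs standing in for real logic
def check_balance(account, amount): ...
def move_money(sender, receiver, amount): ...
def notify(sender, receiver, amount): ...

def transfer_monolith(sender, receiver, amount):
    validate_account(sender)
    validate_account(receiver)
    check_balance(sender, amount)
    move_money(sender, receiver, amount)
    notify(sender, receiver, amount)

# --- Microservices: each step lives behind its own independently deployed service ---
def transfer_microservices(sender, receiver, amount):
    for url, payload in [
        ("http://accounts-svc/validate", {"account": sender}),
        ("http://accounts-svc/validate", {"account": receiver}),
        ("http://ledger-svc/check-balance", {"account": sender, "amount": amount}),
        ("http://ledger-svc/transfer", {"from": sender, "to": receiver, "amount": amount}),
        ("http://notify-svc/send", {"to": [sender, receiver], "amount": amount}),
    ]:
        requests.post(url, json=payload, timeout=5).raise_for_status()
```

The logic is identical; what changes is that every step now crosses a network boundary and can be deployed, scaled and broken independently.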
Of course, there’s no such thing as a free lunch, so in return for the oft-vaunted advantages of microservices (reliability, scalability, adaptability), we should expect that there’s a downside too. The main overhead is that we now have to make sure all of these services can communicate with one another.
In practice, this has meant developers becoming experts in API calls (or other, faster interfaces like gRPC), and also in the extensive use of SSL certificates or API tokens to allow services to communicate securely. Since a lot of these things are (fairly) standardised, this has meant an explosion in the amount of code (mostly boilerplate-ish) which is needed to deploy a service. If only there were a better way…
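For a flavour of what that boilerplate looks like, here is a minimal, hypothetical sketch of one service calling another over HTTPS with a bearer token. The URL, the token source and the retry policy are all invented; this is just the shape of the thing.

```python
# Hypothetical service-to-service call: auth, TLS verification, retries, timeouts.
import os
import requests  # third-party HTTP client, assumed installed

def call_ledger_service(payload: dict) -> dict:
    token = os.environ["LEDGER_API_TOKEN"]           # issued out-of-band, rotated regularly
    headers = {"Authorization": f"Bearer {token}"}
    for attempt in range(3):                          # naive retry loop
        try:
            resp = requests.post(
                "https://ledger-svc.internal/transfer",
                json=payload,
                headers=headers,
                timeout=2.0,
                verify=True,                          # check the server's TLS certificate
            )
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == 2:
                raise
```

None of this is business logic, yet every service ends up carrying something like it, which is exactly the load a mesh promises to take off your hands.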
Enter, stage left, the service mesh. It’s really a set of pre-built plumbing (as the name suggests) which you can plug your microservices into. By default, everything is off, so you have to define a set of rules which allow services to communicate with one another. There are two important scaling factors here - the overall number of services, and how many of the other services a particular one is likely to communicate with. It’s a clever idea, and it takes the load off, provided the number of services in your mesh isn’t so small that you could get by with direct API calls anyway. But there’s another, potentially hidden complexity: you want to architect your system so that you don’t need to touch thousands of rules to add or refactor a service. This idea of “weak coupling” is key to good microservices architecture, so that’s not usually too much of an issue. Worst case scenario, there are some more strongly coupled components, but those couplings are explicit, so you can track them easily, right? That’s what the service layer is all about.
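In a real mesh these rules live in the control plane’s configuration (Istio, Linkerd and friends each have their own formats); the toy Python below, with invented service names, just illustrates the bookkeeping and why weak coupling keeps the rule count manageable.

```python
# Toy allow-list of the kind a mesh control plane enforces: caller -> allowed callees.
# Service names are invented; real meshes express this as declarative config.
mesh_rules = {
    "transfer-api": {"accounts-svc", "ledger-svc", "notify-svc"},
    "accounts-svc": {"audit-svc"},
    "ledger-svc":   {"audit-svc"},
    "notify-svc":   set(),
}

def allowed(caller: str, callee: str) -> bool:
    return callee in mesh_rules.get(caller, set())

def rules_touched_by_removing(service: str) -> int:
    """How many rules need editing if we refactor `service` away?"""
    incoming = sum(1 for callees in mesh_rules.values() if service in callees)
    outgoing = len(mesh_rules.get(service, set()))
    return incoming + outgoing

print(allowed("transfer-api", "ledger-svc"))    # True
print(rules_touched_by_removing("audit-svc"))   # 2: weakly coupled, cheap to change
```

The point is that the cost of a change is proportional to how many rules mention the service you’re touching - keep couplings few and explicit and the mesh stays cheap to evolve.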
The idea behind a data mesh is essentially similar, once we restate what a service mesh does in a business-oriented way. A service mesh abstracts away the business logic of a microservice (i.e. who it needs to talk to to achieve its aims) from the code itself. The same idea applied to data says “let’s separate the data storage and operational databases from the information we might seek to find within it through analytics”. Since finding meaning in data is still very much a human-level intelligence task (I have seen scant evidence we will automate data analysts any time soon!), this requires that human beings create the rules for what that data represents. As you might have guessed from the outro to my previous paragraph, this is where couplings can become a problem.
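To give a flavour of what “humans writing down the rules” might look like, here is a hypothetical data-product descriptor. The fields, names and format are invented; real data-mesh implementations each have their own contract conventions.

```python
# Hypothetical "data product" contract: the meaning of the data is written down
# by people, separately from wherever the bytes happen to be stored.
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str
    owner: str                      # a human team accountable for the semantics
    schema: dict                    # column -> plain-language meaning and unit
    consumers: list = field(default_factory=list)

transfers = DataProduct(
    name="completed_transfers",
    owner="payments-analytics",
    schema={
        "amount": "value moved, in minor currency units (pence/cents)",
        "initiated_at": "UTC timestamp when the sender confirmed the transfer",
        "channel": "app / web / branch",
    },
    consumers=["fraud-model", "monthly-revenue-dashboard"],
)
```

The storage can live anywhere; what matters is that the meaning of each field, and who owns it, is written down by a person.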
Whereas API calls in the service layer can be built to be a priori as uncoupled as possible, the data layer is a more complex beast, and there can be strong but implicit couplings in it. You might think that slightly changing the way you collect one variable will have only very local effects in your mesh, affecting only the obviously-correlated components. But you can be wrong in subtle ways, and worse still - there aren’t really good ways to test for that. So you really need diligent, highly data-literate people manning the gates when it comes to defining the components of your data mesh. Just to be clear, I’m absolutely not saying that data meshes don’t work, I’m saying that they are more complex beasts that need more care and maintenance than service meshes. They’re not for the little guys (like where I work) right now.
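To see how subtle this can be, here is a deliberately tiny, invented example: an upstream team changes the unit of one field, nothing crashes, and the downstream metric quietly changes meaning.

```python
# Toy example of an implicit coupling: a unit change upstream is silently
# absorbed downstream, producing a plausible-looking but wrong metric.
transfers_v1 = [{"amount": 12.50}, {"amount": 3.20}]   # amounts in pounds
transfers_v2 = [{"amount": 1250}, {"amount": 320}]     # "same" field, now in pence

def average_transfer(transfers):
    return sum(t["amount"] for t in transfers) / len(transfers)

print(average_transfer(transfers_v1))   # 7.85  - correct
print(average_transfer(transfers_v2))   # 785.0 - no error raised, just a wrong answer
```

A type checker, a schema validator and the mesh’s access rules would all wave this change straight through.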
This brings me to the latest part of the mesh revolution - the ML/AI mesh. This is essentially viewed as an extension of the service mesh, where services can include interacting ML models. The problem here is that we don’t just have a service layer (where model inference sits) or a data layer (where the model source data comes from), we also have a deep, implicit coupling between the two. Machine learning models have to be trained. So the data flowing in the data layer directly affects the performance of the services in the services layer. To be more explicit, let’s consider what happens when we spot an opportunity to improve model performance. We decide to retrain a root-level ML system, using a different architecture to get a 5% uplift in performance of that model. The new model does better in the task, but it introduces new biases into the models downstream of it, which make them perform worse. So we need to retrain that model too. You can see where I’m going with this - we end up with (potentially) combinatorial complexity each and every time we choose to retrain one piece of the architecture. Scary stuff.
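A back-of-the-envelope way to see why this gets expensive: treat the models as a dependency graph (the model names below are hypothetical) and count what a single upstream retrain drags along with it.

```python
# Toy dependency graph between models: retraining one model forces its
# downstream consumers to be revalidated (and often retrained) too.
from collections import deque

downstream = {                        # hypothetical model names
    "embedding-model": ["ranking-model", "tagging-model"],
    "ranking-model":   ["recommendation-model"],
    "tagging-model":   ["recommendation-model", "moderation-model"],
    "recommendation-model": [],
    "moderation-model": [],
}

def retrain_cascade(root: str) -> list:
    """All models potentially affected when `root` is retrained (BFS order)."""
    seen, queue, order = {root}, deque([root]), []
    while queue:
        model = queue.popleft()
        order.append(model)
        for child in downstream.get(model, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order

print(retrain_cascade("embedding-model"))
# ['embedding-model', 'ranking-model', 'tagging-model',
#  'recommendation-model', 'moderation-model']
```

One retrain at the root and suddenly every downstream model needs revalidating - and each of those retrains can trigger the same question again one level further down.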
So, in conclusion: meshes work best in situations where their components are weakly coupled, and coupled only to a few other building blocks. In general, data systems don’t have this property, and so don’t scale in a helpful way when the meshes get large and changes have to be made. Is that the end of the story? Well, to give the classic data scientist answer: it depends. If you can find a way of making your systems weakly coupled, and you can operate in a zone where the combinatorics isn’t too murderous (the Goldilocks zone: not too big, not too small!), then meshes can also work in the data space. They are just a tool, at the end of the day, and we all know that if you find yourself using a rake as a hammer, you’ve probably fallen for the marketing…
Should have said originally: big HT to Debmalya Biswas for starting me thinking about this!
Debmalya Biswas | AI/Analytics @ Wipro | x- Nokia, SAP, Oracle | 50+ patents | PhD - INRIA
Thanks Chris, very interesting article - one of the very few articles (that I have seen) trying to bring together the overlapping concepts of API/Services Mesh, Data Mesh & AI/ML Mesh. As you rightly pointed out, the AI/ML mesh is the most complex as it extends the Service & Data Mesh with "interacting ML models". The combinatorial complexity further increases when we are not only considering a performance improvement of an existing model ("using a different architecture to get a 5% uplift in performance of that model") but a composition of existing services, e.g., a Computer Vision & NLP model, with a new Service layer on top, or reusing their combined training + inference data to train a new model. https://www.dhirubhai.net/pulse/ai-mesh-future-enterprise-debmalya-biswas/ On the +ve side, this would enable maximum reuse & agility in enterprise use-cases. So interesting times ahead I guess :)