Don’t make a mesh (unless you have to…)
Meshes and mountains.

Apologies for the punny title, it’s a bit clickbaity, but I want to talk a bit about one of the current hypes in software and data: meshes. In software engineering, meshes are everywhere. The first iteration of this new approach to design came as a logical extension of the software architecture world’s move towards microservices. In case you don’t know, microservices is a way of splitting the components of your software system into atomised pieces, each responsible for a single piece of the puzzle. Essentially, this is object-orientation at the service level.

As an example, imagine we want to write a piece of software to do bank transfers. Say we want to:

  1. receive a customer’s request to make a bank transfer
  2. check that the request is unique
  3. scan it for indicators of fraud
  4. credit the transferred amount to the recipient’s account
  5. debit the money from the sender’s balance, and finally
  6. reply with a success message.
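
The steps above can be sketched as plain functions; in a monolith they would be objects in one codebase, while in a microservices setup each would run as its own service behind a network interface. This is my own minimal illustration, not production banking logic (the fraud rule in particular is a placeholder):

```python
# Each function stands in for one of the six steps; in a microservices
# architecture, each would be an independently deployed service.

def receive_request(payload):
    # 1. receive a customer's request to make a transfer
    return {"tx_id": payload["tx_id"], "from": payload["from"],
            "to": payload["to"], "amount": payload["amount"]}

def is_unique(tx, seen_ids):
    # 2. check the transfer hasn't already been processed (idempotency)
    return tx["tx_id"] not in seen_ids

def passes_fraud_check(tx):
    # 3. scan for fraud indicators (placeholder rule for illustration)
    return tx["amount"] < 10_000

def credit(accounts, tx):
    # 4. credit the transferred amount to the recipient
    accounts[tx["to"]] = accounts.get(tx["to"], 0) + tx["amount"]

def debit(accounts, tx):
    # 5. debit the sender's balance
    accounts[tx["from"]] -= tx["amount"]

def transfer(payload, accounts, seen_ids):
    tx = receive_request(payload)
    if not is_unique(tx, seen_ids) or not passes_fraud_check(tx):
        return {"status": "rejected"}
    credit(accounts, tx)
    debit(accounts, tx)
    seen_ids.add(tx["tx_id"])
    # 6. reply with a success message
    return {"status": "ok"}
```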

In a monolithic application, each of these steps might be done inside a different object in the code, but ultimately the code runs as one piece on a single physical or virtual machine. If one part breaks, everything breaks. In a microservices architecture, each of these jobs would be handled not simply by a different object, but by a different “service” which lives independently from the rest of the routines.

Of course, there’s no such thing as a free lunch, so in return for the oft-vaunted advantages of microservices (reliability, scalability, adaptability), we should expect a downside too. The main overhead is that we must ensure services can communicate:

  1. fast, and
  2. securely.

In practice, this has meant developers becoming experts in API calls (or other, faster interfaces like gRPC), and also in the extensive use of SSL certificates or API tokens to allow services to communicate securely. Since a lot of these things are (fairly) standardised, this has meant an explosion in the amount of (mostly boilerplate-ish) code needed to deploy a service. If only there were a better way…
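
To make the boilerplate point concrete, here is a hedged sketch of the kind of code every service ends up carrying just to call a sibling service securely: JSON serialisation, a bearer token, an HTTPS call. The URL and token are invented for illustration:

```python
import json
import urllib.request

def build_request(url, payload, token):
    # Serialise the payload and attach the auth boilerplate every
    # service-to-service call needs.
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # token-based auth
        },
        method="POST",
    )

def call_service(url, payload, token):
    # Send over HTTPS (TLS is handled by the standard library) and
    # parse the JSON reply.
    with urllib.request.urlopen(build_request(url, payload, token),
                                timeout=5) as resp:
        return json.loads(resp.read())
```

Multiply this (plus retries, certificate management, token rotation…) by every pair of communicating services, and the appeal of factoring it out becomes obvious.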

Enter, stage left, the service mesh. It’s really a set of pre-built plumbing (as the name suggests) which you can plug your microservices into. By default, everything is off, so you have to define a set of rules which allow services to communicate with one another. There are two important scaling factors here: the overall number of services, and how many of the other services a particular one is likely to communicate with. It’s a clever idea, and it takes the load off, as long as the overall number of services in your mesh is not so small that you could get by just making API calls directly. But there’s another, potentially hidden complexity: you want to architect your system so that you don’t need to touch thousands of rules to add or refactor a service. This idea of “weak coupling” is key to good microservices architecture, so that’s not usually too much of an issue. Worst case, there are some more strongly coupled components, but those couplings are explicit, so you can track them easily, right? That’s what the service layer is all about.
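
The default-deny rule set can be pictured as a toy model (my own illustration, not any real mesh’s API): communication is blocked unless an explicit (caller, callee) pair has been allowed.

```python
class ServiceMesh:
    """Toy default-deny rule set, not a real mesh implementation."""

    def __init__(self):
        self.rules = set()  # allowed (caller, callee) pairs

    def allow(self, caller, callee):
        # Each rule must be declared explicitly -- everything is off
        # by default.
        self.rules.add((caller, callee))

    def can_talk(self, caller, callee):
        return (caller, callee) in self.rules
```

With n services each talking to roughly k peers, you maintain on the order of n × k such rules, which is exactly why the two scaling factors above matter.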

The idea behind a data mesh is essentially similar, once we restate what a service mesh does in a business-oriented way. A service mesh abstracts the business logic of a microservice (i.e. who it needs to talk to to achieve its aims) away from the code itself. The same idea applied to data says “let’s separate the data storage and operational databases from the information we might seek to find within them through analytics”. Since finding meaning in data is still very much a human-level intelligence task (I have seen scant evidence we will automate data analysts any time soon!), this requires that human beings create the rules for what that data represents. As you might have guessed from the outro to my previous paragraph, this is where couplings can become a problem.

Zhamak Dehghani on data meshes

Whereas API calls in the service layer can be built to be a priori as uncoupled as possible, the data layer is a more complex beast, and there can be strong but implicit couplings in it. You might think that slightly changing the way you collect one variable will have only very local effects in your mesh, affecting only the obviously correlated components. But you can be wrong in subtle ways, and worse still, there aren’t really good ways to test for that. So you really need diligent, highly data-literate people manning the gates when it comes to defining the components of your data mesh. Just to be clear, I’m absolutely not saying that data meshes don’t work; I’m saying that they are more complex beasts that require more maintenance than service meshes. They’re not for the little guys (like where I work) right now.

This brings me to the latest part of the mesh revolution - the ML/AI mesh. This is essentially viewed as an extension of the service mesh, where services can include interacting ML models. The problem here is that we don’t just have a service layer (where model inference sits) or a data layer (where the model source data comes from), we also have a deep, implicit coupling between the two. Machine learning models have to be trained. So the data flowing in the data layer directly affects the performance of the services in the services layer. To be more explicit, let’s consider what happens when we spot an opportunity to improve model performance. We decide to retrain a root-level ML system, using a different architecture to get a 5% uplift in performance of that model. The new model does better in the task, but it introduces new biases into the models downstream of it, which make them perform worse. So we need to retrain that model too. You can see where I’m going with this - we end up with (potentially) combinatorial complexity each and every time we choose to retrain one piece of the architecture. Scary stuff.

ML mesh from Google Cloud
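
The retraining cascade described above can be sketched as a graph traversal: models form a dependency graph, and retraining one node means every model downstream of it may need retraining too. A minimal sketch, with the model names invented for illustration:

```python
from collections import deque

def retrain_set(downstream, start):
    """Return all models potentially needing retraining after `start` changes.

    `downstream` maps each model to the models that consume its outputs.
    """
    needed, queue = {start}, deque([start])
    while queue:
        model = queue.popleft()
        for child in downstream.get(model, []):
            if child not in needed:
                needed.add(child)
                queue.append(child)  # the cascade propagates downwards
    return needed
```

Retraining a root-level model drags in its whole downstream cone; retraining a leaf touches only itself. That asymmetry is the scary part.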

So, in conclusion: meshes work best in situations where their components are weakly coupled, and coupled only to a few other building blocks. In general, data systems don’t have this property, and so don’t scale in a helpful way when the meshes get large and changes have to be made. Is that the end of the story? Well, to give the classic data scientist answer: it depends. If you can find a way of making your systems weakly coupled, and you can operate in a zone where the combinatorics isn’t too murderous (the Goldilocks zone: not too big, not too small!), then meshes can also work in the data space. They are just a tool, at the end of the day, and we all know that if you find yourself using a rake as a hammer, you’ve probably fallen for the marketing…

Chris Pedder

Chief Data Officer @ OBRIZUM | Board advisor | Data transformation leader | Posting in a personal capacity.

Should have said originally, big HT to Debmalya Biswas for starting me thinking about this!

Debmalya Biswas

AI/Analytics @ Wipro | x- Nokia, SAP, Oracle | 50+ patents | PhD - INRIA

Thanks Chris, very interesting article - one of the very few articles (that I have seen) trying to bring together the overlapping concepts of API/Services Mesh, Data Mesh & AI/ML Mesh. As you rightly pointed out, AI/ML mesh is the most complex as it extends the Service & Data Mesh with "interacting ML models". The combinatorial complexity further increases when we are not only considering a performance improvement of an existing model, "using a different architecture to get a 5% uplift in performance of that model", but a composition of existing services, e.g., a Computer Vision & NLP model, with a new Service layer on top, or reusing their combined training + inference data to train a new model. https://www.dhirubhai.net/pulse/ai-mesh-future-enterprise-debmalya-biswas/ On the +ve side, this would enable maximum reuse & agility in enterprise use-cases. So interesting times ahead I guess :)
