HockeyStick #13 - LLMops in Prod

It was amazing to host Abi Aryan on episode 13 of HockeyStick Show!

We talked about her upcoming book, LLMOps: Managing Large Language Models in Production (O'Reilly), how LLMOps differs from MLOps and ML engineering, the challenges and unique requirements of managing generative models in production, and much more.

Abi is amazing to talk to and I hope you enjoy the episode as much as I enjoyed recording it!

Podcast

Follow HockeyStick Show and find the episodes below:

Web: https://hockeystick.show

YouTube: https://www.youtube.com/@HockeyStickShow

Apple Podcasts: https://podcasts.apple.com/us/podcast/hockeystick-show/id1746365686

Spotify: https://open.spotify.com/show/2jHvvkUkHAU8GRRHKbwAyJ


Summary

Join Miko Pawlikowski on this episode of HockeyStick Show as he interviews Abi Aryan, a leading expert and author on Large Language Model Operations (LLMOps), to distinguish it from Machine Learning Operations (MLOps) and Machine Learning Engineering (MLE). Abi delves into the challenges and unique requirements of managing generative models in production, discusses the evolution and future of LLMOps, and shares insights into her upcoming book, 'LLMOps: Managing Large Language Models in Production.' Gain an understanding of safety, scalability, robustness, and the lifecycle of LLMs, and learn practical steps to effectively deploy and monitor these advanced models.

Transcript

Miko Pawlikowski: [00:00:00] I'm Miko Pawlikowski, and this is HockeyStick.

Miko Pawlikowski: Today, we're talking about LLMOps and how it differs from MLOps, MLE, and other such acronyms. Do we need a new discipline? How different is it really to work with large language models compared to any other piece of software? Why do models deteriorate over time? I'm joined by Abi Aryan, the author of LLMOps: Managing Large Language Models in Production, as well as What Is LLMOps, both published by O'Reilly.

Miko Pawlikowski: Abi is a founder at Abide AI. Welcome to this episode, and thank you for flying HockeyStick.

Miko Pawlikowski: LLMOps versus MLOps versus MLE. Can you tell me what's the difference between the three of them?

Abi Aryan: So in very simple words: MLOps versus LLMOps, those are frameworks. Machine learning engineering is a discipline, or an engineering [00:01:00] practice, I would say; it's more like a role, so I would keep that separate from both of those. But let me define the difference between MLOps and LLMOps. Most of the conventional machine learning models that we have seen to date were discriminative models: they were very predictable in the inferences they were making. The models that we are working with right now, large language models, are generative in nature.

Abi Aryan: So one of the core differences between MLOps and LLMOps is what kind of model we are working with. Are we working with a discriminative model, or are we working with a generative model? The big difference really happens because generative models are not really operating at the same scale as conventional machine learning models.

Abi Aryan: The size is much, much bigger, because they need a lot of information to be able to create more information themselves. There are two big [00:02:00] problems: the first is evaluation, and the second is basically the scale, or the size, of the models. With conventional machine learning models, a lot of the focus was 'let's collect the data', then 'let's do feature engineering'. It was very much experimental; we were trying to fit a model to a very specific task. But large language models are task agnostic; they're more generalized models. There's a shift from building task-specific software to building task-agnostic software, and that's where large language models come into play. Anytime you're building any sort of unbounded solution, that comes with its own challenges. So I would say large language model operations is inspired by MLOps, because it shares some of the same things; there's some overlap in the engineering we are doing.

Abi Aryan: Yes, we're still fine tuning the models, even though the fine tuning we're doing is very different. We can't really afford to update all of the weights of the model, so [00:03:00] we're using approaches that only update some of the weights, or we are using other techniques, for example prompt engineering, which is new with these models specifically. Because, again, updating all of the weights of the model during fine tuning, which is very similar to training itself, is very costly.
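To make the parameter-efficient idea concrete, here is a minimal sketch using the Hugging Face transformers and peft libraries; the model name and hyperparameters are illustrative assumptions, not recommendations from the episode.

```python
# Minimal LoRA setup: freeze the base model and train small low-rank
# adapter matrices instead of updating all of the weights.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed base model

config = LoraConfig(
    r=8,                                   # rank of the low-rank update
    lora_alpha=16,                         # scaling applied to the update
    target_modules=["q_proj", "v_proj"],   # adapt only the attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```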

Miko Pawlikowski: Okay. So you're blowing my mind a little bit. I thought the answer was 'LLMOps is a niche within MLOps', and we could just leave it at that. But it sounds like there's more to it. How much of that is fashion? People who work on the fancy new LLMs who don't want to be called old-fashioned machine learning engineers?

Abi Aryan: It's a complicated thing. I don't think you essentially ever work on a technology. There are people who work on technology for the sake of technology; I would call those people researchers, which is what ML scientists essentially are. But then the next step is people who are working on a technology because [00:04:00] it's solving a problem. So whether it's using a very simple decision tree, or whether it's using CatBoost or XGBoost, I don't think there needs to be much difference in terms of how people approach these technologies, because the focus people need to have is: this is the problem we are trying to solve. What kind of problem is it? Is it a discriminative problem, or is it a generative problem? If it's a generative problem, yes, I'm going to implement this technology. But it doesn't really mean that those two fields are in competition with each other; I think they complement each other.

Miko Pawlikowski: You picked LLMOps as the topic of your next book, and for anybody listening to this, the book will very soon be available in early-access preview. It's called "LLMOps: Managing Large Language Models in Production", which takes quite a bit to pronounce all [00:05:00] together. Tell us a little bit about how you came up with the topic of the book, the origin story.

Abi Aryan: I always wanted to write a technical book. Manning approached me back in 2018 to write a book on interpretability. I didn't feel like I was ready to write a book back then. But the person who was the assistant acquisition editor, the person who reached out, is essentially my acquisition editor now.

Abi Aryan: So she's now at O'Reilly. It's a very small world, eventually.

Abi Aryan: In terms of what inspired me to write a book, especially on this topic and not any other topic per se, it was basically seeing that shift. As the scale is increasing with these models, yes, they're not really good at doing discriminative tasks right now, but because these are generalized models, we'll eventually be expanding on their capabilities, and these models are not getting smaller anytime soon.

Abi Aryan: Because of the scale of these models, there will be a few questions for people to be [00:06:00] asking, because we're interacting so closely with these models compared to before. Earlier, the people who were mainly interacting directly with the model were machine learning engineers and data scientists. Now these models come in the form of chatbots that entire user bases are interacting with, and people are playing with them, trying to hack them. So while the field of security operations wasn't super relevant for a lot of other companies before, now it has become the center of the show in a way. Everybody can build a large language model, and that's one of the core differences. In MLOps, the focus was more like: how do we build a model?

Abi Aryan: How do we host it and deploy it in production? How do we self-serve these models? Now the focus has shifted. It's so damn easy to build a large language model for your particular application that you may not have to build it from scratch. You don't need to train a model from scratch.

Abi Aryan: You can put wrappers around it. You can put guardrails [00:07:00] around it. You can still fine tune it. You can integrate it with a RAG system and use it for your particular use case. So for me, it was understanding that there's a big market of people who were software engineers and didn't really have access to machine learning systems because they didn't have the skill set. Machine learning has always been posed as: 'oh my God, you need to know linear algebra to understand how these models work'. Now they can just get an API key and implement a machine learning model. That ease of use means the entire software engineering community, or anybody who can code, will now be able to build or host their own machine learning model, and in this case specifically, large language models. I recognized that shift. [00:08:00] And I was like, this is a substantial shift. This technology is no longer limited to a very small set of people. So many people can use it. So many people can build on it. The market has expanded, and these models do present additional challenges as well.

Abi Aryan: This was the right point for us to write a book, because it's a way bigger market than before.

Miko Pawlikowski: There's something really scary about putting a model in production and letting clients talk to it when you never know what it's going to do for sure. Okay, I understand the shift that you're describing. So who came up with the term LLMOps?

Abi Aryan: Basically, when I sent my proposal, I used that term. It was last year, in February. The proposal was sent to a couple of reviewers, who were like, 'oh, we don't think it's going to stick'. And then eventually, I think, Weights & Biases came up with their own blog post on what's really the difference; then Arize came up with their own blog post on the difference between LLMOps and MLOps. And eventually [00:09:00] everybody was like, 'oh my God, this term is sticking'. By then we had already signed the contract with my editor; she took a gamble on me. I said this is going to stick, because it's a substantial shift in what we're trying to do. In MLOps, the focus was different; here, the focus is different. The amount of outages, the amount of reliability issues, the unreliability of these models, is way higher than with the conventional machine learning models we were using. There are very few people who were doing distributed training, who understand that scope of problems. The engineering was not really done at the scale that is being done right now for large language models. I feel like this is going to be a big thing where there needs to be education.

Abi Aryan: And I basically went in to create that education in this space. If I was to start in the field today as a 17 or 19 year old, what would [00:10:00] I want to learn? I come from a background in maths and computer science and statistics, but I don't want people to feel like, 'oh, that's a barrier for me to get into machine learning'. No, that's not really a barrier.

Miko Pawlikowski: Right. And so, like we said, the book is still a little bit out; it's coming soon, but it's not available today. What is available is that new report that you authored, 'What Is LLMOps?'. What's that? Basically a way to prepare people to start using the term, to make sure that everybody's on the same page: 'okay, guys, we're doing LLMOps. This is the term. Let's go with it'?

Abi Aryan: So the reason the report came out was that we got very critical reviews early last year from a lot of people who were saying LLMs are not going to stick, LLMOps is not going to stick. So we were like, let's at least tell people what it is, and then if there's enough interest, we'll write the book.

Abi Aryan: We had signed contracts for both of those things, [00:11:00] but we were like, let's test whether people really understand what the difference is. Once people know why this is substantial, then we can take them to: how are you supposed to do it?

Abi Aryan: What the report essentially does... I would probably say we were a little bit late to a market where it had already stuck, which is: people have already understood there's a shift. Companies are building their own large language model or generative AI teams, if I can use that word. The implementation has already started. They've already started looking at the issues. They've started realizing that they need a new discipline. So I'm having talks with a lot of companies, in a consulting capacity, that are trying to figure out how to build a specialized engineering practice around these models. What would the shift look like when it comes to these models? What would the team structure look like? What would the metrics look like? What are the key expectations [00:12:00] that they can set? And if you want to keep investing in the space, how do we justify that investment? How do we make sure we tie these models to our KPIs, given that these models are still a little bit unpredictable, or a few people would call them unpredictable?

Abi Aryan: I don't particularly think they're unpredictable. Any inference that is being made does exist in the probability space of the input data that you're providing. So to me, while they're still very probabilistic models, they're also a little bit untameable, if I can say it in that sense: it's very hard to predict when the model goes off. And it's not because the model is built that way; it's because of the number of people interacting with it, and because the way the models are structured is basically to help the user.

Abi Aryan: There are so many people trying to hack these solutions. You're basically building a product for your [00:13:00] enemies, essentially.

Miko Pawlikowski: Okay. Naming is hard; it's probably the hardest problem in computer science, but we've got a term. I think at this stage we understand what it means. There's a report in case you want to prove to somebody, 'hey, LLMOps means this'; you can just point them to it. And the book is coming out soon.

Miko Pawlikowski: So let's talk a little bit about what LLMOps really is in practice. I'm browsing through your report right now, and I see things like safety, scalability, robustness, the LLM lifecycle. Let's talk about these things a little bit. Where should we start? What's the most painful part of running LLMs today?

Abi Aryan: So I would say the three goals are where we should ideally start: why do we need this field, or why do we need this new practice? The first thing is essentially safety, which is making sure that the model is [00:14:00] playing by the rules. Because, again, it's not just machine learning engineers trying to build on these models today.

Abi Aryan: It's software engineers; it's a lot of other people as well. There needs to be a new playbook for people who are working with these models, because the models do pose a lot of risk. Yes, there's operational risk, but there's also a lot of risk that people don't really understand. It's very easy to integrate code and libraries, but a lot of people don't really think about supply chain risk: if I'm using a package from some website, is the package secure enough?

Abi Aryan: How do I make sure that I'm not installing malware on my system? Those things are not really well understood; the entire field of cybersecurity and security operations was isolated from ML practice, and now it has to become tightly integrated into it. The second thing I would say is [00:15:00] scalability, which is basically making sure that the model does scale to the number of people interacting with it.

Abi Aryan: We're essentially going from a situation where maybe a couple of people were interacting with these models to a large number of people interacting by the minute. You're not going to open ChatGPT to write one thing, right? You're having a conversation, which may take about five, ten, fifteen, twenty minutes. And these are varying workloads as well.

Abi Aryan: And they're varying workloads from different locations. So we need to think about how we make sure the latency is fine, how we make sure the models are able to deal with the traffic, whether it's usual or unusual, and how we build an architecture that makes sure the model can serve and adapt to those requirements. That's the central thing. But also, because these models are so [00:16:00] huge, every single inference does cost you.

Abi Aryan: So how do we do caching? How do we do load testing? How do we do performance testing? All of those questions become central that weren't really central before. Then the next part is basically robustness. By robust, I mean the model keeps having the same kind of reactions. Conventionally we used to call it reproducibility: you can reproduce what was already there. Robustness is a little bit different. A lot of people are building on closed source, a lot of people are building on open source, but the model's behavior changes with how many people are interacting with it; it's getting a lot of live data as well. So there's some kind of model degradation that happens with time. Also, every single time the model gets updated, the behavior changes, so the entire prompt pipeline that you built can break easily. A lot of companies realized this mid last year: they built these very intricate prompt [00:17:00] pipelines, and then OpenAI does one update and those prompt pipelines don't really work anymore. So how do you build a system that keeps on being predictable in that scenario? Any infrastructure that you build on top of the model shouldn't need to be rebuilt for every single iteration, or every time you're moving from OpenAI to, let's say, Claude or some other model, because you need to keep improving and making sure that you're working with the new data as well.
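One hedged sketch of the kind of guard this implies: pin an explicit model version and re-run a small golden set of prompts whenever the provider ships an update. call_model() is a hypothetical wrapper around whatever API is in use, and the file format is an assumption.

```python
# Golden-set regression test for a prompt pipeline: detect silent breakage
# after a provider-side model update.
import json

PINNED_MODEL = "gpt-4-0613"  # pin a dated version, never a floating alias

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wrap your provider's API here")  # hypothetical

def test_golden_prompts():
    with open("golden_prompts.json") as f:
        cases = json.load(f)  # e.g. [{"prompt": "...", "must_contain": ["..."]}]
    failures = []
    for case in cases:
        output = call_model(PINNED_MODEL, case["prompt"])
        for needle in case["must_contain"]:
            if needle.lower() not in output.lower():
                failures.append((case["prompt"], needle))
    assert not failures, f"{len(failures)} prompt regressions, e.g. {failures[:3]}"
```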

Abi Aryan: So the three questions that come up there are, first, data drift, which is how the input changes over time, based on how people are interacting with the model; that causes one of the shifts in the model behavior. The second is concept drift, which is that new information keeps coming out. A good example would be: Corona used to be a beer brand, so any model built up until, let's say, about 2019 [00:18:00] or 2020 understood Corona as a beer, and it would always frame an answer from that perspective. The models that are being built now understand that it could be the beer or it could be the virus.

Abi Aryan: So that is essentially concept drift: the prime minister, the president, or any new information that comes up which changes the meaning of the inputs we've given, or adds additional information that changes the model behavior as well.

Abi Aryan: And the third is basically model drift, which is the updates of the model itself: how does the retraining of the model affect the performance of your entire infrastructure? For anybody who's building on closed source models: OpenAI, Anthropic and all of these companies are constantly using RLHF techniques to retrain the models substantially.

Abi Aryan: So that does impact the model performance. I would [00:19:00] say these three are the core things at the center. Anybody building with these models needs to ask: is my model safe? Is my model scalable? Is my model robust? If you're not looking at those properties, it's very hard to build a sustainable product around these models.
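A minimal sketch of monitoring for the first of those drifts, input (data) drift, by comparing embeddings of recent traffic against a reference window. The sentence-transformers model and the 0.15 threshold are assumptions to be tuned per application.

```python
# Input-drift check: compare the embedding centroid of live prompts to a
# reference window of past traffic.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def centroid(prompts):
    # embed a window of prompts and average into one traffic "fingerprint"
    vecs = encoder.encode(prompts, normalize_embeddings=True)
    return vecs.mean(axis=0)

def drift_score(reference_prompts, recent_prompts) -> float:
    a, b = centroid(reference_prompts), centroid(recent_prompts)
    # cosine distance between the two centroids: 0 means identical traffic
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

reference = ["How do I reset my password?", "Where is my invoice?"]
live = ["Write me a poem about invoices", "Ignore previous instructions"]
if drift_score(reference, live) > 0.15:  # assumed alert threshold
    print("Input distribution has shifted; re-run evaluations.")
```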

Miko Pawlikowski: That was a lot of information in one go. I've got questions. Imagine you're talking to a five year old software engineer who has never done any AI, just basic software engineering things, as five year olds do. How different is it really, the safety part of it, compared to any other application?

Miko Pawlikowski: The few examples you gave, like using an unsafe library coming from somewhere: every piece of software on earth is going to have the same problem, right? What are the problems that are actually unique to LLMs from the safety perspective, and why?

Abi Aryan: This is one reason I think LLMOps is closer to DevOps [00:20:00] than it is to MLOps. DevOps grew up around so much software: so many frameworks, so many libraries exist out there. Whereas with conventional machine learning models, we were using scikit-learn.

Abi Aryan: So there were very specific libraries that were already tested, and we knew these were secure. We were using TensorFlow, PyTorch and all those ones. Now, because the open source community around LLMs is very similar to the broader software community, a lot of things translate from what DevOps engineers were doing, or what conventional software engineers were focused on, to what LLMOps engineers will be looking at.

Abi Aryan: The key difference now would be: anytime we're doing conventional software engineering, it's a rule-based system where we define what our code is supposed to do. Now we're moving away from a rule-based system, which means the model can create things that are [00:21:00] factually inaccurate as well.

Abi Aryan: Those are things that you really need to cater for. So that's one of the big things: first, how do you deal with biases in the data? Second, how do you deal with factually inaccurate information? For a five year old, maybe it's not that significant, but for anybody doing software engineering: how do you make sure that you're not making decisions based on what the models are generating? For example, if the model says this is how something is supposed to happen, or, for business executives, if it says 'based on the data, this is what the graph looks like', and that happens to be inaccurate, we can't rely on it to make the strategic decisions about what we should be doing next.

Abi Aryan: The models are exposed to that kind of risk, which is usually called the hallucination problem.

Miko Pawlikowski: Got it. I guess it gets much worse when you've got things like autonomous agents, [00:22:00] right? When people directly plug things that have permissions to do things into these LLMs, and we'll see how that goes. Okay, so I buy that argument. Going back to the scalability, that's the bit I don't think I fully understood when you were explaining. Why is it not the same as scaling any other request-response server? How is it different, other than the practical part of it being massive and requiring a lot of resources? Why is it harder to scale an LLM than it is to scale any other application?

Abi Aryan: At this point, you can probably say there are three kinds of applications out there. One is the conventional software piece, right? Anytime we're writing software, we're trying to refactor, making it as small as possible, or making sure that we're defining rules: this is what happens when you do this.

Abi Aryan: This is what happens when you do that. That's how requests are [00:23:00] processed. But conventional machine learning models work differently, and the applications they're used for are entirely different: mostly they're used for internal data capture, for recommender systems or semantic analysis.

Abi Aryan: The people who are interacting with the model outputs are different. Now, because large language models are customer facing, that sets certain expectations: people expect the inference speed to always be high.

Abi Aryan: Now, with that inference speed, when you have so much data you need to retrieve, or you need to run an algorithm to create new information based on whatever the user has given, that is a very hard task. With conventional software engineering, we were writing those breadth-first-search kinds of algorithms, and they were still implemented on small-scale data; it was still very simple compared to the massive databases we have now. [00:24:00] Retrieving data and then generating information, both in real time, is very hard: maintaining the inference speed, maintaining the latency, meeting the targets, and testing other things as well. Before the model passes information to the user, there are guardrails put in, one additional layer, and there are evaluations put in as well, to make sure the person isn't fiddling with the model, or that it's not giving you information that's wrong. The first part is retrieving the data and then generating information; the next part is making sure it passes through all of those layers and still meets the customer's expectations. That's really hard.

Abi Aryan: And also, when demand skyrockets, or when they don't find the information, these systems can easily freeze up. That becomes a very hard problem to solve, because it means the performance would [00:25:00] also degrade. Then the next question is: if a lot of people are using the model for one kind of thing, making sure the model can still answer, doesn't adapt only to those kinds of problems, and can still go into the database, look at a very different problem, and still perform well on it. That's hard. So the real challenges are basically service disruption and availability, and that is usually a little bit harder.

Abi Aryan: Majorly because you need a lot of parallel nodes interacting with these models. And then there are companies hosting several different large language models, because no one large language model is optimal for every single kind of problem, and it may not be cost optimal either. So what's really happening is a microservice kind of architecture, which was prominent in [00:26:00] conventional software engineering, though not so much in MLOps. What really happens there is you're thinking: how am I doing parallel computing with all of these nodes and clusters that I have? How do I do horizontal scaling? How do I make sure all of my resources are being used well, so that one of my clusters isn't sitting idle while another is pushed to the maximum and stalls at some point? Those are the key problems with these models now.
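As one example of the caching mentioned above, here is a sketch of exact-match response caching keyed on the model and prompt. generate() is a hypothetical stand-in for the expensive LLM call; a production system might use Redis, and might layer semantic (embedding-similarity) caching on top.

```python
# Exact-match cache: identical (model, prompt) pairs pay for inference once.
import hashlib

CACHE: dict[str, str] = {}

def generate(model: str, prompt: str) -> str:
    raise NotImplementedError("the expensive LLM call goes here")  # hypothetical

def cached_generate(model: str, prompt: str) -> str:
    key = hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
    if key not in CACHE:
        CACHE[key] = generate(model, prompt)  # only pay on a cache miss
    return CACHE[key]
```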

Miko Pawlikowski: I get that, but I'm still not entirely sure why it's any harder than any other piece of software. All the things you just mentioned about scaling clusters, high availability, high throughput, making sure that all those things happen: those are problems we've had for decades, and all the other systems have solutions in place, right? If you look at any enterprise-ready application, there are layers and layers of things, and they keep [00:27:00] working. So where is the actual difference coming from? Is it because your query, your prompt, is nondeterministic in terms of the resources and time it takes to answer? Is that really the biggest wrench in the works here?

Abi Aryan: So the biggest wrench is essentially just the size of what it needs to query. The models have gone from a few million parameters to now a trillion parameters. And every single time, you need to look across a trillion parameters' worth of information and then generate new information, making sure you're not overly relying on copying from a single source, but building up a response from all the different sources of information you have available. That is time consuming.

Miko Pawlikowski: Got it. Okay.

Abi Aryan: It is the unpredictable nature of it: you don't know how much [00:28:00] data you're going to have to pull in, basically.

Miko Pawlikowski: Okay. And then you talked a little bit about the robustness. If I understood correctly, you said something about: as people use these models, they deteriorate?

Abi Aryan: Any single time people are interacting with these models, what really happens is we're asking certain kinds of questions over a period of time. As the model answers those particular kinds of questions, it starts learning new information, and over a period of time, it starts forgetting other information.

Abi Aryan: Consider this: you basically started in high school, and you learned a couple of subjects: social sciences, physics, chemistry and everything. But now you're doing software engineering; that's the thing you're doing all day long. Now, if I was to ask you a chemistry question, it would take you a very long time, you'd [00:29:00] have to think, and you may not be able to answer accurately, compared to when you were exposed to that information on a daily basis. That's the same thing happening with large language models: based on the kind of interactions they're having, based on the inputs they're getting from the users, they can drift in a particular direction.

Miko Pawlikowski: You're completely blowing my mind. What I thought was happening is that once I've got a model trained, let's say I download LLaMA-3 and run it on my computer, those were static weights that didn't budge anymore, right? I was just sending a query through it, getting some kind of output, and my model itself wasn't changing over time at all.

Abi Aryan: That's if you're implementing the model as is. But the moment you implement it in production, it changes, because now you're integrating real [00:30:00] data sources as well. And also with the LLaMA models, if we keep interacting with the model for a couple of hours, it will look at the queries that were made previously to answer your questions quickly over that period of time, based on the last interactions, essentially.

Abi Aryan: There's a particular reason for that. It's basically the same thing that happens in our brains, where the same neurons get fired over and over again. As some information gets activated over and over again, as some weights are called over and over again, those weights get higher priority eventually.

Miko Pawlikowski: Okay, so why can't you just have a static set of weights for this model and not adjust them, so that you don't have that problem? Why is that not enough?

Abi Aryan: Because then it wouldn't be able to do domain adaptation. It may work fantastically well for the dataset you've provided it with, but if you [00:31:00] need to do something on top of that, implemented for your use case, it can't really do it. And then again, there's the big question around behavior: if we want it to behave in a certain way, to answer questions in a certain way, it wouldn't have those capabilities either. The whole RLHF thing, where we teach the model 'this is wrong, this is right', doesn't happen. So there's essentially no learning happening; the performance is static. It could be bad, and it will deteriorate over time, simply because the model isn't able to generalize further for you. It doesn't generalize with you as a person.

Miko Pawlikowski: I'm asking because I thought you would just update the model with new data and have a fresh, fine tuned, updated version here and there, and you would just replace it. But if you're telling me this is how most people run these models, then I understand why this [00:32:00] is so scary. Not only do you have this model, but also people interacting with it; they can break it, they can find a new way of going around your...

Abi Aryan: hacking the model as well. Yes.

Miko Pawlikowski: And by design, you want it to be malleable and every conversation it has with someone is actually changing the model. That's like triple scary.

Abi Aryan: That's essentially why you need a new framework, or why you need a new field. That was the core inspiration for me: okay, this field has gotten a little bit harder than it used to be.

Miko Pawlikowski: Okay. Not to ask you for spoilers or anything, but what can you do about that? What's your book going to introduce to make this stuff better?

Abi Aryan: I don't think I can make the stuff better, but if you can measure something, then you can improve it, or you can see whether something that's happening is an outlier [00:33:00] as well. So what my book really does is give you ways to measure things. Instead of just thinking about security as 'okay, I need to do X, Y, Z things', it gives you a systematic framework to think about evaluations. Instead of implementing X framework or Y framework, let's say ROUGE or BLEU score, or anything that comes out tomorrow in the market, you really need to understand: what am I essentially doing?

Abi Aryan: Why are these scores really helpful? What are their limitations? Where do they fail? What are the new things that can be implemented, and what properties do those new things need to have? So I'm building the field from first principles: understanding what you really need. For a lot of the things I'm introducing in the book, there isn't really a framework, there isn't really a technology out there. For a lot of things, I say there can be software built [00:34:00] around it; nobody has built it yet.

Miko Pawlikowski: Okay, that sounds like a good first step. Can I ask you for a nutshell version of what the lifecycle of an LLM, a modern one that you would see in production somewhere right now, typically looks like? I'm just realizing I have holes in my understanding; the concept drift bit just blew my mind. Can you walk me through what happens from the moment a company decides 'okay, we need a model to do this, because we really want our customers to talk to something online': how you add the domain knowledge to it, how you evaluate it, and how you integrate, deploy and monitor the whole thing?

Abi Aryan: Let me be very precise in saying this: the first step for anybody implementing these models is to use a toy model, or something which already [00:35:00] exists, implement it as is, and build evaluation metrics around your problem. So instead of trying to fine tune your model, or giving it new data, just implement the model as is, use ChatGPT or something, and build evaluation metrics: what was I trying to measure? How is the model performing on these kinds of tasks? Breaking those things down is the first step.
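A sketch of that first step under stated assumptions: complete() and score() are hypothetical stand-ins for the off-the-shelf model call and whatever metric fits the task.

```python
# Step one of the lifecycle: run an existing model as-is over a task set
# and record per-task scores as the baseline, before any fine tuning.
import csv

def complete(prompt: str) -> str:
    raise NotImplementedError("call the off-the-shelf model here")  # hypothetical

def score(output: str, reference: str) -> float:
    raise NotImplementedError("plug in your task metric here")  # hypothetical

def baseline_eval(tasks, out_path="baseline_scores.csv"):
    # tasks: iterable of (prompt, reference) pairs
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["prompt", "score"])
        for prompt, reference in tasks:
            writer.writerow([prompt, score(complete(prompt), reference)])
```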

Abi Aryan: Then it gets a little bit more intricate. Once you realize these are the holes, or this is the data the model needs to be able to answer questions about my company particularly, or about my product specifically, you're going into data engineering: what is the additional data I can provide to the model itself? And once you've done that, there's the whole pipeline of data [00:36:00] engineering that goes in, where you need to think about: how do you manage the noise? How do you augment the data?

Abi Aryan: How are you tokenizing the data? How are you making sure there's no bias or toxicity in the data? And how do you make sure the model doesn't memorize something? The way models memorize information is that some of the information occurs a lot of times.

Abi Aryan: The fix for that is what's called data deduplication: making sure there's no duplicated data going into the model itself. Then, how do you sanitize the data: making sure that any user information or other private information is removed from the data you're providing to the model.
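A minimal sketch of those two steps, deduplication and sanitization; the regexes are illustrative minimums, not a complete PII scrubber.

```python
# Deduplicate records (so repeated text can't be memorized) and scrub
# obvious PII before the data reaches the model.
import hashlib
import re

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
PHONE = re.compile(r"\b\+?\d[\d\s().-]{7,}\d\b")

def sanitize(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

def dedup_and_sanitize(records):
    seen, out = set(), []
    for r in records:
        h = hashlib.sha256(r.strip().lower().encode()).hexdigest()
        if h not in seen:            # drop exact duplicates
            seen.add(h)
            out.append(sanitize(r))
    return out
```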

Abi Aryan: So once you have a set of evaluation metrics, the next stage for the company is to implement the data engineering pipeline, then use the same model on it, and then do evaluation. [00:37:00] Once you've done evaluation on that, the next step is letting people interact with the model. But before that, set up orchestration, deployment, and monitoring solutions on it, so that you can measure what interactions people are having with these models, essentially.

Abi Aryan: Then, if something goes wrong on security, you can catch things quickly and turn things off, right? Or if there are a lot of people interacting with the model, you can plan for next time: okay, I need to allocate this many resources, or these are the kinds of interactions people are having with the model.

Abi Aryan: Essentially, once you've gone through stage two, in stage three the full pipeline is: you're doing data engineering, then you have an LLM router which chooses the best base model, or foundation model, for you. That really depends on the kind of prompt; different prompts can use different kinds of models. Let's say the person [00:38:00] is asking an algorithmic question; then ideally a model trained on mathematical information would be much better. And the other consideration is that you don't always need to use the expensive model.

Abi Aryan: Sometimes you can get away with providing a more generalized answer. If the person is asking a very simplistic question, you don't need to use ChatGPT; you want a system that automatically sees the prompt and says: for this one, I can run inference on LLaMA-2 instead; for this kind of prompt, I can run inference on such-and-such model.
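In its simplest form, that routing decision can be a deterministic heuristic. A sketch, with the model names and thresholds as illustrative assumptions:

```python
# Cost-aware routing: cheap model by default, expensive model for hard prompts.
CHEAP, EXPENSIVE = "llama-2-13b", "gpt-4"          # assumed model names
HARD_HINTS = ("prove", "algorithm", "derive", "step by step")

def pick_model(prompt: str) -> str:
    text = prompt.lower()
    looks_hard = len(prompt) > 500 or any(h in text for h in HARD_HINTS)
    return EXPENSIVE if looks_hard else CHEAP

print(pick_model("What are your opening hours?"))  # -> llama-2-13b
```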

Abi Aryan: The step after that, once you've done all of this, is doing domain adaptation on the model. That can be done in a lot of ways: you can implement prompt engineering pipelines using frameworks like DSPy, or you can implement RAG pipelines to introduce more information to the model without having to retrain it, or you can do fine tuning. You essentially [00:39:00] do fine tuning when you want to change the behavior of the model, how it provides information for you.

Abi Aryan: Prompt engineering is more like putting a wrapper on: it's very similar to saying we want the input to be shaped like this. When you're making structural changes to the input, that's prompt engineering. But the moment you say 'it's not the input structure I want to change; I want the model to process this information in a different way', then you're essentially doing fine tuning.

Abi Aryan: So: data engineering, then implementing an LLM router, then doing some sort of domain adaptation, then evaluation, and orchestration as well. Orchestration is more the piece of how you tie the different software components together: how are you doing CI/CD on it? You're optimizing things there to be able to reduce the inference latency.

Abi Aryan: Then the next step is doing security and reliability [00:40:00] engineering, which I've not really seen a lot of companies do. The companies working in banking have already started working very heavily on it, because they had the existing infrastructure for extensive security and reliability engineering, and a few others were doing it, the big tech companies, but the more generalized, normal companies weren't.

Abi Aryan: Now that has become one of the core stages. The next step is deployment and monitoring. Once you've done all of that, the end user is interacting with the model, and because you've implemented monitoring solutions, you're learning things. Now you're making additional changes on security as well. You're learning from the customer interaction data and feeding it back into the database, so there's a data [00:41:00] flywheel loop that goes back into the data engineering stage itself.

Miko Pawlikowski: A few questions. The router... it's an interesting one, because in my mind, if I talk to different models with every query or every follow-up, I might get different behaviors, right? Isn't that a problem? If you sometimes route to a cheap, funky model because you want to save some money, and sometimes it goes to ChatGPT, the quality of my responses might vary significantly. Is there a good way to work around that, or is that just how it is?

Abi Aryan: I think as long as your infrastructure is monitoring which output came from which model, it's actually ideal, because then you can compare the performance of different models on different kinds of queries, pick which prompts, or even which models, you should be using, and decide when to sideline a particular model within your router [00:42:00] solution as well.

Miko Pawlikowski: And what does a router like this actually look like? Is it a deterministic algorithm? Or is it another model? Is it turtles all the way down?

Abi Aryan: I've come across, I think, probably two companies that have built a semantic router. They're looking at the semantics of the prompt itself and, based on resource limits set by the client, which could be the company itself,

Abi Aryan: they're picking a particular model at that point in time. So those are very deterministic solutions. I've not really seen non-deterministic solutions put into play, where you could actually use a large language model as a routing solution, or use a decision tree as an LLM router. I've not really seen those kinds of implementations yet.
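A sketch of what such a deterministic semantic router might look like, assuming sentence-transformers for the embeddings; the route descriptions and model assignments are invented for illustration.

```python
# Semantic routing: embed the prompt, match it to the nearest route
# description, and pick that route's model, subject to a budget cap.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

ROUTES = {  # route description -> model assignment (illustrative)
    "mathematical or algorithmic question": "math-tuned-model",
    "general chit-chat or simple lookup": "small-cheap-model",
    "long-form writing or analysis": "large-expensive-model",
}
route_vecs = encoder.encode(list(ROUTES), normalize_embeddings=True)

def route(prompt: str, budget_ok: bool = True) -> str:
    v = encoder.encode([prompt], normalize_embeddings=True)[0]
    best = int(np.argmax(route_vecs @ v))       # cosine similarity via dot product
    model = list(ROUTES.values())[best]
    if model == "large-expensive-model" and not budget_ok:
        return "small-cheap-model"              # enforce the client's resource limit
    return model
```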

Miko Pawlikowski: What about the evaluation? It sounds straightforward, but in practice, how do you evaluate freestyle text? Do you get people to look at the responses and [00:43:00] compare, 'oh, I like this one better', like the Chatbot Arena? Or are there more scientific ways of comparing different models?

Abi Aryan: There are more scientific ways of comparing different models, because you're looking at so many things. You're looking at whether the model is engaging, whether the model is aware of that particular domain. Is the model really good at question answering? Is the model good at recognizing when it's giving a response that's off? So picking the right model is often a little bit harder for that reason. But essentially, when you're building an evaluation pipeline for yourself, think about what your model is doing. Are you building a model that is heavily focused on retrieval only? Or are you building a model that's heavily focused on generation only?

Abi Aryan: Both problems can be broken down. Retrieval needs its own metrics: these can be context recall, context [00:44:00] precision, and basic recall and precision as well. For the more generative use cases, you need different metrics. To test generative performance, you have the n-gram metrics, the BLEU and ROUGE scores that people used to implement with conventional NLP models. Then you have SemScore, which is basically looking at the semantic similarity of the output using a base transformer model, essentially.

Abi Aryan: So BERTScore, SemScore, MoverScore: these are essentially called similarity scores. And then there is LLM-based scoring as well. So there are three different categories. If somebody wants to learn more, I'll leave this for the listeners: I have a talk on this specifically, an O'Reilly Superstream that I did.

Abi Aryan: So that will give you like a really good framework to think about this. which is how to do [00:45:00] evaluation, how to think about it super systematically, where, this is the actual number that I'm supposed to get, which is if it's above 0. 5, if this number is above 0. 7, then I need to optimize.?

Miko Pawlikowski: Got it. Abi, do you think we could make it a little bit more concrete and go through this with some examples?

Miko Pawlikowski: Imagine that you own Reddit, right? You've got all these people talking about all these different topics, and they tend to be useful in some domains. Let's say that you wanted to build a model that you can chat with, that basically knows all the things people on Reddit talk about. If you wanted to build a proof of concept, a model that can answer queries about that, how would you go about it?

Abi Aryan: A very simple approach would be to use a similar model. We're looking at Reddit conversations specifically, right? So what that [00:46:00] essentially is, is an internet website that has a lot of textual information, written by people, on a lot of different topics, and in a lot of different languages as well, though I'm not entirely sure about the language part.

Abi Aryan: So what I would do is look at Hugging Face for models that are trained on conversational data, or on Substack kind of data, where people are answering questions. Ideally, a model trained on that kind of information would be my base model. Then the next step would be scraping data from Reddit, essentially.

Abi Aryan: So that would be the next step: building my own dataset pipeline from Reddit and doing fine tuning with that. Those would be the first two steps, and then the whole evaluation, security, and all of those things stay consistent across all of the models, essentially.
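A sketch of that dataset step: turning scraped question-and-answer threads into instruction-tuning records. The field names, score filter, and JSONL format are assumptions for illustration.

```python
# Convert scraped Reddit-style threads into JSONL fine-tuning records.
import json

def build_dataset(threads, out_path="reddit_sft.jsonl"):
    with open(out_path, "w") as f:
        for t in threads:
            # keep only threads with a reasonably upvoted answer (assumed field)
            if t.get("answer_score", 0) < 10:
                continue
            record = {
                "prompt": t["question"].strip(),
                "completion": t["top_answer"].strip(),
            }
            f.write(json.dumps(record) + "\n")
```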

Miko Pawlikowski: Got it. So in theory you could take a LLaMA and then [00:47:00] fine tune it on all of Reddit's data, and hopefully that would give you something to start with, and then you'd have to worry about evaluating it and all the other things. All right. I think that probably gives our listeners enough to eagerly await your book and wonder when they're going to be able to actually read the whole thing, or maybe buy it off Amazon. Is there an ETA at the moment that we can give them?

Abi Aryan: The early release should happen sometime next month; we're already in May, so it should happen sometime in June. The whole book is supposed to be available by the end of the year.

Miko Pawlikowski: Awesome. All right. And before I let you off the hook, I think you might have seen this coming: I'm going to ask you for some predictions, obviously with all the caveats about how difficult that is, and that past performance [00:48:00] is not a guarantee of future gains. Where do you see all of this going?

Abi Aryan: I see more people using generative models than the number of people who were using them before. One of the big shifts that is going to happen is in the productivity people are getting from these models. It could be developers; it could be people doing copywriting. So companies are getting smaller, and they will continue to get smaller. The number of companies working with external people or external audits is going to get smaller as well. Going into the future, we'll be seeing that shift towards, you could say, a creator economy.

Abi Aryan: I'm not entirely sure what would be the right word in the specific scenarios where every person is a company. So now, instead of a company being 500, 800, or 55,000 employees, companies will certainly get much, much smaller, because one person is going to be able to do [00:49:00] a lot, and there's a lot of stuff that will be automated, essentially.

Miko Pawlikowski: Along the lines of what Altman was saying, about how he's expecting a single-person unicorn company very soon because of the increased productivity?

Abi Aryan: I think I would agree with that. And very importantly, this is something I've mentioned in chapter one of my book as well: how big is the shift, essentially? There were a couple of surveys done, and it wouldn't be wrong to say that within the next five years, 28% of jobs, at least in some professions, will be eliminated. They may be eliminated in the sense that those people become unemployed for a period of time, because now three people are able to do five people's tasks; again, they've gained more productivity. But I don't think people will be [00:50:00] unemployed for long; there will be more and more companies, essentially.

Miko Pawlikowski: Yeah. The one thing I always wonder about: I remember as a kid reading all these predictions about how these increases in productivity would mean that people work less, like a couple of days a week, and they'd just have all this free time. And people were worrying about how that was going to affect an average person, having so much free time.

Abi Aryan: That's a question one of my friends asked as well: what do you think people will do when full automation really happens? And I don't think there will ever be full automation. There need to be monitoring systems that are always in play; monitoring systems can be automated, but they still need to be fine tuned, and all of that is still going to be done by humans. So you could say humans are transitioning from being workers to being managers.

Miko Pawlikowski: Yeah. I'm still [00:51:00] working probably a similar amount of time, but in a slightly more productive way. I think we had this concept of 'silent promotion' that we were talking about on one of the previous episodes: overnight, everybody who works with code basically went from individual contributor to effectively engineering manager, with junior-equivalent software engineers at their disposal, with tools like Copilot and just chatting to ChatGPT.

Abi Aryan: I have friends who are VCs who are now saying: instead of trying to train an associate right now, to teach them how to look for deals, or how to compile information from different datasets, which could be GitHub, which could be Crunchbase,

Abi Aryan: why not use a model instead, and spend [00:52:00] 50 to 60K on ChatGPT, as compared to hiring a person for that task? So people need to be more autonomously driven, and the people who aren't, I think, may have a problem very soon.

Miko Pawlikowski: That reminds me of that billboard, 'still hiring humans?'. Have you seen that one? It's from one of those companies, what is it called? The one with the telephone AI, where you can call a number. Effectively, the billboard was this massive phone number to call, asking whether you're still hiring humans, and people are calling it.

Miko Pawlikowski: And apparently it can handle a million concurrent phone calls, or some ridiculous stuff like that. And it's convincingly replacing the receptionist, or the booking conversations that you had before. I remember that demo from Google years ago, I'm forgetting what it was called, Duplex or something, when they had a demo where it was making a reservation, and then it never really worked as well as the demo. So we're [00:53:00] effectively reaching that moment now, just with different companies doing it.

Abi Aryan: Maybe this is a realization I have constantly because I have ADHD, but we're interacting with so much software, so much information, which is isolated. What we're essentially doing is trying to remember one thing and implement another thing. So we need systems that can interact with all of these systems and be more like assistants for us.

Abi Aryan: And that's where a lot of people are trying to build agents as well. So from isolated software, we're going towards a system where our software is getting linked; it's becoming an ecosystem that is able to communicate and anticipate our requirements. But the downsides of that are still to be predicted: what happens if it goes off?

Abi Aryan: What happens if somebody hacks into the system? The risk of deploying such systems is really high. [00:54:00] Those are all technical problems that will need to be solved. For that particular reason, I think the field of safety, which is people working in LLMSecOps,

Abi Aryan: and the field of evaluation, which is people doing evaluation and monitoring, are going to be some of the most important jobs, as compared to people doing fine tuning and all of those things. Those will continue to be important, but the models we get from other companies will eventually, with time, become good enough that we may not need to do a lot of those things manually. A lot of the work of a machine learning engineer or data scientist will get automated as well.

Miko Pawlikowski: Do you worry about other things that might go wrong with all of this? I don't think that many people are actually worried about Skynet materializing tomorrow. But are there things that you're realistically [00:55:00] concerned about, on a short, maybe two-to-five-year time horizon?

Abi Aryan: Yeah. One of the things that does concern me is how these models are being used by kids, and the kind of risk that generative AI poses to elderly people, who don't really realize the difference between something being generated and something being true, or whether they should rely on it to some extent or not.

Abi Aryan: I think the whole spamming industry got so big, or the whole stealing of people's credit card information got so big, precisely because people couldn't keep up with the technology. The people who are more vulnerable are getting attacked, and they're the people who are most at risk.

Abi Aryan: So what really concerns me is not data scientists or machine learning engineers and their jobs going away. The people I'm most concerned about right now are the people who are vulnerable: [00:56:00] kids, and elderly people who will give a lot of information to ChatGPT: 'hey ChatGPT, look at my medical details and see what problem I may be having'.

Abi Aryan: My parents are heavily using ChatGPT as well, but they don't really realize that a lot of the information they're giving to the system can be hacked very easily, and there can be phishing attacks. There can be all of those attacks as well, eventually.

Miko Pawlikowski: Yeah, and I think the scale is what scares me the most about it, right? The fact that you can do it at a massive scale. There have always been scammers calling the elderly and scamming them out of their money. But now that you can automate it and scale it up, you could conceivably make it a massive problem.

Abi Aryan: And the second bit of that problem is, when you steal all of that data, you're stealing how the person interacts, because large [00:57:00] language models are so good at impersonating, at learning how a person structures their questions or answers. The same is happening with audio speech synthesis as well: the models are getting much better at learning the intonations and tonal characteristics of different people, and adapting to them. That does expose a lot of risk, because it becomes so easy to impersonate and spread misinformation, or to

Abi Aryan: hurt somebody, if hurt is the right word in that particular scenario. It can impersonate anybody and ask for certain information. It can interact with your child. And there's a lot of information out there, because people are interacting with these models every single minute of the day.

Abi Aryan: And it will get more and more so with all of the systems: Google is now integrating their AI systems into Google Docs, [00:58:00] ChatGPT was already there, and Instagram might very soon integrate this as well. I don't think there will be a world where we can escape generative models as such, and the more we have conversations with them, the more they are learning about our personalities and about everything we're doing on the internet, essentially.

Miko Pawlikowski: Okay. I'm going to ask you for one more prediction, and then I promise I'll let you off the hook. Today it's all about OpenAI this, OpenAI that; we also saw that memo about Google having no moat a year ago. You obviously are deep in the industry. Where do you expect the good stuff to come from: the companies we're used to seeing, your Googles of the world, that don't seem to be doing that well with AI despite being at the forefront for so long, or the different startups that didn't exist a few years ago and are now doing [00:59:00] exceptional things? I'm thinking about places like Midjourney. Where would you pay the most attention?

Abi Aryan: So I would say it will change. For the companies that were able to be monopolies before, it will be very hard now to be a monopoly that easily just by trying to build software. By acquisition, yes, you can be a monopoly, trying to acquire everybody, which is essentially what Google was doing.

Abi Aryan: A lot of people think Google's business was essentially building products. No, they were essentially acquiring all the small companies that were building excellent products, before they became big. And that's a world we're moving further into, because the bigger companies do have the near-infinite resources, compute resources as well, to be able to control the ecosystem.

Abi Aryan: So I would say we will more likely see more monopolies, but those wouldn't be monopolies because they [01:00:00] have an excellent product. Those would be monopolies because they have more access to information and a higher number of resources. The number of small companies, yes, there will be many, but I don't easily assume there will be tons of companies making exits, as compared to becoming the victims of every hype cycle. And I would say we're going through a hype cycle right now, where there's far too much paranoia and far too much excitement, and very little realism about the business value being derived from these models, essentially.

Abi Aryan: so in this hype cycle, there are always a lot of companies that get created. Two years from now, there's a very good chance at least seven out of ten of those companies will die.

Miko Pawlikowski: Okay. And on that optimistic note, we're going to wrap up the episode. My guest, once again, [01:01:00] everybody, was Abi Aryan. You can find her at abiaryan.com. Is that the best place to find you?

Abi Aryan: Yep, that's the one place where you'll find all the information on where I'm giving talks, because that's where I'm presenting bits of information from my book and testing out my material. So it's the best place to find information about me, or to find my social media if the links change. But otherwise, I'm @GoAbiAryan on Twitter, on Threads, on Instagram, on LinkedIn.

Abi Aryan: So I use the same username everywhere. You can find me.

Miko Pawlikowski: There you go. Abi is omnipresent, always watching you on every platform. And the book, once again, is called "LLMOps: Managing Large Language Models in Production", published by O'Reilly. Thank you so much, Abi. Thank you for coming.

Abi Aryan: Thank you so much.

