Choosing a tool for logging, tracing & evals - how to?

LangSmith, Langfuse, or one of the 80+ other competitors? Or write a custom tool? The GenAI tech stack is still in its ferment era - it's pretty hard to decide which tooling (if any) to use.

In his article, Grzegorz Kossakowski wrote: "Unlike traditional software architecture, AI engineering is more about innovating than selecting existing solutions". With that in mind, I wanted to share our rationale for choosing a tool - I hope it helps.

So first, the critical things one must do to know whether the LLM is behaving as expected and to push the product closer to production:

  1. Tracing & logging - We must log model/chain runs so we can debug easily (see what's happening under the hood). This has to live in the cloud, not in the repo, and there must be clear versioning to keep track of each log.
  2. Evals (dev-run & user-run) - We need to create datasets of examples and run evals on them. Local runs during development need to be quick. Ideally you have a nice UI to see the results visually, and results should be tracked over time to see trends, not just "run and forget". A big must-have is a quick eval → trace link: if an eval goes wrong, you need to jump from the eval to the related trace in no time to see what happened.
  3. Collect feedback from clients - So far I find this the biggest flaw of LLM devtools. To make sure the model works correctly, you need human experts who know the domain. But if you invite them to annotate and judge, you can't expect them to work in a dev/admin panel. They're cosmetologists, lawyers, salespeople, recruiters… They need to give feedback on their own data, with minimal friction: non-technical, single-button good/bad reviews, with room for comments and domain feedback. On top of that, user feedback should be logged and coupled with traces, so you can easily turn feedback into a new example in the dataset (a minimal sketch of this linkage follows the list).

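To make points 1-3 concrete, here's a minimal, library-agnostic sketch of the linkage I mean. Everything in it is hypothetical (the `TraceStore` class is a stand-in for whatever backend you pick, not any vendor's API): every run writes a trace with an id, and both eval scores and human feedback attach to that same id.

```python
from __future__ import annotations

import uuid
import datetime as dt
from dataclasses import dataclass, field

@dataclass
class Trace:
    """One logged model/chain run: inputs, outputs, and what happened under the hood."""
    trace_id: str
    name: str
    input: dict
    output: str | None = None
    steps: list = field(default_factory=list)     # prompts, tool calls, retrievals...
    scores: list = field(default_factory=list)    # eval results attached later
    feedback: list = field(default_factory=list)  # human reviews attached later

class TraceStore:
    """Hypothetical stand-in for whatever backend you pick (Langfuse, LangSmith, ...)."""
    def __init__(self):
        self._traces: dict[str, Trace] = {}

    def start_trace(self, name: str, input: dict) -> Trace:
        trace = Trace(trace_id=str(uuid.uuid4()), name=name, input=input)
        self._traces[trace.trace_id] = trace
        return trace

    def add_score(self, trace_id: str, name: str, value: float, comment: str = "") -> None:
        # Evals write back to the trace, so a failing eval links straight to the run.
        self._traces[trace_id].scores.append(
            {"name": name, "value": value, "comment": comment,
             "at": dt.datetime.now(dt.timezone.utc).isoformat()}
        )

    def add_feedback(self, trace_id: str, good: bool, comment: str = "") -> None:
        # The single-button review from a domain expert lands on the same trace.
        self._traces[trace_id].feedback.append(
            {"good": good, "comment": comment,
             "at": dt.datetime.now(dt.timezone.utc).isoformat()}
        )

# Usage: one run, one trace id, and everything else hangs off it.
store = TraceStore()
trace = store.start_trace("answer-question", {"question": "What is our refund policy?"})
trace.output = "Refunds are accepted within 30 days."  # produced by your chain
store.add_score(trace.trace_id, "faithfulness", 0.4, "cites the wrong policy version")
store.add_feedback(trace.trace_id, good=False, comment="We changed this to 14 days in March")
```

The point isn't the class, it's the shape: one trace id that evals and feedback both reference, so "eval went wrong → open the trace" and "bad review → new dataset example" are each one hop.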
Then, what to take into account before making a decision?

  1. Tools are super-early and immature. You can safely assume that you can rely on them in basic use cases, but they’ll need a lot of customization.
  2. Don't start with a custom solution. Start with existing tools for logging/tracing, make sure the experiment runs (from evals) are naturally connected to traces, and so on. Figure out what to customize and what you're missing - maybe write an extension?
  3. Don't assume the tools are great. Use them to the absolute minimum - just the simple features you need - and extend on top of them rather than rewriting their features. To remove friction while looking at data, you'll probably need a custom, domain-specific way of rendering traces, as Hamel Husain wrote in his classic blog post "Your AI Product Needs Evals" (a rough sketch follows this list).
  4. You can always migrate. Assume you can move to another tool later. Migrating eval datasets is trivial (see the JSONL sketch below), so don't overcomplicate the thought process.
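On point 3, "domain-specific rendering" can start very small. A rough, hypothetical sketch (the field names match the Trace sketch earlier; swap in whatever your tool actually exports):

```python
# Hypothetical: turn a raw trace (e.g. dataclasses.asdict() of the Trace above, or your
# tool's JSON export) into something a domain expert can actually read.
def render_trace_markdown(trace: dict) -> str:
    lines = [f"## Run: {trace['name']} ({trace['trace_id']})"]
    lines.append(f"**Question:** {trace['input'].get('question', '')}")
    lines.append(f"**Answer:** {trace.get('output') or '(no output)'}")
    if trace.get("steps"):
        lines.append("**Intermediate steps:**")
        lines += [f"- {step}" for step in trace["steps"]]
    for fb in trace.get("feedback", []):
        verdict = "GOOD" if fb["good"] else "BAD"
        lines.append(f"> {verdict}: {fb.get('comment', '')}")
    return "\n".join(lines)
```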

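And on point 4, a dataset of eval examples is ultimately just a list of records. A sketch of keeping it portable as JSONL (the file path and field names are my own convention, not any tool's format):

```python
import json
from pathlib import Path

# A portable eval dataset: one JSON object per line, nothing vendor-specific.
examples = [
    {"input": {"question": "What is our refund window?"}, "expected": "14 days", "tags": ["policy"]},
    {"input": {"question": "Do you ship to Canada?"}, "expected": "Yes, within 5-7 business days", "tags": ["shipping"]},
]

path = Path("evals/dataset.jsonl")
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

# "Migration" to another tool is then just a loop over these records against its SDK.
with path.open(encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
```
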
We went with Langfuse (YC W23), despite LangSmith being a more mature, feature-rich first mover. Why?

  • it’s open source
  • nice signs of traction, a YC batch, and they're popular on X. Seriously: popularity → integrations. Integrations → speed. And we need speed, oh boy!
  • pricing per observation is waaaay more reasonable for a bootstrapped startup
  • there's an easy way to integrate with LlamaIndex (rough sketch after this list)
  • Marc Klingen & the team are cool guys

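For reference, the LlamaIndex integration mentioned above boils down to registering Langfuse's callback handler. The sketch below follows the pattern from the docs as I remember it; import paths differ between SDK versions, so treat the exact names as an assumption and check the current integration guide:

```python
# Assumes the Langfuse v2 Python SDK and llama-index >= 0.10; exact import paths may differ.
# Langfuse credentials come from LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST,
# and the default LlamaIndex LLM needs OPENAI_API_KEY.
from langfuse.llama_index import LlamaIndexCallbackHandler
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.callbacks import CallbackManager

langfuse_handler = LlamaIndexCallbackHandler()
Settings.callback_manager = CallbackManager([langfuse_handler])

# From here on, index construction and queries are traced automatically.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
print(index.as_query_engine().query("What is our refund policy?"))
```
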
To sum up:

  • let's use the tools in lightweight mode, but not rely on them too heavily.

One thing to keep in mind: RUN EVALS RUN EVALS RUN EVALS RUN EVALS RUN EVALS RUN EVALS !!! RUN EVALS RUN EVALS RUN EVALS RUN EVALS RUN EVALS RUN EVALS !!! RUN EVALS RUN EVALS RUN EVALS RUN EVALS RUN EVALS !!!        
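Since I keep shouting it: "running evals" can start as small as a loop over the JSONL dataset from the sketch above, with the crudest scorer you can defend. Everything here is hypothetical, including `answer()`, which stands in for your actual chain:

```python
import json
from pathlib import Path

def answer(question: str) -> str:
    # Stand-in for your real chain/agent call - wire in your pipeline here.
    return "Refunds are accepted within 14 days."

def exact_match(expected: str, got: str) -> float:
    # Crude but honest; swap in an LLM-as-judge later if the task needs it.
    return 1.0 if expected.strip().lower() in got.strip().lower() else 0.0

records = [json.loads(line) for line in Path("evals/dataset.jsonl").read_text(encoding="utf-8").splitlines()]
scores = []
for rec in records:
    got = answer(rec["input"]["question"])
    scores.append(exact_match(rec["expected"], got))
    # In practice you'd also write each score back to the run's trace (see the earlier sketch).

print(f"pass rate: {sum(scores) / len(scores):.0%} on {len(scores)} examples")
```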

I’m not saying our decision is perfect. But it was made fast. We’ll figure out the rest soon.

Robert Chandler, Raza Habib, Rafal Cymerys - what advice would you give to people figuring out their stack?

Rafal Cymerys

CTO & Founder at Upside | eCommerce | Data Projects

4 months ago

Thanks for the mention, Artur Wala! From what I can see, these tools are evolving so quickly that I'd say just pick whatever seems right to you at the moment and reassess after some time.

Robert Chandler

cto @ wordware

4 months ago

I'd say the most important evaluation to have in the early stages of your AI product development is the domain expert. They're the ones who know what good looks like, and no eval tool will help you get there. Your best bet is to use Wordware (YC S24) and put your PMs in the driving seat so the team can iterate 1000x faster and build something genuinely excellent. Evals are great once you've got something working, for catching regressions in one function when you're adding new features to your agent or exploring a switch to different models!

Jędrzej Kaniewski

AI engineer focused on audio data. If you're interested in voice analytics, voicebots, or something along these lines, please get in touch :)

5 months ago

If what you're developing is a mobile app, then collecting feedback could be as simple as recording a short voice message

Ekaterina Dmitrieva

MSc Data Science & Artificial Intelligence Strategy | emlyon x McGill | Ex Data Analyst and Engineer | Adept at translating complex data into actionable insights.

5 months ago

Thank you for such a summary, Artur! It was very helpful

Grzegorz Kossakowski

Previously Developer Experience and AI @stripe

5 months ago

Great write-up! All tools being immature and not exactly fitting your use case is exactly the reason why I emphasized innovation over shopping around.
