Choosing a tool for logging, tracing & evals - how to?

LangSmith, Langfuse, or one of the 80+ other competitors? Or write a custom tool? The GenAI tech stack is still in its ferment era - it's pretty hard to decide which tooling (if any) to use.

In his article, Grzegorz Kossakowski wrote: "Unlike traditional software architecture, AI engineering is more about innovating than selecting existing solutions". With that in mind, I wanted to share our rationale for choosing a tool - I hope it helps.

So first, the critical things one must do to know whether the LLM is behaving as expected and to push the product closer to production:

  1. Tracing & logging - We must log model/chain runs so we can debug easily (see what's happening under the hood). This has to live in the cloud, not in the repo, and there must be clear versioning to keep track of each log.
  2. Evals (dev-run & user-run) - We need to create datasets of examples and run evals on them. Local runs during development need to be quick. Ideally you have a nice UI to see the results visually, and results should be tracked over time to see trends, not just "run and forget". A big must-have is a quick eval → trace link: if an eval goes wrong, you need to jump from the eval to the related trace in no time to see what happened.
  3. Collect feedback from clients - So far I find this the biggest flaw of LLM devtools. To make sure the model works correctly, you need human experts who know the domain. But if you invite them to annotate and judge, you can't expect them to work in a dev/admin panel. They're cosmetologists, lawyers, salespeople, recruiters… They need to give feedback on their own data, with minimal friction: non-technical, single-button good/bad reviews, with room for comments and domain feedback. On top of that, user feedback should be logged and coupled with traces, so you can easily turn feedback into a new example in the dataset (a minimal sketch of this linkage follows the list).

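To make points 1-3 concrete, here's a minimal, library-agnostic sketch of the linkage I mean. Everything in it is hypothetical (the `TraceStore` class is a stand-in for whatever backend you pick, not any vendor's API): every run writes a trace with an id, and both eval scores and human feedback attach to that same id.

```python
from __future__ import annotations

import uuid
import datetime as dt
from dataclasses import dataclass, field

@dataclass
class Trace:
    """One logged model/chain run: inputs, outputs, and what happened under the hood."""
    trace_id: str
    name: str
    input: dict
    output: str | None = None
    steps: list = field(default_factory=list)     # prompts, tool calls, retrievals...
    scores: list = field(default_factory=list)    # eval results attached later
    feedback: list = field(default_factory=list)  # human reviews attached later

class TraceStore:
    """Hypothetical stand-in for whatever backend you pick (Langfuse, LangSmith, ...)."""
    def __init__(self):
        self._traces: dict[str, Trace] = {}

    def start_trace(self, name: str, input: dict) -> Trace:
        trace = Trace(trace_id=str(uuid.uuid4()), name=name, input=input)
        self._traces[trace.trace_id] = trace
        return trace

    def add_score(self, trace_id: str, name: str, value: float, comment: str = "") -> None:
        # Evals write back to the trace, so a failing eval links straight to the run.
        self._traces[trace_id].scores.append(
            {"name": name, "value": value, "comment": comment,
             "at": dt.datetime.now(dt.timezone.utc).isoformat()}
        )

    def add_feedback(self, trace_id: str, good: bool, comment: str = "") -> None:
        # The single-button review from a domain expert lands on the same trace.
        self._traces[trace_id].feedback.append(
            {"good": good, "comment": comment,
             "at": dt.datetime.now(dt.timezone.utc).isoformat()}
        )

# Usage: one run, one trace id, and everything else hangs off it.
store = TraceStore()
trace = store.start_trace("answer-question", {"question": "What is our refund policy?"})
trace.output = "Refunds are accepted within 30 days."  # produced by your chain
store.add_score(trace.trace_id, "faithfulness", 0.4, "cites the wrong policy version")
store.add_feedback(trace.trace_id, good=False, comment="We changed this to 14 days in March")
```

The point isn't the class, it's the shape: one trace id that evals and feedback both reference, so "eval went wrong → open the trace" and "bad review → new dataset example" are each one hop.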
Then, what to take into account before making a decision?

  1. Tools are super-early and immature. You can safely assume that you can rely on them in basic use cases, but they’ll need a lot of customization.
  2. Don't start with a custom solution. Start with existing tools for logging/tracing, make sure the experiment runs (from evals) are naturally connected to traces, and so on. Figure out what to customize and what you're missing - maybe write an extension?
  3. Don't assume the tools are great. Use them to the absolute minimum - just the simple features you need - and extend on top of them rather than rewriting their features. To remove friction while looking at data, you'll probably need a custom, domain-specific way of rendering traces, as Hamel Husain wrote in his classic blog post "Your AI Product Needs Evals" (a rough sketch follows this list).
  4. You can always migrate. Assume you can move to another tool later. Migrating eval datasets is trivial (see the JSONL sketch below), so don't overcomplicate the thought process.
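On point 3, "domain-specific rendering" can start very small. A rough, hypothetical sketch (the field names match the Trace sketch earlier; swap in whatever your tool actually exports):

```python
# Hypothetical: turn a raw trace (e.g. dataclasses.asdict() of the Trace above, or your
# tool's JSON export) into something a domain expert can actually read.
def render_trace_markdown(trace: dict) -> str:
    lines = [f"## Run: {trace['name']} ({trace['trace_id']})"]
    lines.append(f"**Question:** {trace['input'].get('question', '')}")
    lines.append(f"**Answer:** {trace.get('output') or '(no output)'}")
    if trace.get("steps"):
        lines.append("**Intermediate steps:**")
        lines += [f"- {step}" for step in trace["steps"]]
    for fb in trace.get("feedback", []):
        verdict = "GOOD" if fb["good"] else "BAD"
        lines.append(f"> {verdict}: {fb.get('comment', '')}")
    return "\n".join(lines)
```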

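And on point 4, a dataset of eval examples is ultimately just a list of records. A sketch of keeping it portable as JSONL (the file path and field names are my own convention, not any tool's format):

```python
import json
from pathlib import Path

# A portable eval dataset: one JSON object per line, nothing vendor-specific.
examples = [
    {"input": {"question": "What is our refund window?"}, "expected": "14 days", "tags": ["policy"]},
    {"input": {"question": "Do you ship to Canada?"}, "expected": "Yes, within 5-7 business days", "tags": ["shipping"]},
]

path = Path("evals/dataset.jsonl")
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

# "Migration" to another tool is then just a loop over these records against its SDK.
with path.open(encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
```
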
We went with Langfuse (YC W23), despite LangSmith being a more mature, feature-rich first mover. Why?

  • it’s open source
  • nice signs of traction, a YC batch, and they're popular on X. Seriously: popularity → integrations. Integrations → speed. And we need speed, oh boy!
  • pricing per observation is waaaay more reasonable for a bootstrapped startup
  • there's an easy way to integrate with LlamaIndex (rough sketch after this list)
  • Marc Klingen & the team are cool guys

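For reference, the LlamaIndex integration mentioned above boils down to registering Langfuse's callback handler. The sketch below follows the pattern from the docs as I remember it; import paths differ between SDK versions, so treat the exact names as an assumption and check the current integration guide:

```python
# Assumes the Langfuse v2 Python SDK and llama-index >= 0.10; exact import paths may differ.
# Langfuse credentials come from LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST,
# and the default LlamaIndex LLM needs OPENAI_API_KEY.
from langfuse.llama_index import LlamaIndexCallbackHandler
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.callbacks import CallbackManager

langfuse_handler = LlamaIndexCallbackHandler()
Settings.callback_manager = CallbackManager([langfuse_handler])

# From here on, index construction and queries are traced automatically.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
print(index.as_query_engine().query("What is our refund policy?"))
```
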
To sum up:

  • let's use the tools in lightweight mode, but not rely on them too heavily.

One thing to keep in mind: RUN EVALS RUN EVALS RUN EVALS RUN EVALS RUN EVALS RUN EVALS !!! RUN EVALS RUN EVALS RUN EVALS RUN EVALS RUN EVALS RUN EVALS !!! RUN EVALS RUN EVALS RUN EVALS RUN EVALS RUN EVALS !!!        
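Since I keep shouting it: "running evals" can start as small as a loop over the JSONL dataset from the sketch above, with the crudest scorer you can defend. Everything here is hypothetical, including `answer()`, which stands in for your actual chain:

```python
import json
from pathlib import Path

def answer(question: str) -> str:
    # Stand-in for your real chain/agent call - wire in your pipeline here.
    return "Refunds are accepted within 14 days."

def exact_match(expected: str, got: str) -> float:
    # Crude but honest; swap in an LLM-as-judge later if the task needs it.
    return 1.0 if expected.strip().lower() in got.strip().lower() else 0.0

records = [json.loads(line) for line in Path("evals/dataset.jsonl").read_text(encoding="utf-8").splitlines()]
scores = []
for rec in records:
    got = answer(rec["input"]["question"])
    scores.append(exact_match(rec["expected"], got))
    # In practice you'd also write each score back to the run's trace (see the earlier sketch).

print(f"pass rate: {sum(scores) / len(scores):.0%} on {len(scores)} examples")
```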

I’m not saying our decision is perfect. But it was made fast. We’ll figure out the rest soon.

Robert Chandler, Raza Habib, Rafal Cymerys - what advice would you give to people figuring out their stack?

Rafal Cymerys

CTO & Founder at Upside | eCommerce | Data Projects

4 months ago

Thanks for the mention, Artur Wala! From what I can see, these tools are evolving so quickly that I'd say just pick whatever seems right to you at the moment and reassess after some time.

Robert Chandler

cto @ wordware

4 months ago

I'd say the most important evaluation to have in the early stages of your AI product development is the domain expert. They're the ones who know what good looks like, and no eval tool will help you get there. Your best bet is to use Wordware (YC S24) and put your PMs in the driving seat so the team can iterate 1000x faster and build something genuinely excellent. Evals are great once you've got something working, for catching regressions in one function when you're adding new features to your agent or exploring a switch to different models!

Jędrzej Kaniewski

AI engineer focused on audio data. If you're interested in voice analytics, voicebots, or something along these lines, please get in touch :)

5 months ago

If what you're developing is a mobile app, then collecting feedback could be as simple as recording a short voice message

Ekaterina Dmitrieva

MSc Data Science & Artificial Intelligence Strategy | emlyon x McGill | Ex Data Analyst and Engineer | Adept at translating complex data into actionable insights.

5 months ago

Thank you for such a summary, Artur! It was very helpful

Grzegorz Kossakowski

Previously Developer Experience and AI @stripe

5 months ago

Great write-up! All tools being immature and not exactly fitting your use case is exactly the reason why I emphasized innovation over shopping around.
