SSVO is the New ETL

I think I've been asked the same question 25 different ways in the last few years, by many different people. It's probably a question you've had yourself. Ready for it?

“When is our messaging good or bad, and how do we make it better?” Sometimes that messaging is advertising, or sales outreach, or political campaign materials, or job descriptions, or content pages, or any number of other things. But it boils down to basically the same question - “We have a big mess of mostly-unstructured data that represents all the messaging we tried before, and an idea of when it was successful or not (mostly not, by weight). Now - what do we do?”

I can see why it gets asked so much - it's an important question, and a valuable one if you can come up with a great answer. But what really happens? Well, for most of a decade, the conversation has gone down a similar, mostly-unsuccessful path. And that path is the following:

  1. We could try to laboriously A/B test with existing or new materials, manually making more and more examples and exploring the space until we find better ones. But we need experts in that task to give us a bunch of options to use, since there isn’t any obvious “direction” to go next. Or
  2. We could attempt a much-more-complicated generative scheme to make new materials (maybe GPT-2/GPT-3 back in the day, or a mad-libs-style scheme), and then try to use optimizer techniques on that generator to achieve the best messaging. But there’s no real guarantee that we’ll be exploring the “right” messaging or changing the things that matter rather than useless ones, so we might be wasting time and money by putting things out there that were always going to do poorly.

And all of this… usually fails with a real customer. Why? Well, when #1 comes up, the answer tends to be “we were hoping you could tell us what the best messages were.” And with #2, clients tend not to be willing to put real money on the line until they have a solid belief that the new messages will be worth it.

So what gives? Well, #1 and #2 above are really compromises - partial methods, if you will. As data professionals, here’s what we’d really like to do, and almost never could:

  • Somehow make an embedded space that is relevant to the problem. In other words, every piece of media needs to correspond to a point somewhere in a multidimensional space. And as importantly, moving around in that space needs to change things that are clear to a human being, and actually relevant to the problem, so that the size of that space isn’t too big. So, the “dimensions” need to be things that really matter, are clearly defined, and probably affect the success rate
  • Put all the existing data into that space, and use the past success to train an ML model to predict where the best messaging probably lives, and give guidance based on that (e.g. feature importance to decide which dimensions actually mattered and which ones didn’t - see the sketch after this list)
  • Use a combination of algorithms (gradient descent, simulated annealing, whatever) to make best-guesses as to where the new “best messages” are going to be, and pair this with some kind of a generator (this was a real sticking point) to create new messaging and keep walking toward the best answers
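Purely as an illustration of that second bullet, here is a minimal sketch, assuming the past messages have already been scored along a few candidate dimensions. Every column name and value below is hypothetical, and the model choice is arbitrary:

```python
# Minimal sketch of "train a model on past success, then read off
# which dimensions carried weight." All data here is made up.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical structured history: one row per past message.
history = pd.DataFrame({
    "formality":          [0.9, 0.2, 0.7, 0.1, 0.8],
    "mentions_price":     [1,   0,   1,   0,   0],
    "has_call_to_action": [1,   1,   0,   0,   1],
    "succeeded":          [1,   0,   1,   0,   1],  # past outcome
})

X = history.drop(columns=["succeeded"])
y = history["succeeded"]

model = GradientBoostingClassifier().fit(X, y)

# Feature importances hint at which dimensions actually mattered,
# and which ones can probably be pruned later.
for name, importance in zip(X.columns, model.feature_importances_):
    print(f"{name}: {importance:.2f}")
```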

Okay, that sounded elaborate, but honestly, it’s what would come from combining the best parts of #1 and #2 above, if we could only define that elusive embedded space. And now, suddenly, it seems like we can.

So, the modern Large Language Model (LLM) can really only operate on text put into an embedded space of some kind. On the way into the model, there has to be a structure that takes text of whatever meaning and length, and turns it into a vector of numbers. And the new LLMs are really, really good, but internally most of them embed at the token (sub-word) level and then add position information on the way through. And there are very good document-level embedders, so it’s possible to toss all your text into that machine and then figure out which texts are most similar by looking at which corresponding vectors are closest. But… that’s not quite what we’re going to do. Technically, we’ve had decent embedders for a while now, and the problem with just putting all your text in and examining the vectors that come out is that a) it’s a fairly big space, and b) it can be difficult to invert the process - so even if you came up with some new vector you think might perform better, it’s tough to get a meaningful text output that corresponds to it, especially with models like GPT-4 that don’t take that “sort” of embedding as an input.
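For concreteness, here is what that generic “embed everything and compare vectors” route looks like. This is a minimal sketch assuming the sentence-transformers package, with an arbitrary model choice and made-up messages - useful for similarity, but not the problem-specific space we actually want:

```python
# Sketch of the generic "embed everything and compare vectors" approach
# described above (not the problem-specific space we're after).
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

messages = [
    "Act now and save 20% on your first order!",
    "We'd love to tell you more about our enterprise plan.",
    "Limited-time discount: 20% off for new customers.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary model choice
vectors = model.encode(messages)  # one generic vector per message

# Most-similar pairs show up as the highest cosine similarities.
print(cosine_similarity(vectors).round(2))
```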

No, what we really want is a space that makes sense for our specific problem, and only represents the specific factors that both vary across our examples and matter to our application. And here, the modern LLM has our back again. Because of the multimodal (text+image) inputs as well as a handful of prompt “tricks,” it’s actually fairly possible to scrape up all of this amorphous/unstructured data and simply ask the model to explain the major differences between a large set of media.
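That “just ask the model” step can literally be a single prompt over a batch of examples. Here is a rough sketch; the OpenAI client and the gpt-4o model name are just one possible backend, and the example ads are made up:

```python
# Rough sketch of the "ask the model to explain the differences" step.
# Uses the OpenAI chat API as one possible backend; any capable LLM works.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ads = [
    "Flash sale! 20% off everything this weekend only.",
    "Our team would be happy to walk you through a personalized demo.",
    "Join thousands of customers who switched and never looked back.",
]

prompt = (
    "Here are several marketing messages, one per line:\n\n"
    + "\n".join(f"- {ad}" for ad in ads)
    + "\n\nList the major qualities along which these messages differ "
      "(e.g. tone, whether price is mentioned, call to action). "
      "Return one short phrase per quality."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)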

And okay, I lie slightly, because what actually happens if you do that is that you get a set of qualities that vary, but if you try it a second time you’ll get a slightly different set. Stability is an issue, and while one run claims that the price stated in an ad is important, another one will focus more on whether the message has a call to action or not. But if you:

  • Do this sort of extraction a bunch of times
  • Take the whole set of extracted concepts across all those extractions and embed them all into vectors
  • Cluster that embedded space of concept vectors and then pull the now-clustered concepts back out and
  • Ask the LLM one more time to summarize the concepts in each cluster into the simplest single description

Then you get a pretty stable “space” of things that actually matter, plus a way to place new proposed messages into that space by figuring out how they compare to the average along those new, “human readable” dimensions (a minimal sketch of that loop follows below). And that, I think, will be the underlying engine that will finally make possible that “product that everyone keeps asking for but never seems to really get.” In short, I think there’s going to be a “new ETL” (which stands for Extract, Transform, and Load, a set of processes that are well enshrined in the data world, and for which there are fairly ubiquitous tools). That is, there will be a new sort of standard product/pipeline tooling that everyone needs all the time. And I believe it will go something like this:
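Here is a minimal sketch of that stabilization loop, picking up after the repeated extraction runs. The embedding model, the cluster count, and the example phrases are all placeholder choices, not a prescription:

```python
# Sketch of stabilizing the extracted concepts: embed, cluster, summarize.
# Assumes `extracted_concepts` holds phrases gathered from many runs of
# the extraction prompt; model and cluster count are arbitrary choices.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

extracted_concepts = [
    "mentions a specific price", "states the discount amount",
    "has a clear call to action", "asks the reader to do something",
    "formal tone", "casual, conversational tone",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = embedder.encode(extracted_concepts)

n_clusters = 3  # in practice, pick by inspection or a silhouette score
labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(vectors)

# Group the raw phrases by cluster; each group then goes back to the LLM
# with a "summarize these into one simple dimension name" prompt.
clusters = {}
for phrase, label in zip(extracted_concepts, labels):
    clusters.setdefault(label, []).append(phrase)

for label, phrases in clusters.items():
    print(f"cluster {label}: {phrases}")
    # e.g. send `phrases` to the LLM and keep its one-line summary
    # as the human-readable name of this dimension.
```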

  • Scrape: Go and pull out all the messy, unstructured, multimodal examples that you have access to
  • Structure: Use techniques like the above to place each one of those into a feature “space” where every dimension is something that actually matters or varies in this context
  • Validate: Use our LLM/NLP toolset again, this time to verify that our space and placements are right. So if some of your letters were formal and others informal, we should have pulled out “formality” as a dimension, and the data should be projected into it correctly. At this stage, we will also prune the space of dimensions that don’t actually matter to overall success, which can happen automatically and is kind of a fringe benefit we get for having done it this way
  • Optimize: At long last, we can finally write a generator that operates on this space, because we have explicit meanings for all the dimensions. Which means that any point we’d like to try has an explicit set of things that can be expressed as a prompt (e.g. “we want an ad that has a picture, mentions cost explicitly, and has an informal personal tone”). Then we can use the various Data Science optimization techniques to generate new campaigns that should be very understandable and also allow continually improving outcomes (see the sketch just after this list)
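To make that last step concrete, here is one possible shape for it, not a definitive implementation: a simple predictor trained over the interpretable dimensions, a crude random search over that space, and a prompt built from the winning point. All dimension names, data, and the prompt template are hypothetical:

```python
# One possible shape for the Optimize step: predict success from the
# interpretable dimensions, search that space, and turn the best point
# into a generation prompt. All names and numbers are hypothetical.
import random
from sklearn.linear_model import LogisticRegression

DIMENSIONS = ["formality", "mentions_price", "has_call_to_action"]

# Structured history: each row is a past message placed in the space,
# paired with whether it succeeded.
X = [[0.9, 1, 1], [0.2, 0, 1], [0.7, 1, 0], [0.1, 0, 0]]
y = [1, 0, 1, 0]
predictor = LogisticRegression().fit(X, y)

# Crude random search; simulated annealing or gradient-based methods
# would slot in here just as well.
best_point, best_score = None, -1.0
for _ in range(1000):
    candidate = [random.random(), random.randint(0, 1), random.randint(0, 1)]
    score = predictor.predict_proba([candidate])[0][1]
    if score > best_score:
        best_point, best_score = candidate, score

# Because every dimension has an explicit meaning, the winning point can
# be phrased directly as an instruction to a generator.
prompt = (
    f"Write an ad that is {'formal' if best_point[0] > 0.5 else 'informal'}, "
    f"{'mentions the price' if best_point[1] else 'does not mention price'}, "
    f"and {'ends with a call to action' if best_point[2] else 'has no call to action'}."
)
print(round(best_score, 2), prompt)
```

The random search is the least interesting part; any optimizer could sit there. The point is that every candidate stays readable by a human before any money goes out the door.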

There is definitely work to be done before everyone is talking about “SSVO systems for business” (or whatever the actual acronym ends up being). For example, there’s still the matter of doing the same scrape+structure “trick” on customer segmentation, etc. But the upside is that this kind of approach is designed to conform to the problem at hand, so something like a “success probability matrix” (in which customer segments and outgoing messaging are both structured, and you get the best messaging for each customer) is very much possible. And just like ETL pipelines, it will require a lot of validation - but ideally, the production of results stays human-legible at every step, and experts ought to be able to gut-check outputs along the entire process.
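And that “success probability matrix” can stay exactly that legible. A toy sketch, with every segment, message, and number made up purely for illustration:

```python
# Toy sketch of a "success probability matrix": rows are customer
# segments, columns are candidate messages, cells are predicted success.
# All labels and values here are made up for illustration.
import pandas as pd

matrix = pd.DataFrame(
    [[0.12, 0.31, 0.08],
     [0.25, 0.10, 0.40],
     [0.05, 0.22, 0.18]],
    index=["students", "small business", "enterprise"],    # segments
    columns=["discount ad", "demo invite", "case study"],  # messages
)

# Best message for each segment is just the argmax along each row,
# and the whole table is easy for an expert to gut-check.
print(matrix.idxmax(axis=1))
```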

Will this end up being a product? I’ve been thinking of it more like “something we should probably all get good at,” but sometimes that means a product. Being a data scientist often means trying as hard as you can to rewrite a hard problem into an easy one - to get the “terms of engagement” to match what we have tools for. So whenever we get new tools, there will be some really-intractable problems that we can suddenly bring our best toolkit to bear on. There are definitely a few more of these that we’ll see in the next couple of years, and I’m honestly excited about this one. Finally, an answer to the question we keep being asked!
