SSVO is the New ETL
I think I've been asked the same question 25 different ways in the last few years, by many different people. It's probably a question you've had yourself. Ready for it?
“When is our messaging good or bad, and how do we make it better?” Sometimes that messaging is advertising, or sales outreach, or political campaign materials, or job descriptions, or content pages, or any number of other things. Whatever the medium, it boils down to basically the same question: “We have a big mess of mostly-unstructured data that represents all the messaging we tried before, and an idea of when it was successful or not (mostly not, by weight). Now - what do we do?”
I can see why it gets asked so much - it's an important question, and a great answer would be worth a lot. But what really happens? Well, for most of a decade, the conversation has gone down a similar, mostly-unsuccessful path. And that path is the following:

1. Ask the stakeholders (or their domain experts) to tell us which features of a message actually matter, so we can structure the data by hand and model against it.
2. Propose a live experiment: put new candidate messages in front of real customers, spend real money, and measure what wins.
And all of this… usually fails on a real customer. Why? Well, when #1 comes up, the answer tends to be “we were hoping you could tell us what the best messages were.” And with #2, they tend not to be willing to put real money on the line until they have a solid belief that the new messages will be worth it.
So what gives? Well, #1 and #2 above are really compromises - partial methods, if you will. As data professionals, here's what we'd really like to do, and almost never could: take every message we've ever sent, place it into some embedded space whose dimensions capture the qualities that actually matter, look at where the successful messages cluster in that space, and then generate new messages from the most promising points - all before spending a dollar on a live test.
Okay, that sounded elaborate, but honestly, it's what would come from combining the best parts of #1 and #2 above, if we could only define that elusive embedded space. And now, suddenly, it seems like we can.
So, the modern Large Language Model (LLM) can really only operate on text that has been put into an embedded space of some kind. On the way into the model, there has to be a structure that takes text of whatever meaning and length and turns it into a vector of numbers. The new LLMs are really, really good, but internally most of them embed at the token level and then add position information on the way through. There are also very good document-level embedders, so it's possible to toss all your text into that machine and then figure out which texts are most similar by looking at which corresponding vectors are closest. But… that's not quite what we're going to do. Technically, we've had decent embedders for a while now, and the problem with just putting all your text in and examining the vectors that come out is that a) it's a fairly big space, and b) it can be difficult to invert the process. So even if you came up with some new vector you think might perform better, it's tough to get a meaningful text output that corresponds to it, especially with models like GPT-4 that don't take that “sort” of embedding as an input.
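For the record, here's what that “toss it all in and compare vectors” baseline looks like in practice - a minimal sketch, assuming the open-source sentence-transformers library and its all-MiniLM-L6-v2 model (both just stand-ins for “some document-level embedder”; the toy messages are made up):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# An off-the-shelf document-level embedder (a stand-in; any embedder works).
model = SentenceTransformer("all-MiniLM-L6-v2")

past_messages = [
    "Limited-time offer: 20% off your first order!",
    "See how Acme cut onboarding time in half.",
    "We noticed you left items in your cart.",
]
proposed = "First order? Take 20% off today only."

# Every message becomes a fixed-length vector, regardless of its length.
corpus_vecs = model.encode(past_messages, convert_to_tensor=True)
query_vec = model.encode(proposed, convert_to_tensor=True)

# "Most similar" is just "closest vector" (here, highest cosine similarity).
scores = util.cos_sim(query_vec, corpus_vecs)[0]
best = int(scores.argmax())
print(f"Closest past message: {past_messages[best]!r} (cosine={scores[best]:.3f})")
```

This finds similar messages just fine; what it doesn't give you is a space you can reason about, or walk backward from a vector to a new message.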
No, what we really want is a space that makes sense for our specific problem, and only represents the specific factors that both vary across, and matter to, our application. And here, the modern LLM has our back again. Because of the multimodal (text+image) inputs, as well as a handful of prompt “tricks,” it's entirely possible to scrape up all of this amorphous/unstructured data and simply ask the model to explain the major differences between a large set of media.
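Concretely, that “just ask the model” step can be as simple as the following sketch. It assumes the OpenAI Python client; the model name, the prompt wording, and the toy corpus are all illustrative assumptions, not a recommendation:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The scraped ads / outreach / postings, as text (toy examples here).
messages_corpus = [
    "Save 20% today with code SPRING20.",
    "Our platform helped Acme grow revenue 3x.",
    "Join 10,000 teams already using our tool.",
]

prompt = (
    "Below is a set of marketing messages. List the major qualities along "
    "which these messages DIFFER from one another (e.g., whether a price is "
    "stated, whether there is a call to action). One quality per line.\n\n"
    + "\n---\n".join(messages_corpus)
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative; any capable chat model
    messages=[{"role": "user", "content": prompt}],
)
qualities = response.choices[0].message.content.splitlines()
print(qualities)
```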
And okay, I lie slightly, because what actually happens if you do that is that you get a set of qualities that vary - but if you try it a second time, you'll get a slightly different set. Stability is an issue: one run claims that the price stated in an ad is important, while another focuses more on whether the message has a call to action or not. But if you:

1. run that extraction several times,
2. merge the qualities that keep showing up across runs (and drop the one-offs), and
3. ask the model to score every message along the surviving qualities,
Then you get a pretty stable “space” of things that actually matter, plus a way to place new proposed messages into that space by figuring out how they compare to the average along those new, “human-readable” dimensions. And that, I think, will be the underlying engine that finally makes possible that “product that everyone keeps asking for but never seems to really get.” In short, I think there's going to be a “new ETL” (which stands for Extract, Transform, and Load, a set of processes that are well enshrined in the data world, and for which there are fairly ubiquitous tools). That is, there will be a new sort of standard product/pipeline toolkit that everyone needs all the time. And I believe it will go something like this:

1. Scrape - pull together all the past messaging (and its outcomes) from wherever it lives, structured or not.
2. Structure - have an LLM surface the stable, human-readable dimensions along which that messaging actually varies.
3. Vectorize - score every message, past or proposed, along those dimensions to place it in the space.
4. Optimize - find where success concentrates in that space, and craft new messages that land there.
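Here's a sketch of the Structure and Vectorize steps, carrying over the same illustrative client and model from the earlier sketch; the function names and the majority-vote threshold are my own assumptions, not a fixed recipe:

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def extract_qualities(corpus: list[str]) -> list[str]:
    """One pass: ask the model to name the qualities on which the corpus varies."""
    prompt = (
        "List the major qualities along which these messages differ, "
        "one per line:\n\n" + "\n---\n".join(corpus)
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [q.strip().lower() for q in lines if q.strip()]

def stable_dimensions(corpus: list[str], runs: int = 7, min_frac: float = 0.5) -> list[str]:
    """Repeat the extraction; keep only the qualities that recur across runs."""
    counts = Counter()
    for _ in range(runs):
        counts.update(set(extract_qualities(corpus)))
    return [q for q, n in counts.items() if n / runs >= min_frac]

def score_message(message: str, dimensions: list[str]) -> dict[str, float]:
    """Place one message in the space: a 0-1 score along each stable dimension."""
    scores = {}
    for dim in dimensions:
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content":
                f"On a 0-1 scale, how strongly does this message exhibit "
                f"'{dim}'? Reply with just the number.\n\n{message}"}],
        )
        scores[dim] = float(resp.choices[0].message.content.strip())
    return scores
```

The exact-string merge is naive (two runs rarely phrase the same quality identically), so a real pipeline would want fuzzy matching or a final LLM pass to deduplicate - but the repeat-and-vote idea is the point.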
There is definitely work to be done before everyone is talking about “SSVO systems for business” (or whatever the actual acronym ends up being). For example, there's still the matter of doing the same scrape+structure “trick” on customer segmentation, etc. But the upside is that this kind of approach is designed to conform to the problem at hand, so things like making a “success probability matrix” (in which customer segments and outgoing messaging are both structured, and you get a best-message-for-each-customer) are very much possible. And just like ETL pipelines, it will require a lot of validation - but ideally, the production of results stays human-legible at every step, and experts ought to be able to gut-check outputs along the entire process.
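To make the “success probability matrix” idea concrete, here's a toy sketch with made-up numbers; in practice the probabilities would come from a model fit on historical outcomes over the structured dimensions of both sides:

```python
import numpy as np

# Both sides structured: customer segments and candidate messages.
segments = ["bargain hunters", "enterprise buyers", "returning customers"]
messages = ["20% off today only", "See the Acme case study", "Your cart misses you"]

# success[i, j] = estimated probability that message j lands with segment i.
# Toy numbers, purely illustrative.
success = np.array([
    [0.31, 0.05, 0.12],
    [0.02, 0.24, 0.03],
    [0.18, 0.07, 0.29],
])

# Best message for each segment is just the argmax along that segment's row.
for i, seg in enumerate(segments):
    j = success[i].argmax()
    print(f"{seg}: send {messages[j]!r} (p ≈ {success[i, j]:.2f})")
```

And because every row and column corresponds to a human-readable segment or message, that matrix is exactly the kind of artifact an expert can gut-check.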
Will this end up being a product? I've been thinking of it more like “something we should probably all get good at,” but sometimes that means a product. Being a data scientist often means trying as hard as you can to rewrite a hard problem into an easy one - to get the “terms of engagement” to match what we have tools for. So whenever we get new tools, there will be some really-intractable problems that we can suddenly bring our best toolkit to bear on. There are definitely a few more of these that we'll see in the next couple of years, and I'm honestly excited about this one. Finally, an answer to the question we keep being asked!