SSVO is the New ETL

I think I've been asked the same question 25 different ways in the last few years, by many different people. It's probably a question you've had yourself. Ready for it?

“When is our messaging good or bad, and how do we make it better?” Sometimes that messaging is advertising, or sales outreach, or political campaign materials, or job descriptions, or content pages, or any number of other things. But it boils down to basically the same question - “We have a big mess of mostly-unstructured data that represents all the messaging we tried before, and an idea of when it was successful or not (mostly not, by weight). Now - what do we do?”

I can see why it gets asked so much - it's an important question, and a valuable one if you can come up with a great answer. But what really happens? Well, for most of a decade, the conversation has gone down a similar, mostly-unsuccessful path. And that path is the following:

  1. We could try to laboriously A/B test with existing or new materials, manually making more and more examples and exploring the space until we find better ones. But we need experts in that task to give us a bunch of options to use, since there isn’t any obvious “direction” to go next. Or
  2. We could attempt a much-more-complicated generative scheme to make new materials (maybe GPT-2/GPT-3 back in the day, or a mad-libs-style scheme), and then try to use optimizer techniques on that generator to achieve the best messaging. But there’s no real guarantee that we’ll be exploring the “right” messaging or changing the things that matter rather than useless ones, so we might be wasting time and money by putting things out there that were always going to do poorly.

And all of this… usually fails with a real customer. Why? Well, when #1 comes up, the answer tends to be “we were hoping you could tell us what the best messages were.” And with #2, clients tend not to be willing to put real money on the line until they have a solid belief that the new messages will be worth it.

So what gives? Well, #1 and #2 above are really compromises - partial methods, if you will. As data professionals, here’s what we’d really like to do, and almost never could:

  • Somehow make an embedded space that is relevant to the problem. In other words, every piece of media needs to correspond to a point somewhere in a multidimensional space. And as importantly, moving around in that space needs to change things that are clear to a human being, and actually relevant to the problem, so that the size of that space isn’t too big. So, the “dimensions” need to be things that really matter, are clearly defined, and probably affect the success rate
  • Put all the existing data into that space, and use the past success to train an ML model to predict where the best messaging probably lives, and give guidance based on that (e.g. feature importance to decide which dimensions actually mattered and which ones didn’t - see the sketch after this list)
  • Use a combination of algorithms (gradient descent, simulated annealing, whatever) to make best-guesses as to where the new “best messages” are going to be, and pair this with some kind of a generator (this was a real sticking point) to create new messaging and keep walking toward the best answers
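Purely as an illustration of that second bullet, here is a minimal sketch, assuming the past messages have already been scored along a few candidate dimensions. Every column name and value below is hypothetical, and the model choice is arbitrary:

```python
# Minimal sketch of "train a model on past success, then read off
# which dimensions carried weight." All data here is made up.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical structured history: one row per past message.
history = pd.DataFrame({
    "formality":          [0.9, 0.2, 0.7, 0.1, 0.8],
    "mentions_price":     [1,   0,   1,   0,   0],
    "has_call_to_action": [1,   1,   0,   0,   1],
    "succeeded":          [1,   0,   1,   0,   1],  # past outcome
})

X = history.drop(columns=["succeeded"])
y = history["succeeded"]

model = GradientBoostingClassifier().fit(X, y)

# Feature importances hint at which dimensions actually mattered,
# and which ones can probably be pruned later.
for name, importance in zip(X.columns, model.feature_importances_):
    print(f"{name}: {importance:.2f}")
```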

Okay, that sounded elaborate, but honestly, it’s what would come from combining the best parts of #1 and #2 above, if we could only define that elusive embedded space. And now, suddenly, it seems like we can.

So, the modern Large Language Model (LLM) can really only operate on text put into an embedded space of some kind. On the way into the model, there has to be a structure that takes text of whatever meaning and length, and turns it into a vector of numbers. And the new LLMs are really, really good, but internally most of them embed at the token (sub-word) level and then add position information on the way through. And there are very good document-level embedders, so it’s possible to toss all your text into that machine and then figure out which texts are most similar by looking at which corresponding vectors are closest. But… that’s not quite what we’re going to do. Technically, we’ve had decent embedders for a while now, and the problem with just putting all your text in and examining the vectors that come out is that a) it’s a fairly big space, and b) it can be difficult to invert the process - so even if you came up with some new vector you think might perform better, it’s tough to get a meaningful text output that corresponds to it, especially with models like GPT-4 that don’t take that “sort” of embedding as an input.
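For concreteness, here is what that generic “embed everything and compare vectors” route looks like. This is a minimal sketch assuming the sentence-transformers package, with an arbitrary model choice and made-up messages - useful for similarity, but not the problem-specific space we actually want:

```python
# Sketch of the generic "embed everything and compare vectors" approach
# described above (not the problem-specific space we're after).
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

messages = [
    "Act now and save 20% on your first order!",
    "We'd love to tell you more about our enterprise plan.",
    "Limited-time discount: 20% off for new customers.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary model choice
vectors = model.encode(messages)  # one generic vector per message

# Most-similar pairs show up as the highest cosine similarities.
print(cosine_similarity(vectors).round(2))
```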

No, what we really want is a space that makes sense for our specific problem, and only represents the specific factors that both vary across our examples and matter to our application. And here, the modern LLM has our back again. Because of the multimodal (text+image) inputs as well as a handful of prompt “tricks,” it’s actually fairly possible to scrape up all of this amorphous/unstructured data and simply ask the model to explain the major differences between a large set of media.
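That “just ask the model” step can literally be a single prompt over a batch of examples. Here is a rough sketch; the OpenAI client and the gpt-4o model name are just one possible backend, and the example ads are made up:

```python
# Rough sketch of the "ask the model to explain the differences" step.
# Uses the OpenAI chat API as one possible backend; any capable LLM works.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ads = [
    "Flash sale! 20% off everything this weekend only.",
    "Our team would be happy to walk you through a personalized demo.",
    "Join thousands of customers who switched and never looked back.",
]

prompt = (
    "Here are several marketing messages, one per line:\n\n"
    + "\n".join(f"- {ad}" for ad in ads)
    + "\n\nList the major qualities along which these messages differ "
      "(e.g. tone, whether price is mentioned, call to action). "
      "Return one short phrase per quality."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)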

And okay, I lie slightly, because what actually happens if you do that is that you get a set of qualities that vary, but if you try it a second time you’ll get a slightly different set. Stability is an issue, and while one run claims that the price stated in an ad is important, another one will focus more on whether the message has a call to action or not. But if you:

  • Do this sort of extraction a bunch of times
  • Take the whole set of extracted concepts across all those extractions and embed them all into vectors
  • Cluster that embedded space of concept vectors and then pull the now-clustered concepts back out and
  • Ask the LLM one more time to summarize the concepts in each cluster into the simplest single description

Then you get a pretty stable “space” of things that actually matter, plus a way to place new proposed messages into that space by figuring out how they compare to the average along those new, “human readable” dimensions (a minimal sketch of that loop follows below). And that, I think, will be the underlying engine that will finally make possible that “product that everyone keeps asking for but never seems to really get.” In short, I think there’s going to be a “new ETL” (which stands for Extract, Transform, and Load, a set of processes that are well enshrined in the data world, and for which there are fairly ubiquitous tools). That is, there will be a new sort of standard product/pipeline tooling that everyone needs all the time. And I believe it will go something like this:
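Here is a minimal sketch of that stabilization loop, picking up after the repeated extraction runs. The embedding model, the cluster count, and the example phrases are all placeholder choices, not a prescription:

```python
# Sketch of stabilizing the extracted concepts: embed, cluster, summarize.
# Assumes `extracted_concepts` holds phrases gathered from many runs of
# the extraction prompt; model and cluster count are arbitrary choices.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

extracted_concepts = [
    "mentions a specific price", "states the discount amount",
    "has a clear call to action", "asks the reader to do something",
    "formal tone", "casual, conversational tone",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = embedder.encode(extracted_concepts)

n_clusters = 3  # in practice, pick by inspection or a silhouette score
labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(vectors)

# Group the raw phrases by cluster; each group then goes back to the LLM
# with a "summarize these into one simple dimension name" prompt.
clusters = {}
for phrase, label in zip(extracted_concepts, labels):
    clusters.setdefault(label, []).append(phrase)

for label, phrases in clusters.items():
    print(f"cluster {label}: {phrases}")
    # e.g. send `phrases` to the LLM and keep its one-line summary
    # as the human-readable name of this dimension.
```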

  • Scrape: Go and pull out all the messy, unstructured, multimodal examples that you have access to
  • Structure: Use techniques like the above to place each one of those into a feature “space” where every dimension is something that actually matters or varies in this context
  • Validate: Use our LLM/NLP toolset again, this time to verify that our space and placements are right. So if some of your letters were formal and others informal, we should have pulled out “formality” as a dimension, and the data should be projected into it correctly. At this stage, we will also prune the space of dimensions that don’t actually matter to overall success, which can happen automatically and is kind of a fringe benefit we get for having done it this way
  • Optimize: At long last, we can finally write a generator that operates on this space, because we have explicit meanings for all the dimensions. Which means that any point we’d like to try has an explicit set of things that can be expressed as a prompt (e.g. “we want an ad that has a picture, mentions cost explicitly, and has an informal personal tone”). Then we can use the various Data Science optimization techniques to generate new campaigns that should be very understandable and also allow continually improving outcomes (see the sketch just after this list)
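To make that last step concrete, here is one possible shape for it, not a definitive implementation: a simple predictor trained over the interpretable dimensions, a crude random search over that space, and a prompt built from the winning point. All dimension names, data, and the prompt template are hypothetical:

```python
# One possible shape for the Optimize step: predict success from the
# interpretable dimensions, search that space, and turn the best point
# into a generation prompt. All names and numbers are hypothetical.
import random
from sklearn.linear_model import LogisticRegression

DIMENSIONS = ["formality", "mentions_price", "has_call_to_action"]

# Structured history: each row is a past message placed in the space,
# paired with whether it succeeded.
X = [[0.9, 1, 1], [0.2, 0, 1], [0.7, 1, 0], [0.1, 0, 0]]
y = [1, 0, 1, 0]
predictor = LogisticRegression().fit(X, y)

# Crude random search; simulated annealing or gradient-based methods
# would slot in here just as well.
best_point, best_score = None, -1.0
for _ in range(1000):
    candidate = [random.random(), random.randint(0, 1), random.randint(0, 1)]
    score = predictor.predict_proba([candidate])[0][1]
    if score > best_score:
        best_point, best_score = candidate, score

# Because every dimension has an explicit meaning, the winning point can
# be phrased directly as an instruction to a generator.
prompt = (
    f"Write an ad that is {'formal' if best_point[0] > 0.5 else 'informal'}, "
    f"{'mentions the price' if best_point[1] else 'does not mention price'}, "
    f"and {'ends with a call to action' if best_point[2] else 'has no call to action'}."
)
print(round(best_score, 2), prompt)
```

The random search is the least interesting part; any optimizer could sit there. The point is that every candidate stays readable by a human before any money goes out the door.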

There is definitely work to be done before everyone is talking about “SSVO systems for business” (or whatever the actual acronym ends up being). For example, there’s still the matter of doing the same scrape+structure “trick” on customer segmentation, etc. But the upside is that this kind of approach is designed to conform to the problem at hand, so something like a “success probability matrix” (in which customer segments and outgoing messaging are both structured, and you get the best messaging for each customer) is very much possible. And just like ETL pipelines, it will require a lot of validation - but ideally, the production of results stays human-legible at every step, and experts ought to be able to gut-check outputs along the entire process.
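And that “success probability matrix” can stay exactly that legible. A toy sketch, with every segment, message, and number made up purely for illustration:

```python
# Toy sketch of a "success probability matrix": rows are customer
# segments, columns are candidate messages, cells are predicted success.
# All labels and values here are made up for illustration.
import pandas as pd

matrix = pd.DataFrame(
    [[0.12, 0.31, 0.08],
     [0.25, 0.10, 0.40],
     [0.05, 0.22, 0.18]],
    index=["students", "small business", "enterprise"],    # segments
    columns=["discount ad", "demo invite", "case study"],  # messages
)

# Best message for each segment is just the argmax along each row,
# and the whole table is easy for an expert to gut-check.
print(matrix.idxmax(axis=1))
```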

Will this end up being a product? I’ve been thinking of it more like “something we should probably all get good at,” but sometimes that means a product. Being a data scientist often means trying as hard as you can to rewrite a hard problem into an easy one - to get the “terms of engagement” to match what we have tools for. So whenever we get new tools, there will be some really-intractable problems that we can suddenly bring our best toolkit to bear on. There are definitely a few more of these that we’ll see in the next couple of years, and I’m honestly excited about this one. Finally, an answer to the question we keep being asked!
