How can you get an AI to produce a coherent story from an outline?
Alexis Radcliff
This is the thing everyone seems to want to do, and it's feedback I get a lot about TextSpark too. But even the best textgen models on the market can't do this effectively yet.
I didn't design TextSpark to solve that problem anyway (it's intended as a creativity tool and accelerator rather than a replacement), but the question is a good one, and possible solutions have a lot to offer any AI-aided narrative construction.
I've thought about this problem a lot while working with AI text-gen for the last year or so and I have an as-yet untested model for how it might be done with existing technology.
The primary limitation on this today is AI model prompt sizes.
You can't prompt a model with a section of text that's longer than 1000-2000 words or so. That means that when you ask it for more words, you inevitably lose the context of everything that happened earlier. This doesn't work when many novels run 90k-200k words (and often much more).
If you want the model to produce work that's coherent with the rest of your story, it needs a lot more context: not only the things you've revealed in the story itself, but also things you haven't written yet (or at all).
These things include commonalities of story structure, character details, worldbuilding details, established tropes, and effective pacing. I've often said this could be handled by a combination of a well-trained model and self-referential comparison loops.
In this thread I want to provide a rough sketch of how I think it could work. My thinking here is still pretty nascent and completely untested, but I think it has merit as an avenue for exploration.
Here's the shape of my proposed solution to this problem:
At the core of this solution are four main concepts: a bank of compressed models organized by type, a compression algorithm, a comparative analysis algorithm, and the text generating model.
The base of your story begins with the models, and this is essentially the same as when you're sketching out the details of a book in the planning phase. You have ideas for a plot, conflicts, characters, and a setting.
These are the things that go into drawing up an outline.
I'll use a very simple classic fantasy plot to illustrate my concepts here:
"Villain kills hero's family. Hero goes on quest to find MacGuffin he needs to seek revenge. Hero earns MacGuffin, finding a new family in his friends along the way. Hero kills villain with MacGuffin."
Your context compression algorithm would be a specially tuned model that pulls details out of a short section of text and works through several steps (sketched in code after this list):
- "What story elements are here?"
- "Is this a character, location, object, or event detail?"
- "Put the relevant detail associated with this in the model bank."
This might seem hard, but with a sufficient training set of intentionally constructed examples it really isn't. GPT-2 could probably handle it just fine. You're dealing with small sections of text and basic recognition.
GPT-3 could definitely handle this. This type of categorization is now billed as one of their flagship and approved use cases. Especially with tuning, it would be ace at this.
You don't need to add anything. Whatever is in the initial outline (even the teensy baby one I gave) is enough to get things rolling.
By feeding my example into the compression algorithm, you'd wind up with a context bank holding something like the following objects (laid out as plain data in the sketch below these lists).
In characters you'd have (nested within arrays):
- A villain object: Killed the hero's family, can only be killed with the MacGuffin
- A hero object: Family killed by the villain, wants revenge
In objects you'd have:
- MacGuffin object: Only thing that can kill the villain
In timeline objects you'd have:
- Main plot: Villain kills hero's family, hero goes on quest, hero meets friends, hero gets MacGuffin, hero learns friends are a new family, hero gets revenge on villain
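As plain data, that bank might look something like this (the Entity class and field names are my own invention, just to make the structure concrete):

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    facts: list[str] = field(default_factory=list)

bank = {
    "characters": [
        Entity("villain", ["killed the hero's family",
                           "can only be killed with the MacGuffin"]),
        Entity("hero", ["family killed by the villain", "wants revenge"]),
    ],
    "objects": [
        Entity("MacGuffin", ["only thing that can kill the villain"]),
    ],
    "timeline": [
        "villain kills hero's family",
        "hero goes on quest",
        "hero meets friends",
        "hero gets MacGuffin",
        "hero learns friends are a new family",
        "hero gets revenge on villain",
    ],
}
```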
If a general-purpose model isn't good enough to do this with specific prompting, you can tune sub-models for each model bank use case and invoke them after the top layer identifies the elements.
Now we have models for our story and can begin writing. Your textgen model needs to be tuned specifically to understand object metadata in a format you supply as part of any prompt you send it (this is cake).
You start the story with a special initializer prompt that's something like "Tell me a story about X" with the base model info included to guide the prompt. Then you start cycling that generated text back into the next prompt.
Every time you go to your textgen model to request more story, you get N options for the next 10-30 words, and the feedback loop kicks in. This is where your context analyzer model comes into play: it's tuned differently from your compression model in that it's comparative rather than reductive.
It specializes in asking the question, "Does this text contradict these statements?"
This is also EXTREMELY trainable with prepared data sets. Again, I think GPT-2 could probably do an acceptably good job at this.
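A minimal sketch of that check, reusing the hypothetical llm() wrapper from the compression sketch; the yes/no prompt format is exactly the kind of thing a tuned model would be trained on:

```python
def contradicts(candidate: str, established_facts: list[str]) -> bool:
    """Ask the analyzer model whether new text violates any established fact."""
    facts = "\n".join(f"- {fact}" for fact in established_facts)
    answer = llm(
        "Answer yes or no. Does the new text contradict any of these "
        f"established facts?\n\nFacts:\n{facts}\n\nNew text: {candidate}"
    )
    return answer.strip().lower().startswith("yes")
```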
The analyzer model breaks down the events in your text similarly to how the compression model does but it does a yes/no comparison/evaluation for each aspect of the generated prose.
It would check if character actions, dialogue, or details fit an established character model, if location details match the location model, and if events violate the established timeline model.
It scores each of the N generations and selects the highest scoring of the N sentences to return as the next line of your story, ensuring coherence with the existing known plot details.
Once it selects the generated sentence that violates the fewest established details, it adds the sentence to the story and then re-runs the compression model on the same sentence. Any new details introduced by this text are added into the appropriate model in your bank. Timeline events are logged in your plot structure.
Your compression model would ideally be capable of identifying conflicts within the models as details are added, and if a hard violation occurs (conflicting physical details for a character, like blue eyes and green eyes) it needs to kick back to the prior step and re-run the sentence.
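Pulling the loop together, here's an untested sketch of the select-compress-kickback cycle. It reuses compress() and contradicts() from the earlier sketches; generate_candidates() and has_hard_conflict() are hypothetical stand-ins for the textgen call and the conflict detector:

```python
import copy

def generate_candidates(prompt: str, n: int = 5) -> list[str]:
    raise NotImplementedError("N completions of 10-30 words from your textgen model")

def has_hard_conflict(bank: dict) -> bool:
    raise NotImplementedError("e.g. a character with both blue eyes and green eyes")

def next_sentence(prompt: str, established_facts: list[str], bank: dict) -> str:
    while True:
        candidates = generate_candidates(prompt)
        # Score each candidate by how many established facts it violates;
        # keep the one that violates the fewest.
        best = min(candidates,
                   key=lambda c: sum(contradicts(c, [f]) for f in established_facts))
        # Stage the compression pass so a hard violation can be discarded.
        trial = copy.deepcopy(bank)
        compress(best, trial)
        if not has_hard_conflict(trial):
            bank.clear()
            bank.update(trial)  # commit any new details to the model bank
            return best
        # Hard violation: kick back to the prior step and re-run the sentence.
```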
You continue with this process via cycling prompt loops just like we use for TextSpark or like I've written about elsewhere (https://textspark.ai/blog/scene-priming/). You simply grab the last 1000 words in the story for immediate scene context in your prompt to attempt to continue the story, and add model data to ensure coherence with the larger narrative arc.
With each cycle, the compression model builds your object banks into a sort of encyclopedia of your world, characters, locations, and event timeline, one that grows richer with each loop. But if done correctly, all of this still fits in the prompt.
You can save space here by using the context model to identify which relevant elements are present in the prompt and only grabbing that metadata + the timeline. You don't need all of the characters from your novel... just the ones referenced in this scene.
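A rough sketch of that space-saving prompt assembly, reusing the Entity bank from earlier. Naive substring matching stands in here for the context model identifying which entities appear in the scene:

```python
def build_prompt(story_text: str, bank: dict) -> str:
    # Last ~1000 words give the immediate scene context.
    recent = " ".join(story_text.split()[-1000:])
    # Attach metadata only for entities actually named in that window.
    relevant = [
        f"{entity.name}: {'; '.join(entity.facts)}"
        for entity in bank.get("characters", []) + bank.get("objects", [])
        if entity.name.lower() in recent.lower()
    ]
    timeline = " -> ".join(bank.get("timeline", []))
    return ("Known facts:\n" + "\n".join(relevant) +
            f"\nTimeline: {timeline}\n\nStory so far:\n{recent}\n\nContinue:")
```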
It might also be helpful to have a "chapter" mini-model be generated at the start of each chapter which retains scene specific details and archives itself every time a new chapter begins. That way you're tracking the details in the scene as they shift according to ongoing events.
But by constantly building a representation of every narrative element in your story and self-referentially checking against it in this way, you eventually reach the end of the original plot timeline you established, and the story arc is complete.
Because you've been cross-checking with your model bank the whole way and selecting the best options, all character and plot details of your story should be internally consistent.
But some margin of error will probably still sneak in. So at the end you want to re-run the whole book through the same process with fully-baked model objects.
Because your model objects now tell a complete story, you rate every paragraph within its chapter and story context and use your context model to check whether the paragraph still matches what's in the model banks. Any that fall below a threshold get re-run and re-evaluated for coherence.
You repeat this until all paragraphs are above whatever context threshold produces internally-consistent stories and voila: your book is done, written according to outline.
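That final pass could look something like this sketch, where score_paragraph() and regenerate() are hypothetical stand-ins for the context-model scoring and the re-run step:

```python
def score_paragraph(paragraph: str, bank: dict) -> float:
    raise NotImplementedError("fraction of context checks the paragraph passes, 0-1")

def regenerate(paragraph: str, bank: dict) -> str:
    raise NotImplementedError("re-run generation with the fully-baked model objects")

def revision_pass(paragraphs: list[str], bank: dict,
                  threshold: float = 0.9) -> list[str]:
    revised = list(paragraphs)
    for i, paragraph in enumerate(revised):
        # Re-run and re-evaluate until the paragraph clears the threshold.
        while score_paragraph(paragraph, bank) < threshold:
            paragraph = regenerate(paragraph, bank)
        revised[i] = paragraph
    return revised
```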
Now, if you're thinking this seems computationally expensive and time-consuming, you're correct.
You need at least 3 deliberately tuned models and you're running a number of comparative functions and a compression step for every sentence. The comparative steps are the expensive ones here.
For each loop you have:
- Prompt metadata enrichment (2 AI steps)
- N=5 Text Generations (1 AI step)
- 2-4 comparisons per generated sentence (2 AI steps each, concurrent)
- Text compression (2 AI steps)
All but the first two are on fairly small prompt and response sizes, though.
If we assume a 10-word text generation and a 1000-word prompt, let's say it's 10 seconds for the enrichment and 25 seconds for the generation (ballpark accurate). An average of 3 comparisons per sentence (at, say, 7 seconds each) happens concurrently across your N=5 generated sentences. Let's call that 20 seconds. And then another 20 seconds for compression before the loop starts again.
75 seconds to add 10 words to your book.
You're doing 15 algorithmic comparisons of probably around 1000 tokens, so 15k tokens, and the enrichment and prompt require another 3500 tokens on average (together). Figure another 1000 tokens for proper compression (on the high side but let's say that's what it takes).
This means you need 75 seconds and 19.5k tokens processed for each 10 words you add to your book.
With OpenAI's pricing model for GPT-3, if you want the best engine, it's $0.06/1k tokens processed. But let's assume we can get that down to $0.02/1k because of volume. It's now costing you roughly $0.39 per 10 words or $0.04/word for your book. That's still cheaper than a good ghostwriter!
At 75 seconds per 10 words, you're looking at roughly 208 hours of compute (call it nine days) and about $4,000 to complete a 100k word novel.
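Working those numbers explicitly (same assumptions as above):

```python
seconds_per_loop = 10 + 25 + 20 + 20       # enrich + generate + compare + compress
tokens_per_loop = 15_000 + 3_500 + 1_000   # comparisons + enrichment/prompt + compression
words_per_loop = 10

loops = 100_000 // words_per_loop                # 10,000 loops for a 100k-word novel
hours = loops * seconds_per_loop / 3600          # ~208 hours, call it 9 days
cost = loops * tokens_per_loop / 1_000 * 0.02    # ~$3,900 at $0.02 per 1k tokens

print(f"{hours:.0f} hours, ${cost:,.0f}")        # -> "208 hours, $3,900"
```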
But computers don't sleep.
And you don't need to give it more than a very simple outline to get going. And you can (in theory) have as many of these going at once as you can pay for. Your book would be done in about nine days and wouldn't need a professional editor.
Nine days for a finished novel is faster than all but the most prolific professional authors, indie or tradpub. Plus, every second you shave off the computations multiplies across ten thousand loops.
Furthermore, things change once you realize you can kick off 100 (or 1000) of these processes at the same time. With staggered operations, that picture looks even rosier: you'd be generating several epic fantasy-length novels per day.
Now, just for comparison purposes, let's say you can shave 2 seconds off every computation assumption I made (very achievable). How does that change the picture?
Now it's only 56 seconds per 10 words. And people don't always have patience for 100k words anymore anyway. So let's shoot for 70k for our novel. A nice, readable length. You only need about 109 hours for that. Now you're down to under five days. That's not bad at all!
You don't have to shave much time off your computations before you're producing a novel-length book every few days, still at or under the $4000 price point. And that cost can be brought way down too if you own the hardware running this.
This opens the door to completely custom and original stories with whatever details you want to provide for less than you'd pay a decent ghostwriter, and you don't need an editor.
Lots of people will ask, "But why would someone want to do this?"
Well, an infinite number of potential reasons. Revenue generation is the obvious one, with publishing houses and individual authors and screenwriters in mind.
As for the potential entertainment value, it's massive: Read a story based on a concept you think is cool. Self-insert fiction. Highly specific genre tastes. Characters in stories whose lives mimic yours and who grow up alongside you.
$4000 might be a steep price point for that in a consumer market, but computation costs come down every year and an interested party could negotiate even lower costs at this level of usage (or buy the hardware themselves).
$200 and a week's turnaround for a totally custom and well-written story doesn't sound so bad, and it's easily achievable.
And this is even before we start thinking about how you could expand this same model to generate custom cartoon narratives or 3d-generated movies via the same tech used in deepfakes and simulated personas. We're not more than a few steps away from this. This might be possible today, actually.
Text is the backbone and starting point for all media, so beyond the script (for a TV show, movie, game, or cartoon) it's just a matter of designing a complementary AI model to expand up from the generated plot into scene details and generation.
No need for actors, show-runners, illustrators, voice-overs, editors, directors, etc.
This might sound like science fiction to you, but we're less than 10 years away from the capabilities if not the reality. All the tech is in our hands today. Someone just needs to experiment with it and build the framework.
As for the obvious follow-up question of should we build it, well... I don't know. I don't think it matters if we should. Someone will, because it's cheaper than doing it manually, and it'll change our culture just like every other invention has.
I'd hazard that wrestling with exactly this question is why OpenAI is so wary of throwing open the floodgates for unconstrained text-gen. It's not just fake news that's in play here. It's entire classes of jobs that could be gone or changed within ~10 years.
But the clock is ticking anyway because the conceptual baseline model is pretty well understood. It's just advancing processing speed by a few steps and building additional supportive architecture around it, like the model I'm describing.
The genie has been unbottled. Now we get to ride some lightning.