How Far Are We From Being Able to Generate Whatever 3D Objects On the Fly?
Welcome to my bi-weekly newsletter, “I’ll Keep This Short,” where I navigate the less-traveled paths of AI, building new insight beyond the banal, mainstream chatter.
Step Into a New Dimension
Walking across your living room floor, coffee in hand, you ready yourself to sit down in your nice, relaxing avocado-shaped easy chair for some well-earned rest after a hard day’s work of writing prompts and generating images.
Dall-E 2 Prompt: “3d render of a chair that looks like an avocado digital art”
While you sit there drinking your very real coffee, staring off into space, probably what you don’t think to yourself is, “Whew, I sure am glad that this chair in fact exists in physical reality.”
But in fact that’s precisely where we’re at with the vast majority of AI-generated content on the internet today. We’re a heck of a long way off from creating actual 3D content on the fly. Even the avocado chair above, while it certainly looks three-dimensional, is really a two-dimensional rendering produced by a model trained on two-dimensional snapshots of 3D renders that humans made.
For those who have used 3D modeling software, it’s likely eminently clear what I am talking about. 3D CAD software has been ubiquitous since the 1980s, used to model virtually everything, from furniture here on Earth to furniture on the International Space Station.
Since I’m not sure how familiar most people reading this article are with the nuances of 3D objects vs. 3D pictures in 2D space, I drew a little demonstration below to show what I mean by the difference between illusory 3D objects and real 3D objects.
If you rotate a real 3D object in some kind of software environment, you should be able to see the other side of it. If you rotate a picture of a 3D object, an illusory 3D object, you just see the other side of the picture frame, and the illusory object itself does not change.
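To make that concrete, here is a minimal sketch of my own (not from any of the tools discussed here) using Python and NumPy: rotating a genuine 3D object moves points through depth, so the far side comes around to face you, which is something a flat picture simply cannot do.

```python
import numpy as np

# A toy "real" 3D object: the eight corner points of a cube centered at the origin.
cube = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)], dtype=float)

def rotate_y(points, degrees):
    """Rotate 3D points around the vertical (y) axis."""
    t = np.radians(degrees)
    rot = np.array([
        [ np.cos(t), 0.0, np.sin(t)],
        [ 0.0,       1.0, 0.0      ],
        [-np.sin(t), 0.0, np.cos(t)],
    ])
    return points @ rot.T

# Spin the cube 180 degrees: corners that started at the "back" (z = +1)
# end up at the "front" (z = -1). A flat picture has no z coordinate to swap.
back_corners = cube[cube[:, 2] > 0]
print(rotate_y(back_corners, 180)[:, 2])  # approximately [-1. -1. -1. -1.]
```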
So where the heck are we as a species in terms of being able to generate some sweet, sweet, real 3D objects? There’s got to be tons of uses for text-generative 3D objects, from being able to generate and 3D-print your own personal toe-door-opener things you see in bars, to a plastic bust of Karl Marx that fits over the end of your toothpaste tubes, so that Karl Marx can spit toothpaste onto your brush every night.
What Peak Performance Looks Like
This is what peak three-dimensional performance looks like. You may not like it, but this is one of the earliest and most iconic 3D objects defined purely with mathematics: the Utah Teapot, first modeled in 1975 by Martin Newell, a researcher at the University of Utah.
Short and stout, with a handle and a spout; when you tip it over, you realize it’s actually defined via Bézier curves rather than just a bunch of points manually configured by hand in a grid. Bézier curves are smooth parametric curves defined by simple polynomial functions of a handful of control points, like these. You can imagine how stitching several of these together in a defined way can be used to create objects.
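For the curious, here is a quick sketch (in plain Python/NumPy rather than any particular CAD package) of evaluating a single cubic Bézier curve from four control points; the teapot is built from patches of exactly this kind of math.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=50):
    """Evaluate a cubic Bézier curve defined by four control points.

    B(t) = (1-t)^3 P0 + 3(1-t)^2 t P1 + 3(1-t) t^2 P2 + t^3 P3,  t in [0, 1]
    """
    p0, p1, p2, p3 = map(np.asarray, (p0, p1, p2, p3))
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0
            + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2
            + t ** 3 * p3)

# A single curved, spout-like profile in 2D; the teapot stitches together
# patches of such curves (Bézier surfaces) in 3D.
curve = cubic_bezier([0, 0], [1, 2], [3, 2], [4, 0])
print(curve.shape)  # (50, 2)
```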
Let’s contrast this to an illusory 3D avocado teapot, as interpreted by Dall-E 2, just for kicks:
While cool, it’s a hallucination without any real physical embodiment. That is to say, there isn’t really a 3D point cloud dictating how those shadows fall and how that light bounces off the surface. There would be no way to “rotate” these on the screen; they are purely illusory 3D objects, not “real” 3D objects.
The above gives us a foundational understanding for where 3D graphics came from in the first place. So how about generative 3D objects?
Enter Shap-E
Perhaps you’ve heard about Dall-E; how about Shap-E? Recently, a paper came out from OpenAI researchers called Shap-E, which is a 3D object generator. From the paper, Shap-E is an improvement over a previous model called Point-E. Whereas Point-E modeled point clouds, Shap-E uses something called Neural Radiance Fields (NeRF), which represent a scene as an implicit function. Never mind exactly what NeRF is for a moment.
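To give a rough flavor of what “represents a scene as an implicit function” means, here is a toy Python stand-in of my own, a hard-coded sphere occupancy function rather than the actual learned NeRF network: instead of storing an explicit list of points, you store a function you can query at any 3D coordinate.

```python
import numpy as np

def sphere_occupancy(points, center=(0.0, 0.0, 0.0), radius=1.0):
    """Toy implicit representation of a scene: a *function* that answers,
    for any 3D coordinate, "is there stuff here?", instead of an explicit
    list of points. A NeRF swaps this hand-written rule for a learned
    neural network that also returns color and density for rendering."""
    points = np.asarray(points, dtype=float)
    dist = np.linalg.norm(points - np.asarray(center), axis=-1)
    return (dist <= radius).astype(float)

# The "scene" can be queried anywhere, at any resolution -- there is no fixed grid.
print(sphere_occupancy([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]))  # [1. 0.]
```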
What you get as a result of NeRF, in contrast to point clouds, is something like this:
As opposed to point cloud images, which are very detailed like the following but lack a realistic interpretation of the surface:
My Attempt At Running Shap-E in a Colab Notebook
I was able to render an image of a dog with a HuggingFace demo:
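For anyone who wants to try this themselves, the sketch below shows roughly what that looks like with the Hugging Face diffusers library’s Shap-E pipeline (as of mid-2023; exact arguments and model IDs may have changed, so treat this as a starting point rather than gospel).

```python
# pip install diffusers transformers accelerate
import torch
from diffusers import ShapEPipeline
from diffusers.utils import export_to_gif

# Load the Shap-E text-to-3D pipeline from the Hugging Face Hub.
pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Generate a 3D asset from a text prompt and save a turntable GIF of the rendered views.
frames = pipe(
    "a dog",
    guidance_scale=15.0,
    num_inference_steps=64,
    frame_size=256,
).images[0]
export_to_gif(frames, "shap_e_dog.gif")
```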
Mathematical Background
Point-E Math
I'm skipping the math section for the LinkedIn article. To view the math, go to the Substack version of this article.
Point-E Result
As a result, Point-E was able to generate models which are very detailed, like the following avocado chair:
Shap-E Math
I'm skipping the math section for the LinkedIn article. To view the math, go to the Substack version of this article.
Shap-E Result
Shap-E was able to generate models which were “pleasing” and “smooth,” and which, unlike Point-E’s, did not skip out on parts of the model, like the following:
What About Just Writing Rendering Code with a Large Language Model?
As I have mentioned in a previous post, large language models have a problem with factual knowledge alignment, and this is especially true for more specific, niche topics.
We can observe that the best-in-class LLM as of May 2023, GPT-4, does not deliver even the simplest everyday object:
Create a house in OpenScad
Imagine trying to build an actual house with this technology. Gah! What happened to my roof? I appreciate that my car stays dry, but really, it would have been much better to protect my living room.
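For contrast, here is what a correct code-built house could look like. This is my own sketch using Python’s trimesh library rather than OpenSCAD, with made-up dimensions, just to show the idea of composing primitives so that the roof actually sits over the walls.

```python
# pip install trimesh
import trimesh

# A box for the walls and a square pyramid (4-sided cone) for the roof.
walls = trimesh.creation.box(extents=[4.0, 4.0, 3.0])             # centered at the origin
roof = trimesh.creation.cone(radius=3.0, height=2.0, sections=4)  # wide enough to overhang
roof.apply_translation([0.0, 0.0, 1.5])                           # sit the roof on top of the walls

# Combine the two primitives and export a mesh a slicer / 3D printer can use.
house = trimesh.util.concatenate([walls, roof])
house.export("house.stl")
```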
There’s a Market for That