Redefining Limits: How Breaking Barriers Leads to Billion-Dollar Ideas
[Cover image: a baby and an elephant reading Digital Reflections]


December 14th, 2015 was a sunny day in Santiago, Chile. I had arrived in Santiago the day before to attend the International Conference on Computer Vision (ICCV). I knew Yann LeCun was an invited speaker, and I was looking forward to his keynote presentation.

After discussing the latest developments in the field, LeCun unveiled an AI that could answer questions about an image. The demonstration left us in shock and awe.

The human asked:

- "Is there a baby in the photo?"

And the computer responded:

- "Yes"

while the confidence of the prediction was displayed on the screen.

Human:

- "Where is the baby standing?"

AI:

- "bathroom"

As we looked on, we could see the picture on the screen of a baby in a bathtub, confirming the accuracy of the AI's response.
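Today you can reproduce that demo in a few lines. Here's a minimal sketch using the Hugging Face transformers library with ViLT, a publicly available visual-question-answering model; the model choice and the image filename are my own stand-ins, not what LeCun used:

```python
from transformers import pipeline

# A publicly available visual-question-answering model (ViLT).
vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

# Ask the same questions as in the 2015 demo.
for question in ["Is there a baby in the photo?", "Where is the baby standing?"]:
    best = vqa(image="baby_in_bathtub.jpg", question=question)[0]
    print(f"{question} -> {best['answer']} ({best['score']:.2f})")
```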

Earlier that year, at the TED conference in Vancouver, BC, Professor Fei-Fei Li from Stanford had presented an AI capable of generating captions for pictures of planes and elephants. She referred to it as "one of the first neural networks to integrate vision and language".
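That capability has since become an off-the-shelf building block. Here's a minimal sketch with a public captioning model; the model and image are my own choices, not the system from the TED talk:

```python
from transformers import pipeline

# A publicly available captioning model (ViT encoder + GPT-2 decoder).
captioner = pipeline("image-to-text",
                     model="nlpconnect/vit-gpt2-image-captioning")

# One caption for one picture, much like the 2015 system's planes and elephants.
print(captioner("elephant.jpg")[0]["generated_text"])
```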

Both of these events marked important milestones in 2015, as they brought the state of the art in machine learning to global audiences. In many ways, 2015 was reminiscent of 1984, when Steve Jobs launched the Macintosh, and the lights in the auditorium dimmed as Chariots of Fire by Vangelis played in the background.

- “Hello, I am Macintosh. It sure is great to get out of that bag!”

Like 1984, 2015 was a time of awe, wonder, and possibility.

According to legend, when Andrej Karpathy shared his progress in caption generation with his Ph.D. supervisor, Fei-Fei Li, she challenged him to do the reverse: to go from captions to images. Though she may have been half-joking, this opened the door to the development of systems that can generate highly detailed images from text inputs. These breakthroughs have given rise to the emerging field of AI art, which has captured the imagination of artists and technologists alike, with products such as OpenAI’s DALL-E, Google’s Imagen, and Midjourney. On the face of it, the idea of a computer generating images from text may have seemed like an impossible task. After all, creativity is a quality often thought of as inherently human. How could a computer be creative? But if we set that belief aside, the question is perfectly reasonable: if a computer can generate text from images, why can’t it do the reverse?
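Those products are closed, but the open-source equivalent is a few lines away. Here's a minimal sketch using the diffusers library with the publicly released Stable Diffusion weights; the model identifier and prompt are my own, and the identifier may change as the weights move between repositories:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the publicly released Stable Diffusion weights (identifier may change).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Text in, image out: the exact reverse of captioning.
image = pipe("a baby and an elephant reading a newspaper").images[0]
image.save("baby_and_elephant.png")
```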

Asking that question is an example of thinking from first principles. Many of the things we believe, we have accepted as true because someone with authority said so. Kids ask questions such as: why do we sleep? A reasonable answer would be: because if we don’t sleep, we die. Yes, but why?

Drilling down on the why can take us into unexplored territory where there are no authorities left to tell us why the world works the way it does. It is lonely down there, but like the Wild West, it is a territory full of possibilities. Start-ups need to spend more time in that territory: asking questions, challenging assumptions, and trying new things, even if they lead to failure. Otherwise, we are just repeating the formulas of the people who came before us. Paraphrasing Tim Urban, this is the difference between a chef, who experiments and learns how ingredients and cooking techniques can produce amazing results, and a cook, who simply follows the chef’s recipe.

Here are three of the milestone ideas that made image generation possible:

  1. Diffusion models (the technique behind products like Stable Diffusion): a process that gradually adds noise to data, paired with a neural network trained to reverse it. Think of the way cream diffuses in coffee: there is a natural progression to the way this happens, and a diffusion model learns to run that progression backwards, turning pure noise into an image step by step (see the toy sketch after this list).
  2. Latent space representations: this is how AIs compress high-dimensional data such as text, images, and audio into a vector representation that captures its underlying structure (its essence). An AI can then generate images that are similar to the training data by sampling from the latent space. Combine this sampling with a diffusion process and you are on your way to creating images!
  3. Visual-semantic alignment: a set of techniques for translating between text latent spaces and visual latent spaces. Networks can be trained to achieve this, and a particular type of network that does it really well is the transformer (see the CLIP sketch after this list).
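To make the cream-in-coffee intuition concrete, here is a toy sketch of the forward (noising) half of diffusion in NumPy. The noise schedule is illustrative, not taken from any particular paper; a real model trains a network to run this process in reverse:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64))  # stand-in for a real image, values in [0, 1]

# Forward diffusion: blend in a little Gaussian noise at every step.
# After enough steps the image is indistinguishable from pure noise.
betas = np.linspace(1e-4, 0.02, 1000)  # illustrative noise schedule
x = image.copy()
for beta in betas:
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.normal(size=x.shape)

# A diffusion model is trained to undo this: given the noisy x at some step,
# predict the noise that was added, subtract a bit of it, and repeat
# backwards until an image emerges.
```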
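Ideas 2 and 3 come together in models like CLIP, which embeds images and text into a shared latent space so that matching pairs land close together. Here's a minimal sketch with the public CLIP weights via transformers; the image file and captions are my own examples:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Public CLIP weights: one encoder for images, one for text, trained so that
# matching image-text pairs align in a shared latent space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["a baby in a bathtub", "an elephant in the savanna"]
inputs = processor(text=texts, images=Image.open("baby_in_bathtub.jpg"),
                   return_tensors="pt", padding=True)

# Compare the image embedding against each text embedding.
probs = model(**inputs).logits_per_image.softmax(dim=1)
print(dict(zip(texts, probs[0].tolist())))
```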

In my opinion, the next billion-dollar idea won't come from applying an existing foundation model in a clever way. While that approach certainly has potential and can lead to profitable business ventures, I believe that truly game-changing ideas will come from pushing the boundaries of what is possible. By questioning assumptions and approaching problems from first principles, we can overcome barriers that were once thought to be insurmountable.

What do you think the next billion-dollar idea will be? Let me know in the comments.

If you liked this article, subscribe for more!

Shay Porat

IT Operations & Infrastructure Director | Gateway Services Inc.

1y

Super interesting!

Uzair Hussain

AI Researcher at University Health Network

1y

Very cool summary! A reasonable assumption for an AI is that there exists one objective reality, R. Then text/symbols, neurons firing, pixel values, etc., are different viewpoints or “bases” to describe the same R. Each basis is good at capturing certain aspects of R: a text description would provide very high-level relations (baby-bathroom), while the first few layers of a CNN, for example, would capture relations between curve segments. A physics example of this is how observers travelling at different constant velocities will see different combinations of electric and magnetic fields. With tensors it is revealed that the different observers are just observing different faces of one fundamental object, the electromagnetic (EM) tensor, from which one can generate the E and M fields for any observer if they are known for one observer. Likely there exists a general model that is able to accommodate data in enough bases and create an asymptotically complete internal understanding of R, akin to the EM tensor. Once this internal representation is formulated in the correct maths (which may not exist yet), it could reveal things to us that are much more profound than EM unification (which is already extremely high on the amazing scale).
