Your AI dreams are coming true.
How do AI models actually work?
At the highest level and in the simplest terms: AI models transform data (such as the raw text of many stories) stage by stage into higher and higher level "ideas" (such as "this is a fairy tale about a boy climbing a beanstalk") and then run the same process in reverse to transform these ideas back into the original input data. This is shown in the bowtie picture above.
These AI models start out random (i.e. literally knowing nothing) and are then continuously tweaked so that they get slightly better each time at outputting the same data that they are given as input.
What use is that, you might ask?
This is useful because the model and training algorithm are set up specially so that the model cannot just copy the data from the input to the output; it is forced to gradually transform the data into "ideas" before it can do this successfully. As it is tweaked many times over to slightly improve its ability to output the input, the essence of high-level "ideas" starts to emerge (by itself) deep within the model.
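For readers who like to see things run, here is a toy sketch of the bowtie in plain Python. It is an illustration under heavy assumptions, not how ChatGPT or Midjourney are actually built: the "data" is just lists of four numbers, the "ideas" are two numbers in the narrow middle, both halves of the bowtie are simple linear maps, and every name (`train`, `reconstruct`, `enc`, `dec`) is made up for this example. The point it tries to show is that the model starts from random weights, cannot copy four numbers straight through a two-number bottleneck, and yet learns to reproduce its input after many small tweaks.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def reconstruct(enc, dec, x):
    """Left half of the bowtie squeezes x into 'ideas'; right half expands them back."""
    ideas = matvec(enc, x)
    return ideas, matvec(dec, ideas)


def mean_error(enc, dec, data):
    """Average squared difference between each input and its reconstruction."""
    total = 0.0
    for x in data:
        _, y = reconstruct(enc, dec, x)
        total += sum((yi - xi) ** 2 for yi, xi in zip(y, x))
    return total / len(data)


def train(data, n_in=4, n_ideas=2, steps=2000, lr=0.05, seed=0):
    """Start from random weights and repeatedly nudge them to reduce the error."""
    rng = random.Random(seed)
    enc = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_ideas)]
    dec = [[rng.uniform(-0.5, 0.5) for _ in range(n_ideas)] for _ in range(n_in)]
    before = mean_error(enc, dec, data)

    for _ in range(steps):
        g_enc = [[0.0] * n_in for _ in range(n_ideas)]
        g_dec = [[0.0] * n_ideas for _ in range(n_in)]
        for x in data:
            ideas, y = reconstruct(enc, dec, x)
            err = [yi - xi for yi, xi in zip(y, x)]
            # Gradient of the squared error with respect to the decoder weights.
            for i in range(n_in):
                for j in range(n_ideas):
                    g_dec[i][j] += 2 * err[i] * ideas[j]
            # Push the error back through the decoder to reach the encoder weights.
            back = [sum(dec[i][j] * err[i] for i in range(n_in)) for j in range(n_ideas)]
            for j in range(n_ideas):
                for k in range(n_in):
                    g_enc[j][k] += 2 * back[j] * x[k]
        # One small tweak in the direction that lowers the average error.
        scale = lr / len(data)
        for i in range(n_in):
            for j in range(n_ideas):
                dec[i][j] -= scale * g_dec[i][j]
        for j in range(n_ideas):
            for k in range(n_in):
                enc[j][k] -= scale * g_enc[j][k]

    return enc, dec, before, mean_error(enc, dec, data)


# Toy data: four numbers per example, but only two underlying "ideas" (a and b).
data = [[a, a, b, b] for a in (-1.0, 0.0, 1.0) for b in (-1.0, 0.0, 1.0)]
enc, dec, before, after = train(data)
print("error before training:", before)
print("error after training: ", after)
print("ideas for [1, 1, -1, -1]:", reconstruct(enc, dec, [1.0, 1.0, -1.0, -1.0])[0])
```

Nobody tells the model that each example is really described by just two values. It is only ever asked to output its own input, and the narrow middle leaves it no choice but to find a compact description. Real models do the same thing with far more data, far more stages and non-linear steps in between.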
Once the model has been built and high-level ideas have been deeply embedded, new partial inputs can be provided (such as the user asking a new question) and the embedded ideas are invoked to generate new data such as new stories or new pictures. Techniques for doing this part have been at the core of AI research and development over the past three years.
The explanation outlined above applies to all modern large AI models, such as ChatGPT for language or Midjourney for images.
ChatGPT and Midjourney (among notable others) made a surprising impact on public perception when they were released, but they won’t radically change the world as they are. Now though, we are clearly heading into a very different future, because powerful and practical multi-modal AIs are upon us. See GPT-4o, also from OpenAI, which was just released.
Multi-modal means that the model does not just take text or audio or images as input and generate an output, but instead takes several "modes" simultaneously (e.g. text and audio and vision), seamlessly interpreting across them to generate output across these same modes. This is a natural step in AI development and was bound to happen because of the fundamental nature of how these models work, as explained in the bowtie. “Ideas” are the same regardless of the form of the data, so being able to seamlessly convert from text, pictures, videos or sound to ideas and back again has long been an objective for AI researchers.
This means everything will change. Your world, my world and certainly your children’s world will be entirely different in the future. Move over internet, you have nothing on this disruption.
Why? Because AI models will not only accelerate specific tasks, they will be able to complete high-level objectives in their entirety, and potentially without end. Resources, not competence, will become the limitation. The most challenging “mode” still remains for now: movement. This is being researched by several companies as I’m sure you have seen, with Google DeepMind and Boston Dynamics at the forefront. It is only a question of when movement will be included in multi-modal models, not if.
Once multi-modal AI models also include a physical movement mode based on a competent robotic body, then our AI dreams come true. Or is that a nightmare? That is really up to us to choose.
What do you think? How will we choose?
Dear Chi, you have the ability to describe in a nutshell something that seems too complicated for most people to understand. Very interesting and clear... Of course we will choose a new pseudo-human to help us in our daily life.