Tiny but Crucial Details of GPT-4o and Why Leaders Need Deep Technical Understanding
Understanding technical details is critical for leaders

As someone who teaches AI to executive audiences, a common question I get is whether a non-technical person can end up as an AI or tech lead. My answer is nuanced, and the GPT-4o presentation can help explain it.

GPT-4o is a huge improvement over GPT-4, mainly due to: a) its real-time voice and image/video conversation mode, b) its multilingual capabilities, and c) its low price. If you didn't see the presentation, I recommend that you watch a few demos. The real-time translation or conversational speech demos would be a good starting point.

How does a voice chatbot work?

In its mobile version, the previous GPT-4 already had a voice mode. It used text-to-speech (TTS) and speech-to-text (STT) systems connected to GPT-4 (a text-only model) to build the voice-based experience. But what we saw in GPT-4o's demo is a multimodal AI that combines audio, image, and text inputs in a single model. The "o" in GPT-4o stands for omni.
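To make the difference concrete, here is a minimal sketch of the two architectures. Every function in it is a hypothetical stand-in, not a real OpenAI SDK call; the point is the shape of the two call chains, not the code itself.

```python
# A minimal sketch of two ways to build a voice chatbot.
# All functions are hypothetical stand-ins, not real SDK calls.

def transcribe(audio: bytes) -> str:
    return "placeholder transcript"   # stand-in for a speech-to-text (STT) model

def chat(prompt: str) -> str:
    return "placeholder reply"        # stand-in for a text-only LLM such as GPT-4

def synthesize(text: str) -> bytes:
    return b"placeholder audio"       # stand-in for a text-to-speech (TTS) model

def omni_chat(audio: bytes) -> bytes:
    return b"placeholder audio"       # stand-in for an audio-in/audio-out model like GPT-4o

def voice_turn_pipeline(user_audio: bytes) -> bytes:
    """Previous GPT-4 voice mode: three models chained, with text in the middle."""
    return synthesize(chat(transcribe(user_audio)))

def voice_turn_omni(user_audio: bytes) -> bytes:
    """GPT-4o style: one multimodal model consumes and produces audio tokens directly."""
    return omni_chat(user_audio)
```

In the chained version, every hop adds latency and discards non-textual signal (tone, laughter, background noise); the omni version keeps everything inside a single model.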

The text-only GPT-4o model was released the same day as the presentation, but the full multimodal version was not. A lot of users got confused, and no wonder: the two look really similar. That is the point of this article: the nuances you can only sense with technical knowledge.

Both are so similar that Sam Altman had to clarify on Twitter this week that the new voice mode has not been released yet.

Tokens are the building blocks of language models. A tokenizer is trained to break down text into smaller units called tokens. Let's see an example.

A tokenizer working as you type. Note how some tokens "change" as you type to absorb more text.
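If you want to play with this yourself, OpenAI's open-source tiktoken library exposes the tokenizers its models use. A minimal sketch (the exact splits may vary between tokenizer versions):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4 / GPT-4-turbo

text = "Tokens are the building blocks of language models."
token_ids = enc.encode(text)

print(len(token_ids), "tokens")
print([enc.decode([t]) for t in token_ids])  # how the text was split into pieces
```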

This concept can be extended to voice, image, or video. For example, a sound is composed of many audio tokens: short fragments of audio.

GPT-4o stands out for its ability to work with multiple types of input and output tokens, such as text, images, and sound. This is what we call multimodality.

Because GPT-4o generates audio tokens, not text that is sent to a TTS system, it can emulate laughter in a natural and very convincing way, and even sing. But it also produces strange artifacts, especially with non-English voices. GPT-4o is a totally different product, even though the UX is similar to the previous GPT-4 voice system.

It seems to "think in real time". Because it is fast and works directly with sound, the latency between the user's request and the answer is minimal. This, on top of interruption detection, makes GPT-4o feel quite magical.

But note that older alternatives like GPT-4-turbo were fast enough to give a real-time experience before GPT-4o was released. See vapi.ai as an example of real-time conversation using a more traditional STT → Gen AI model → TTS approach.

From the Vapi docs. You can try their demo on the home page to see whether you feel any difference compared to the GPT-4o demos. I am not affiliated with them in any way; I simply admire well-designed products.

Multimodality comes with challenges: training data is scarce and there are more cybersecurity risks. See how GPT-4o was jailbroken into breaching OpenAI's safety rules with a single image (published one day after the release).

Another important point will be price. Model usage is priced by tokens. Because images, voice, and text are all tokens, we need to know how many tokens an image or a voice recording consumes.

As far as I know, there is no public figure for GPT-4o's audio token density per minute, so I will use an image as a proxy. A really small image (150 x 150 px) costs 255 tokens.
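What we do have is the tile-based formula OpenAI has published for its vision-capable models (85 base tokens plus 170 tokens per 512 x 512 tile in high-detail mode), and it reproduces the 255-token figure above. A minimal sketch, assuming that formula carries over to GPT-4o and ignoring the resizing steps the API applies to large images:

```python
import math

def image_tokens(width_px: int, height_px: int,
                 base: int = 85, per_tile: int = 170) -> int:
    """Estimate vision token cost: a fixed base cost plus a cost per 512x512 tile.
    Simplified: the real API first resizes large images (fit within 2048 px,
    shortest side 768 px) before tiling; that step is omitted here."""
    tiles = math.ceil(width_px / 512) * math.ceil(height_px / 512)
    return base + per_tile * tiles

print(image_tokens(150, 150))    # 255 tokens, the figure quoted above
print(image_tokens(1024, 1024))  # 765 tokens (4 tiles)
```

A single small image is cheap; the open question is how that arithmetic looks for a continuous stream of audio tokens.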

So even though the announcement highlights the price reduction, we don't yet know how expensive GPT-4o will be for voice chatbots compared to a traditional approach. I would not bet much on either option: GPT-4o is cheap, but sound will consume more tokens, and STT/TTS services are getting better and cheaper by the month.

And what about technical leadership?

This is what happens when you have enough technical understanding:

We cannot all compare ourselves with Andrej Karpathy, but this is a great example of what technical knowledge brings to the table and how it helps our decision-making. Let me go through an idealized thought process under two different lenses:

Standard takeaway

  • GPT-4o is now real-time.
  • GPT-4o is the main/only model to be used for voice chatbots.
  • GPT-4o is multilingual, so it can translate now and/or is better for any multilingual use cases.

Vs. the deep-knowledge takeaway

  • GPT-4o is multimodal, and that enables real-time experiences.
  • However, there are other valid approaches that are fast enough for real-time experiences, such as GPT-4 (or even 4o) + STT + TTS.
  • For text-based tasks, GPT-4o probably improves on its predecessor (has more training data, a better tokenizer, etc.).
  • But for voice-based use cases, some less common languages could show errors and artifacts in voice generation due to a lack of data. If GPT-4o is deployed globally, testing is required; maybe a standard STT + TTS system outperforms the multimodal AI.
  • And so on...

As you can see, the technical point of view can add a lot of value to your decision-making.

The point is not: "tech folks are better," but that you need to really understand both sides of the game in a deep way to get the right insights. This idea clashes with some common attitudes among decision-makers. More on that later.

Deep knowledge makes you more prudent, surfaces more nuance, and leads to more complex insights. You are seeing the Dunning-Kruger effect in action.

It's no surprise that seasoned tech leads tend to prefer old and proven technologies and don't blindly follow the new flashy thing.

It's also no surprise that "quick-and-simple" analyses or "bite-size" content targeted at executive audiences tend to be extremely poor and trigger bad decision-making. You cannot fit nuance into a two-minute report. Another topic for another day.

Multilingual Capabilities

The GPT-4o demo was really strong in multilingual capabilities. OpenAI used far more training data in non-English languages, which improves the model's performance, but something really important went mostly unnoticed: the tokenizer.

Tokenizers penalize less frequent character sequences, and therefore non-English languages. See an example tokenizing "cien cañones por banda" in Spanish and in English with the old tokenizer:

English has a better token-per-word density: roughly 1.0 token per word versus 1.8 for Spanish. Non-English words tend to be split up.

Note that "cannons" is one token in English, but 3 in Spanish ("ca/?/ones"). This has huge implications. It is so important that it was highlighted by Greg Brockman (co-founder and president of OpenAI).

Fewer tokens mean cheaper inference: in some languages, up to 4x fewer. But this improved tokenizer also improves GPT's performance.
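You can check the density gap yourself with tiktoken, comparing the old GPT-4 tokenizer (cl100k_base) with GPT-4o's (o200k_base). A minimal sketch: it assumes a tiktoken version recent enough to include o200k_base, the English phrase is my own rough translation, and the exact counts are illustrative:

```python
import tiktoken  # pip install tiktoken (a recent version that includes o200k_base)

phrases = {
    "Spanish": "cien cañones por banda",
    "English": "a hundred cannons per side",  # approximate translation, for comparison
}

for name in ("cl100k_base", "o200k_base"):  # GPT-4 tokenizer vs GPT-4o tokenizer
    enc = tiktoken.get_encoding(name)
    for lang, phrase in phrases.items():
        ids = enc.encode(phrase)
        density = len(ids) / len(phrase.split())
        print(f"{name:12s} | {lang}: {len(ids)} tokens ({density:.1f} tokens per word)")
```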

GPT (like any other Transformer-based AI) has an attention mechanism. It combines tokens to give them meaning according to their context, as excellently explained in this Financial Times piece: "interest rate" is different from "I have no interest in politics", although "interest" is the same token in both phrases.

In languages other than English, where words are more often split into several tokens, the attention mechanism needs extra effort to "join" those tokens:

"Ale" and "jandro" are being "joined" with the atte. mechanism.

Therefore, by changing the tokenizer to be multilingual-friendly, we have just achieved:

  • Lowering costs for languages that are less token-efficient than English. Those savings add up on top of GPT-4o's reduced price per token.
  • Letting Transformers perform better, because sentences become easier to process: they no longer "waste" attention stitching split words back together (a quick back-of-the-envelope calculation follows below).
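To put rough numbers on both points, here is a back-of-the-envelope sketch using the illustrative densities mentioned earlier (1.0 tokens per word for English and 1.8 for Spanish with the old tokenizer); these are illustrations, not measurements:

```python
# Back-of-the-envelope: how tokens-per-word density affects context use and attention cost.
words = 500                 # same content in both languages (roughly)
density_en = 1.0            # tokens per word, English, old tokenizer (illustrative)
density_es = 1.8            # tokens per word, Spanish, old tokenizer (illustrative)

tokens_en = round(words * density_en)   # 500 tokens
tokens_es = round(words * density_es)   # 900 tokens

print(f"Context used: {tokens_es / tokens_en:.1f}x more for Spanish")
# Self-attention relates every token to every other token, so compute grows
# roughly quadratically with sequence length:
print(f"Attention pairs: {tokens_es**2 / tokens_en**2:.1f}x more for Spanish")
```

The exact densities vary by language and tokenizer version, but that quadratic effect is why a "small" tokenizer change matters so much.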

There are a few interesting technical takeaways about this, but I will keep them for the "A Final Exercise" section.

So if deep technical knowledge is so useful, why is it not more widespread among decision-makers? There are many reasons, but I will highlight one: a cultural problem.

The two-cultures problem in technical leadership

There is a persistent attitude of contempt towards technical knowledge from "business" or "strategy" roles, and vice versa. This is really harmful to organizations. In my experience, this attitude can take three forms:

1. Strategists vs. worker bias: "I am a leader, I don't know how to program/develop/do this stuff because I make strategy/think big."

Leadership or strategy defined as the opposite of implementation is a bad idea. Better-informed leaders make grounded decisions, and their stakeholders and teams benefit from that. HBR's "If your boss could do your job, you're more likely to be happy at work" is a good resource on this.

2. Job description strictness: "This is not what I have to do."

That is (partially) true. But knowing how the work is done, and its fundamentals, is part of a leader's job: those fundamentals shape the market and its behavior, and they overlap with a "pure" technical job description.

3. Opportunity cost blindness: "I have no time to learn all those things."

Which really means that technical knowledge is not perceived as valuable enough to prioritize. I hope this article helps you see how much you can get from digging into the technical side.

Navigating hyped waters

So, how technical should technical leadership roles be? There is no perfect answer, but we can focus on what happens if they are not technical enough.

Do you remember the 2014-15 Big Data hype? I do. I remember terrible bets and purchases of expensive solutions that were thrown away within one or two years, when the main stakeholder left the company or the growing cost no longer justified further investment.

We should learn from those crazy years because it is happening again.

Deep knowledge is the most important skill for technical decision-makers in an overhyped market. It will save you a ton of money and a few bad surprises.

Gen AI is different from other technological shifts: more complex, faster, and probably more impactful.

There will be no manuals or use cases to copy and paste in the short term, and they will be outdated really fast. Worse yet, many of the ones that do appear will be trying to sell you something.

In early-stage phases of innovation like this, exploration, MVPs, iterative testing, internal know-how (yours!), and expectation management are key elements of success where traditional consulting struggles to deliver. This is a good topic for another article.

Learn how to develop with Gen AI, especially if you are not a developer.

You should learn things that feel useless at first; they are worth it when you dodge the bullet of a bad use case, save 10x the price of an over-engineered solution, or reduce your team's turnover.

Gen AI has a nice characteristic: it is quite no-code friendly. You can understand a lot of its fundamentals without getting lost in other details.

Prudence is a side effect of knowledge. You will start to discard some perfectly presented and marketing-shiny ideas. Congratulations, now you have criteria.

Some peers won't understand that. Use this in your favor: explain your reasoning and show your deep know-how. It is a clear differentiator, but only if you can explain it. And as a teacher, trust me: you need to understand something in order to explain it.

A final exercise

Let's sum up the main insights we got from the GPT-4o presentation through my "technical lenses" - though I may be mistaken. Try to work out how many of these insights you would have discovered on your own and which ones needed more technical nuance to grasp.

  1. Multimodal models will be the next generation of generative AI, but not all voice-based and real-time use cases will require them. We will probably see a cost-performance stratification.
  2. Non-English languages and specific accents could struggle in voice generation due to a lack of multimodal datasets. This is noticeable in some demos released by OpenAI.
  3. There is a temporary advantage (maybe a few months or years) if you have or collect conversational audio, especially in languages other than English. The performance in less popular languages and accents could be a competitive edge for many clients if you get it right.
  4. Multimodal AIs learn background noises, errors, filler words, hesitation markers, etc. So invest in data quality: stereo audio with separated channels, better recording quality, transcribed and enhanced data, and so on.
  5. Quality benchmarks should be split by language and accents if you work in different geographies.
  6. When comparing costs, we will need to differentiate among languages. Cost per token and words per token density are critical, and the tokenizer affects that, whether it is a multimodal AI or not.
  7. Models with more sensible tokenizers and training datasets will have a significant advantage in multilingual use cases. The size of the model and the training dataset size will not be the only factors in quality.
  8. We will need to create evaluation datasets for those languages. Public evaluations are biased in favor of English.
  9. Real-time multimodal interactions have a lot of challenges to evaluate: cost and cybersecurity are important ones. We do not have data yet to know which is cheaper and if it's worth all the risks that multimodality conveys.
  10. Expect a lot of jailbreaking and cybersecurity risks in multimodal models at the beginning. There are more ways to "trick" a model that inputs rich information like sounds, video, or images.
  11. There are many other multimodal use cases waiting, such as text-to-3D and image-to-3D for game asset generation, or video-to-text for security footage analysis, to mention a few.
  12. Some of these will be accessible in a few weeks when "full" GPT-4o is released. If you have access to or can build a "niche" multimodal dataset, you can create an interesting business case there.

Don't worry; it is not important how many of them you would have discovered. The goal of this article is to help you understand how technical and business knowledge should be combined to make a good leader and strategist.

It is not possible to know how much you don't know. So learn.

So learn. Even things that seem out of your scope; especially those. Mix tech and "business" know-how. It will not hurt, and it will probably help a lot.

Was it interesting? Did you learn something new?

This is the first article in the "Tiny but Crucial" series. For the next one, I could use your help: I will leave a short survey (two questions) in the comments to analyze another strong bias among decision-makers investing in AI.

I have left a few interesting topics for future articles. Let me know what you would love to read, and follow me if you want to see more of this.


Alejandro Vidal

Generative AI Development & Strategy


Two-question survey. It will help me gather some real data to explain a bias to decision-makers. I will release the data and the analysis in a few days. Thanks! https://tally.so/r/3xd595
