Whiskey Tango Hugging Face
WTF is Hugging Face?
As we were all headed into the 4th of July, a company called Hugging Face was having some big moments. I’ll explain WTF Hugging Face is later. GitHub for LLMs is the shorthand.
What does that mean? The Open Source Software ($0 license to use) movement was announcing its presence as a major force, and as an ally of enterprises, on the GenAI stage.
Key moments were from IBM Research on GenAI in mid-June, and most importantly, the Databricks “Generation AI” conference at the end of June, with live perspectives from Marc Andreessen and Eric Schmidt.
Hugging Face made important appearances at both. Hugging Face (yes, this is their logo: 🤗) was giving GenAI street cred to some of the most foundational enablers of ongoing industry innovation.
What does that tell us? Essential answers to questions that I listed from my colleagues in my most recent article, especially about:
Hype cycle
We got insights into how robustly companies with sophisticated IT capabilities are determined to move GenAI from an amazing consumer app into the heart of business operations and customer-facing experiences. This includes what GenAI risks they are mitigating and how.
Hugging Face said 15,000 companies have posted 300,000 models, 50,000 datasets, and 100,000 open demos on their free service already. Essentially all of it is open source.
Insight - The pitfall is not that GenAI is mere hype. The pitfall is that the leading SaaS providers of GenAI may make the accompanying hype strong enough to cause enterprises to trade away their optionality in order to get the goods now.
Market power
We saw the Open Source community rallying to repeat the moves that led us away from a universe of dull software oligopolies and into the explosion of innovation that we’ve enjoyed over the last 25 years.
Insight - The names of people and communities in these late June events have a critical, common thread: Open Source battles against oligopolies. In those battles, large enterprise IT leaders are with Open Source.
The Arc of Digital History
Keys to the Walled Cities - The Open Source Software movement saved us from a stagnating computer industry multiple times from 1998 to 2015. The aforementioned Marc Andreessen, IBM, Eric Schmidt, and Apache Spark / Databricks were key players. [See afternote below.]
Cloud Cities - More recently, the Cloud giants significantly co-opted open-source software by offering all kinds of open-source frameworks as easy-to-consume services.
As Google is to Search?
Now, as the cost of training a great GenAI model goes from the current $10M quickly up to $1B, we are on track to face an oligopoly of Cloud companies that can dominate and require everyone else to pay to consume outputs of their models.
This outcome would make GenAI look like something familiar to Marketing people, which is Google Adwords. There is no open-source alternative to paying Google to place your link atop Google search results.
IBM refers to this as consuming GenAI via an API. You don’t own and operate it, so you pay what you must. More generously, we might just call this SaaS. Same economics.
Or Open Source Rides Again?
Large enterprises, such as JPMC, view the oligopoly outcome as an unacceptable risk. They are willing to delay adopting GenAI into their homegrown digital experiences if more time enables the alternate, open-source outcome instead.
(We all know their developers and marketing people are already using GenAI SaaS constantly for personal productivity, despite increasing prohibition by employers.)
So when you see the Open Source Software posse come together as we did a few weeks ago, you have to wonder if they can do it again.
We can see from their behavior that IBM and Databricks, who are pretty sophisticated in big data, software deployment, and AI, are one generation out of date on GenAI and cannot pull it off on their own. Old guard.
The New Guard - But in rides Hugging Face to breathe life into open-source for GenAI. Open-source dynamics are critical to solving the problems of excess vendor power and to creating confidence for enterprises to proceed with the widespread integration of GenAI into their cores.
Walls of the New Cities
$Billion with a $B?
If the cost of training an effective LLM for your own enterprise using your own data is $10 million or $100 million, then many enterprises can and will build their own.
If the cost is $1 billion or more, as Eric Schmidt described to Databricks, then they won’t.
So, large enterprises are taking a moment to explore whether the open-source models available on Hugging Face can meet their needs, with help from IBM, Databricks, AWS, etc., to train, test, and deploy.
Rogue Leaderboard
Hugging Face keeps a leaderboard, where you can see that the top-performing open-source models on multiple benchmarks have about 40 Billion parameters as of this writing. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
At the moment, the “Frontier” closed-source models from OpenAI and Google are denominated in 500 billion parameter increments. Neither company publishes the figure, but many observers have settled on an estimate of about 1.5 trillion parameters for GPT-4.
Entry Gates for Enterprises
Large company adoption seems to have three types of primary gates.
Data Protection
One is data custodianship. They certainly are not going to give their proprietary customer, product, and operational data to train an LLM that everyone else uses.
They also have to ensure that they don’t introduce personally identifiable information into the models. And for that matter, they would like to curate and clean up whatever data they use to train any model so they are passing along best practices and not bias, hate, profanity, etc.
So, if Hugging Face has pretty good open-source LLMs that these companies can make a copy of to train further using their own data, they greatly prefer this.
The question is whether pretty good is good enough in the face of GPT4 and PaLM.
Correct Answers
So the second gate is the ability of the model to give correct answers.
One way to get an LLM to not make up information is to make it smarter and smarter with a combination of larger models and larger sets of training data. That’s the Cloud giants’ game. Trillions of parameters and rising.
But there is a better way: point the model’s attention to definitive sources. This already happens with automatic prompt enrichment in ChatGPT Plus with web browsing and with plug-ins in the Azure world.
The open-source project LangChain has quickly become the practitioner-preferred means for building up a custom store of authoritative source data, which the model uses either to confine its answers entirely or as an assist.
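To make the idea of pointing a model at definitive sources concrete, here is a minimal sketch in plain Python. It does not use LangChain's API. The store contents, the function names, and the word-overlap scoring are all hypothetical stand-ins chosen for illustration; a real system would more likely use embeddings and a vector store. The `strict` flag mirrors the two modes described above: confining answers to the sources, or offering them as an assist.

```python
import re
from collections import Counter

# Hypothetical in-memory store of authoritative snippets (illustrative content only).
SOURCES = {
    "returns-policy": "Customers may return unopened items within 30 days for a full refund.",
    "shipping-policy": "Standard shipping takes five business days within the country.",
    "warranty-policy": "Hardware carries a one year limited warranty covering defects.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(question, passage):
    """Count shared words: a crude stand-in for embedding similarity."""
    q = Counter(tokenize(question))
    p = Counter(tokenize(passage))
    return sum(min(q[w], p[w]) for w in q)


def retrieve(question, k=1):
    """Return the k (doc_id, text) pairs that best match the question."""
    ranked = sorted(SOURCES.items(),
                    key=lambda item: score(question, item[1]),
                    reverse=True)
    return ranked[:k]


def build_prompt(question, k=1, strict=True):
    """Assemble the enriched prompt that would be sent to the model."""
    passages = retrieve(question, k)
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    if strict:
        # Confine the model to the retrieved sources.
        rule = ("Answer using ONLY the sources below. "
                "If they do not contain the answer, say you do not know.")
    else:
        # Offer the sources as an assist.
        rule = "Use the sources below as a reference where helpful."
    return f"{rule}\n\nSources:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt("How many days do I have to return an item?"))
```

The resulting prompt string is what would be passed to whichever model the enterprise runs. The model never needs to be retrained on the source data, which is part of the appeal for the data-protection gate as well.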
Eric Schmidt mentioned another idea, presumably protected by blockchain technology, of having definitive sources of truth available to any model when it loses track of reality. We’ll see about that.
Accountability
The third is governance. Large institutions have to show, not that they are perfect, but rather that they have made best-known efforts to uphold compliance of many kinds. And they have to show shareholders they are optimizing for cost, as in ROI.
Compliance asks for full traceability of what data they used to train a model, evidence of competent testing and ongoing quality control, and the ability to investigate an incident after the fact.
If they are just passing queries into a model owned, trained, and operated by a Cloud or SaaS giant, they may not meet their compliance requirements. This is both fuzzy and a strong negotiating point for enterprise buyers. Limiting your liability exposure gets the CFO in your corner.
Speaking of CFOs, so long as enterprises are consuming answers from someone else’s model, they can’t really get a fix on their future costs. This is already a problem with Cloud computing, but sophisticated companies use a combination of commercial terms, private clouds, and monitoring tools to try to make costs predictable.
Private clouds and trustworthy monitoring tools are made possible by… You guessed it. Open source solutions starting with Linux.
So enterprises want that combination for GenAI. Are they on track to get it?
Scorecard
Enterprise adoption is the colossal third wave. The first wave was developers. The second wave was simultaneously marketing and consumers (workers).
Let’s check in on how the Gen-AI-powered oligopoly model is working out in the first two waves.
Business Use Case #1 - Software Developers
Over 70% of developers use the same open-source development environment: Microsoft Visual Studio Code. And they all use GitHub, though very few pay for it. Let’s call that all free.
Enter GPT-powered co-pilots that could earn $200 a year from each of those 21 million users, or maybe from each of the 75 million GitHub users. That is $4.2 to $15 billion per year. Nice little business for Azure.
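The arithmetic behind that range can be checked with a short sketch. The price and user counts are the article's round figures; the `adoption` parameter is an added assumption (not in the original estimate) to show how the total shrinks if only a fraction of users actually pay.

```python
# Round figures taken from the text above; treat them as rough estimates.
PRICE_PER_USER_PER_YEAR = 200  # US dollars
USER_BASES = {
    "development environment users": 21_000_000,
    "GitHub users": 75_000_000,
}


def annual_revenue(users, price=PRICE_PER_USER_PER_YEAR, adoption=1.0):
    """Potential yearly revenue in dollars.

    `adoption` is a hypothetical knob: the fraction of users who pay.
    The article's range assumes everyone pays (adoption=1.0).
    """
    return users * price * adoption


if __name__ == "__main__":
    for name, users in USER_BASES.items():
        billions = annual_revenue(users) / 1e9
        print(f"{name}: ${billions:.1f}B per year")
```

At full adoption this gives the $4.2 billion and $15 billion endpoints quoted above. Passing a smaller `adoption` value, say 0.1, scales both ends down proportionally.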
Yet, overall, while they are willing to pay, developers are not going to get locked into a walled city. When you listen to Microsoft, Google, and AWS, they all continue to say they will not (dare not) try to do so. Instead, they actively promote and contribute to open-source alternatives to their own products. “Stay in my city and I promise not to lock the gates.”
Score one against Oligopolies.
Business Use Case #2 - Marketing
Some marketers would be able to name one open-source marketing tool if they really thought about it. [WordPress for websites.]
Marketing is essentially fully captured as a SaaS model. But it is highly fragmented. Remember the MarTech 11,000? Remember that enterprises are running hundreds of different SaaS applications?
Only the very largest full-suite providers, like Adobe and Salesforce, will have the resources to play SaaS and open-source GenAI models off each other. Virtually every other marketing SaaS provider is busily building features that use OpenAI as their back end.
So the Cloud giants can count the Marketing use case as firmly leading to oligopoly. It will be one to two dominant LLMs monetizing the MarTech 11,000, plus a couple of SaaS giants with full-featured marketing suites who still rely heavily on the same dominant LLMs.
Clearly, consumers too, as usual, are happy to run into the city of ChatGPT. “What walls? I’m just having fun!” ChatGPT is breaking up Google’s search monopoly…into an Oligopoly.
Score one for Oligopolies.
Breaking the Marketing Oligopoly?
The interesting new arrival on this playing field is a segment called CDP, Customer Data Platform.
Fragmented customer data scrambles your understanding. If you run everything on Salesforce or Adobe, they include a CDP, will integrate the data, and will apply all manner of GenAI to help your teams and measure your progress.
Otherwise, you need a CDP if you want to see customers clearly.
CDPs may span enough of the MarTech 11,000 to get enough scale to compete with Salesforce and Adobe, and enough sophistication to use open-source GenAI models as their back ends. Old Guards Oracle and IBM are offering CDPs. And growth companies such as Mixpanel, Segment, and Treasure Data are doing well.
Now You Know
I can assure you that OpenAI, the Cloud giants, Adobe, Salesforce, Oracle, IBM, and these new CDP players know exactly WTF is Hugging Face!
More to come.
...
Afternote: 6 Keys that Unlocked Walled Cities
Every several years, an Open Source Software movement rides to the rescue. The forces of scale and first-mover in technology combine into colossal advantages for the first provider to gain massive adoption. Then, they wall us in.
This dynamic, called Masters of Scale, is so profound, as Eric Schmidt reminded us in his chat with Databricks, that OpenAI was able to raise many billions of dollars without any kind of monetization plan. Usage at scale is what matters.
When this happens, the users willingly go into a walled city where life is great but where prices are decreasingly affordable and the city operators are decreasingly creative. The cost of leaving is also very high, especially if the only place to go is another walled city.
The world of technology suffered this under the original IBM, then again under the original Microsoft.
(Origin Story of Open Source: computer programming languages have always been or become essentially Open Source, and companies who created or supported them, attempted, if anything, to monetize tools and infrastructure around them.)
Here are the key times when an Open Source movement threw the keys into walled cities and invited its citizens to live in freer, if slightly less organized, cities.
Netscape 1998 - When Andreessen’s Netscape lost the browser wars to Microsoft, Netscape helped found both the Mozilla and OSI foundations in 1998. (The Apache Foundation started the next year, and became the furnace of Open Source projects.)
Linux 2000 - At the end of the year 2000, IBM announced it would invest $1 billion, which used to be a lot, into Linux as a counterweight to dominance by Microsoft, and also as an answer to the frustrations of proprietary Unixes from Sun, Silicon Graphics, IBM itself and others.
Without Linux (and Apache), there would be no cloud computing as we know and enjoy it. IBM, with Red Hat, is a primary commercial supporter of Linux today.
Android 2007 - The walled city of Apple iOS, where many of us remain blissfully trapped, was so spectacular that the world was genuinely headed for a monopoly. In rode Eric Schmidt at Google. Were it not for Samsung’s harnessing of Google’s open-source Android, now joined by Google Pixel, to push iPhone innovation to this day… well, the truth is it’s a duopoly and new phones are expensive, but at least they are innovative.
Apache Hadoop & Spark 2010 - Oracle was so dominant and so expensive in relational database technology that only an open-source initiative could break through to address the growing need to process, store, and analyze the flood of new data from streaming video, website click streams, IoT sensors, sensors built into every part of the IT stack, etc.
Guess where the commercial home of Spark and its successful offspring, especially Delta Engine, is today. Databricks. Founded by some of the professors who created Spark.
GitHub 2012 - That is when it reached a million users. Today, this free website powered by open-source versioning technology is used by essentially all developers in the world as a code repository and exchange. It's owned by Microsoft now. It is unlike the others on this list because there was no commercial alternative to break from. The need emerged from the developers themselves, and this company filled it with a hybrid open-source and freemium service.
You can see why Hugging Face is useful as the GitHub for GenAI. It's both built-for-purpose and separate from the ultra-close partnership of OpenAI and Microsoft.
Docker Engine and Kubernetes 2015 - VMware dominated virtual machines from 1998 and was wildly successful at separating software from hardware. Without virtual machines from VMware, there would be no elastic nature to the Cloud. But there were no good alternatives, and it was expensive. By 2015, developers had realized the open-source Docker Engine was superior.
Today, you hear a lot more about Kubernetes, which orchestrates containers, most of which are Docker containers, and any others are also open source. Google open-sourced Kubernetes in 2014 and released version 1.0 in 2015, one could argue, as a counter to the growing dominance of AWS Cloud.
Which takes us into the world we know today.
Also along the same lines: McKinsey & Company's approach, with Cohere, puts top priority on enterprise data security. https://txt.cohere.com/cohere-and-mckinsey/
A CIO survey published by MIT Technology Review yesterday points the same way. https://www.technologyreview.com/2023/07/18/1076423/the-great-acceleration-cio-perspectives-on-generative-ai/