Three conversations about data
Like Scrooge, I experienced three conversations before Christmas that left a profound impact, along with the conclusion that I should wind down public speaking, and that we need to reconsider our company.
I happened to have conversations about data and AI from different perspectives with employees at three different large enterprise companies. The first company A is one of Sweden's old industrial crown jewels, one of the Wallenberg assets or a similar company. The conversation went something like this:
A: You say you are doing something different than data warehousing, but it all seems similar to me. I have watched your presentations, but I don't see the difference.
Me: <Explains for 30 minutes, conjuring slides from my pile, which has grown sizable over the years.>
A: Ok, I accept that big data, data factories, immutable pipelines or whatever it is called is a different processing paradigm than warehousing. But I still don't see how it could contribute to the orders of magnitude in efficiency that you mention.
Me: …
Every issue of the annual State of DevOps report has reported a 100-1000x span in productivity and quality KPIs between the top quartile and the followers. The report has now been published for a decade, so it is no secret that productivity in software engineering differs by orders of magnitude. If old industry enterprises are not the followers, who are?
In 2016, Google revealed that they produced 1.6 billion datasets / day. The other data leaders (Meta, Netflix, Spotify, …) are not quite there, but we can deduce that the datasets / day are counted in the millions, which means thousands per developer. In a typical enterprise Snowflake or Databricks installation, that number is more like 100-1000 in total or 1-10 per developer. Those numbers cannot just be cranked up by 1000x at will. "We will increase from 10 lines of SQL per developer per day to 10,000 lines of SQL." is simply not how things work. It might be hard to imagine what impact those 1.6B datasets have, but one might think that it at least would be obvious that something is radically different and that it translates to business value.
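As a rough illustration of the gap described above, the per-developer figures can be put side by side. This is only back-of-the-envelope arithmetic: the value of 1,000 datasets per developer per day is an assumed, conservative reading of "thousands", and the enterprise range is the 1-10 quoted above.

```python
# Illustrative arithmetic only. The ranges come from the text above;
# the specific representative values are assumptions for this example.

def productivity_gap(leader_per_dev: float, enterprise_per_dev: float) -> float:
    """Return how many times more datasets per developer per day the leader produces."""
    return leader_per_dev / enterprise_per_dev

# "Thousands per developer" at the data leaders: assume 1,000 as a conservative value.
LEADER_PER_DEV = 1_000

# "1-10 per developer" in a typical enterprise installation.
ENTERPRISE_PER_DEV_LOW = 1
ENTERPRISE_PER_DEV_HIGH = 10

gap_min = productivity_gap(LEADER_PER_DEV, ENTERPRISE_PER_DEV_HIGH)
gap_max = productivity_gap(LEADER_PER_DEV, ENTERPRISE_PER_DEV_LOW)

print(f"Gap in datasets per developer per day: {gap_min:.0f}x to {gap_max:.0f}x")
```

Even with the conservative assumption, the gap lands in the same 100-1000x span that the State of DevOps reports describe for software delivery.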
In the 80s, the American car manufacturers claimed that quality surveys indicating that Japanese cars had superior build quality were falsified. Their reactions were similar to the denials that we see among Swedish incumbents when software and data productivity KPIs are mentioned. In a way, it is human and understandable - if someone else is doing a particular job 100x more efficiently, what does that say about me, or my team, my coworkers, my suppliers, or some concept that I bet my career on?
When deciding the path ahead, companies in denial usually end up asking not for the best choices, but for safe choices and the most popular components. The ghost of data past holds them back to data methods from the 90s, using horses and carts in a world where leaders are motorised. The horses are shiny new and faster than before, or in the case of the lakehouse, Databricks's genius sales invention, a steel horse that is motorised but still trots like a warehouse so it won't move too fast and the old skill sets still work.
A few days later I had a conference lunch and spoke to a manager from company B, also an old Swedish crown jewel.
B: I see great potential in generative AI for internal analytics, since it can enable business people to query their data warehouse without needing to ask more technical people.
Me: I posted this: https://www.dhirubhai.net/posts/larsalbertsson_i-just-heard-this-wdyt-please-discuss-activity-7170412249428299776-NMM4. The reactions to the described system have ranged from "Brilliant, I want one of these" to "What? Why do the enterprises keep solving organisational problems with expensive technology?". In companies with cross-functional teams that include both business and technical people, teams are already enabled to get any data they need. Aren't you applying an expensive patch for a symptom instead of solving the real problem by rearranging people?
B: I know, some companies organise like that and it is more effective, but the separation of business and tech is the reality that we have. It can never change, and we have to live with it.
Me: But your US Pacific competitor X recruits from Silicon Valley, with a culture of close business and tech collaboration, and has solved this problem for real.
B: Yes, I know, but they have so many other problems that we can compete with them anyway.
So, even if the solution is known, the ghost of data present is so strong in holding them right where they are that making real progress is unthinkable. I do agree that generative AI for analytics does have merits in big corp settings and can be part of a real solution, but that's beside the point. What stood out to me is the relation to progress and the acceptance of a state of corporate mediocrity in an important capability.
Company X is growing faster than B, has 3x the revenue, and 20x the market cap.
Next, a hardware vendor C spoke enthusiastically about the potential that generative AI brings. Traditional hardware vendors have not been favoured by all the migrations to cloud, but now they rejoice at the green field of opportunities, as long as they can convince customers that they need GPUs.
C: It's a Klondike out there! We are currently doing workshops with a legal sector government agency that wants a RAG solution.
Me: But is generative AI appropriate for legal use?
C: As long as references are included, it should be ok.
Me: A RAG includes a search engine. Are you saying that a traditional search engine is the part that is actually useful, and not the generative AI?
C: Pretty much, but everyone wants generative AI.
The ghost of data yet to come is so strong that the incumbents believe that they need and can utilise racks of GPUs. Since competence in turning data and AI into sustainable value is scarce, many buy the pixie dust. When it's rich companies, I don't mind so much, but when it is our taxpayer money, it hurts inside. Each million wasted in schools, healthcare, or the legal system causes real harm. Like Klondike, the only groups that collectively earn a net profit are vendors - shovel salesmen, saloon owners, and server manufacturers. There is real value in generative AI, but use cases and return on investment are rarer than the hype indicates, and building reliable products requires systematic application-specific data engineering, which is out of reach for many buyers due to scarcity of data engineering skills.
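The point from the conversation with C, that a RAG system contains an ordinary search engine, can be made concrete with a minimal sketch. Everything here is hypothetical: the toy corpus, the `doc-N` references, the keyword scoring, and the `generate` placeholder, which stands in for a language model call and is not a real API.

```python
from dataclasses import dataclass


@dataclass
class Document:
    ref: str   # reference shown to the user (hypothetical identifiers)
    text: str


def search(query: str, corpus: list[Document], top_k: int = 2) -> list[Document]:
    """Traditional keyword search: rank documents by number of shared query terms."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc.text.lower().split())), doc) for doc in corpus]
    hits = [(score, doc) for score, doc in scored if score > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in hits[:top_k]]


def generate(query: str, sources: list[Document]) -> str:
    """Placeholder for the generative step; a real system would call a model here."""
    refs = ", ".join(doc.ref for doc in sources)
    return f"[generated summary for '{query}' based on: {refs}]"


def rag_answer(query: str, corpus: list[Document]) -> tuple[str, list[str]]:
    """Retrieve first, then generate. The references come from the search step alone."""
    sources = search(query, corpus)
    return generate(query, sources), [doc.ref for doc in sources]


# Toy corpus, invented purely for illustration.
CORPUS = [
    Document("doc-1", "tenant must give notice three months before moving out"),
    Document("doc-2", "employer must give written notice of termination"),
    Document("doc-3", "vehicle registration must be renewed yearly"),
]

answer, refs = rag_answer("notice of termination", CORPUS)
print(refs)
print(answer)
```

In this sketch, the list of references that makes the result checkable is produced entirely by `search`; the generative step only rephrases what was found.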
So, after staring the three ghosts in the eye, some things are clear.
The conversations above are not unique events - I have had them many times, in different variations. Concise variants just happened to line up this winter.
I started a business with the intent to enable many companies to enjoy what the data leaders have, to go beyond data horses and carts to industrialised data processing. We have spent years trying to explain the data & AI divide - how much more companies could do with data and AI, the value it could bring, and how to make progress on that path. The ghosts told me that it is futile.
If we look at general software engineering, it took a couple of decades to go from the horse-and-cart age dominated by 4GL, low-code tools for democratised programming, Visual Basic, manual deployments, etc., to today's industrialised software engineering, with DevOps, continuous deployment, infrastructure as code, containers, GitOps, and other techniques that enable us to build automated processes, climb in abstraction layers, and spend less time on operational toil. In data engineering, we are currently at the peak of the 4GL phase, where "Is it easy to use?" trumps "What can we achieve?" We know what the industrial future looks like for data engineering, because the leaders are already there, but the future is so unevenly distributed that most teams don't know that it exists. Imagine travelling back 20 years in time to try to explain that developers should deploy to production multiple times per day using continuous deployment pipelines with container orchestration.
10 years from now, there probably will be no need to explain the difference between data warehousing and industrialised data engineering, but right now, making clients see it and want to make progress is a necessary cornerstone for our company. The stone is loose, and the prospects of a sustainable business with this particular mission are therefore slim. So we are considering our options. I'll make another post about it, but to be brief for now we are open to any kind of suggestion.
We have managed to reach data engineering productivity KPIs normally reserved for the technical giants and achieved > 10x speed and cost efficiency compared to other data efforts we encounter. It has been really valuable in a few contexts with the necessary prerequisites in place, but they are too far apart, and we are now looking to put our capabilities to better use than we have done in the last few years. All kinds of options are on the table.
My conference presentations have been a primary marketing channel. They have served well in the past, but no longer do. Speaking takes a lot of energy for me, and if it doesn't give return on investment, I will ramp it down. I might do occasional presentations that seem particularly fun, but not on a regular basis for the foreseeable future.
Sr. Data Engineer @ Urban SDK | Writing to help Jr. software engineers improve their skills and quality of life
1 month ago
Lars, I can only imagine your frustration. Especially with you being an engineer that has helped technical giants reach the highest levels of data productivity, and with you knowing how to build solutions to help companies reach those levels of productivity. It is sad to hear that you won't be doing presentations, but I understand. I first came across your presentations in 2020. I was a web developer at the time and didn't know what data engineering was. Learning from your presentations motivated and enabled me to transition into data engineering in 2021. I have continued to study your presentations over the years. Your presentations have helped me grow as a data engineer, and I reference them frequently for designing the projects I work on. Thank you for the valuable wisdom you have shared about building industrialized data factories. I firmly believe that eventually, the majority of companies will clearly understand the necessity and value of industrialized data engineering, although it may be 10+ years from now, as you stated. Unfortunately, many companies will first go through much pain and suffering from trying to jump into AI without establishing a solid data engineering foundation. I wish you continued success!
Helping to save one million lives with Laerdal Medical
2 months ago
"Since competence in turning data and AI into sustainable value is scarce, many buy the pixie dust." - I'm stealing that quote, Lars! Thank you :)
CEO & Head Chef, DataKitchen: observe & automate every Data Journey so that data teams find problems fast and fix them forever! Author: DataOps Cookbook, DataOps Manifesto. Open Source Data Quality & Observability!
2 months ago
Lars, I empathize with your feelings 100%. You and I have similar views on the data engineering process and how Agile/Lean/DataOps is the solution. Yet the current dialog is more tech, more suffering, more failure. I'd love to connect live and talk more!
Co-founder at Svipe
2 months ago
I empathize with you. The challenge with a value prop (any value prop) is that you need to find a customer for which the timing and organizational maturity match. There will also always be the ghost of existing investments. Now, for the companies where there is no match you may (unfortunately) find a lot of interest, but it will never progress to an engagement. So, red herrings that will consume a lot of your energy. You may still achieve an impact with these audiences (which is of course nice), but will see no revenue upside on your end. So, what to do? Some options are: 1) continue your existing business, but improve/change your marketing and sales 2) productize your offering so that it becomes more easily consumable 3) create a new product where your knowledge is embedded and where your approach makes a 10x difference. Good luck!
Co-Founder & VPoE @ Valo | Builder and doer: Infrastructure, applied AI, Data, Databases, SaaS, GCP
2 months ago
Thank you for the great article and insights. Not only in this post but in general. We are building a new company that is data heavy. I believe that we are on a good track, but I am afraid there is a lot I am missing for setting up the data foundations in the right way. Can you point me to resources that would be helpful for a context where the data is not yet that big, people are still open to new things, and the company structure is not yet rigid? There are signs of possible hyper growth in the future. How do I set our data journey on the right track?