Three conversations about data

Lars Albertsson

Founder at Scling

Published: January 7, 2025

Like Scrooge, I experienced three conversations before Christmas that left a profound impact, along with the conclusion that I should wind down public speaking, and that we need to reconsider our company.

I happened to have conversations about data and AI from different perspectives with employees at three different large enterprise companies. The first company A is one of Sweden's old industrial crown jewels, one of the Wallenberg assets or a similar company. The conversation went something like this:

A: You say you are doing something different than data warehousing, but it all seems similar to me. I have watched your presentations, but I don't see the difference.

Me: <Explains for 30 minutes, conjuring slides from my pile, which has grown sizable over the years.>

A: Ok, I accept that big data, data factories, immutable pipelines or whatever it is called is a different processing paradigm than warehousing. But I still don't see how it could contribute to the orders of magnitude in efficiency that you mention.

Me: …

Every issue of the annual State of DevOps report has reported a 100-1000x span in productivity and quality KPIs between the top quartile and the followers. The report has now been published for a decade, so it is no secret that productivity in software engineering differs by orders of magnitude. If old industry enterprises are not the followers, who are?

In 2016, Google revealed that they produced 1.6 billion datasets / day. The other data leaders (Meta, Netflix, Spotify, …) are not quite there, but we can deduce that the datasets / day are counted in the millions, which means thousands per developer. In a typical enterprise Snowflake or Databricks installation, that number is more like 100-1000 in total or 1-10 per developer. Those numbers cannot just be cranked up by 1000x at will. "We will increase from 10 lines of SQL per developer per day to 10,000 lines of SQL." is simply not how things work. It might be hard to imagine what impact those 1.6B datasets have, but one might think that it at least would be obvious that something is radically different and that it translates to business value.
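As a rough illustration of the arithmetic above, the per-developer figures can be compared directly. This is only a sketch: "thousands per developer" is read here as a lower bound of 1,000, which is an assumption and not a measured number, and the function name is hypothetical.

```python
# Figures quoted in the text. LEADER_PER_DEV is an assumed lower bound
# for "thousands per developer", not a measured value.
LEADER_PER_DEV = 1_000
ENTERPRISE_PER_DEV = (1, 10)  # "1-10 per developer" in a typical enterprise


def productivity_gap(leader_per_dev, enterprise_range):
    """Return the (smallest, largest) ratio between leader and enterprise output."""
    low, high = enterprise_range
    # The smallest gap is against the most productive enterprise developer,
    # the largest gap against the least productive one.
    return leader_per_dev // high, leader_per_dev // low


gap_low, gap_high = productivity_gap(LEADER_PER_DEV, ENTERPRISE_PER_DEV)
print(f"Datasets per developer per day: leaders are ahead by {gap_low}x to {gap_high}x or more")
```

Under that assumption, the gap lands in the same 100-1000x range that the State of DevOps report describes for software delivery in general.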

In the 80s, the American car manufacturers claimed that quality surveys indicating that Japanese cars had superior build quality were falsified. Their reactions were similar to the denials that we see among Swedish incumbents when software and data productivity KPIs are mentioned. In a way, it is human and understandable - if someone else is doing a particular job 100x more efficiently, what does that say about me, or my team, my coworkers, my suppliers, or some concept that I bet my career on?

When deciding the path ahead, companies in denial usually end up asking not for the best choices, but for safe choices and the most popular components. The ghost of data past ties them to data methods from the 90s, using horses and carts in a world where leaders are motorised. The horses are shiny, new, and faster than before. In the case of the lakehouse, Databricks's genius sales invention, it is a steel horse that is motorised but still trots like a warehouse, so it won't move too fast and the old skill sets still work.

A few days later I had a conference lunch and spoke to a manager from company B, also an old Swedish crown jewel.

B: I see great potential in generative AI for internal analytics, since it can enable business people to query their data warehouse without needing to ask more technical people.

Me: I posted this: https://www.dhirubhai.net/posts/larsalbertsson_i-just-heard-this-wdyt-please-discuss-activity-7170412249428299776-NMM4. The reactions to the described system have ranged from "Brilliant, I want one of these" to "What? Why do the enterprises keep solving organisational problems with expensive technology?". In companies with cross-functional teams that include both business and technical people, teams are already enabled to get any data they need. Aren't you applying an expensive patch for a symptom instead of solving the real problem by rearranging people?

B: I know, some companies organise like that and it is more effective, but the separation of business and tech is the reality that we have. It can never change, and we have to live with it.

Me: But your US Pacific competitor X recruits from Silicon Valley, with a culture of close business and tech collaboration, and has solved this problem for real.

B: Yes, I know, but they have so many other problems that we can compete with them anyway.

So, even if the solution is known, the ghost of data present is so strong in holding them right where they are that making real progress is unthinkable. I do agree that generative AI for analytics does have merits in big corp settings and can be part of a real solution, but that's beside the point. What stood out to me is the relation to progress and the acceptance of a state of corporate mediocrity in an important capability.

Company X is growing faster than B, has 3x the revenue, and 20x market cap.

Next, a hardware vendor C spoke enthusiastically about the potential that generative AI brings. Traditional hardware vendors have not been favoured by all the migrations to cloud, but now they rejoice at the green field of opportunities, as long as they can convince customers that they need GPUs.

C: It's a Klondike out there! We are currently doing workshops with a legal sector government agency that wants a RAG solution.

Me: But is generative AI appropriate for legal use?

C: As long as references are included, it should be ok.

Me: A RAG includes a search engine. Are you saying that a traditional search engine is the part that is actually useful, and not the generative AI?

C: Pretty much, but everyone wants generative AI.

The ghost of data yet to come is so strong that the incumbents believe that they need and can utilise racks of GPUs. Since competence in turning data and AI into sustainable value is scarce, many buy the pixie dust. When it's rich companies, I don't mind so much, but when it is our taxpayer money, it hurts inside. Each million wasted in schools, healthcare, or the legal system causes real harm. As in the Klondike, the only groups that collectively earn a net profit are vendors - shovel salesmen, saloon owners, and server manufacturers. There is real value in generative AI, but use cases and return on investment are rarer than the hype indicates, and building reliable products requires systematic application-specific data engineering, which is out of reach for many buyers due to scarcity of data engineering skills.

So, after staring the three ghosts in the eye, some things are clear.

Companies where most data and AI potential remains untouched and that are far behind the leaders typically do not believe that things can be radically different.
Many companies that do know that things can be different still prefer not to improve, due to cost, fear, friction, or some other strong factor keeping the status quo.
Since generative AI is the first form of AI accessible to non-techies, it has opened their eyes to the potential of AI, but the hype stirs so much confusion that reasoning about value creation is increasingly difficult.

The conversations above are not unique events - I have had them many times, in different variations. Concise variants just happened to line up this winter.

I started a business with the intent to enable many companies to enjoy what the data leaders have, to go beyond data horses and carts to industrialised data processing. We have spent years trying to explain the data & AI divide - how much more companies could do with data and AI, the value it could bring, and how to make progress on that path. The ghosts told me that it is futile.

If we look at general software engineering, it took a couple of decades to go from the horse-and-cart age dominated by 4GL, low-code tools for democratised programming, Visual Basic, manual deployments, etc., to today's industrialised software engineering, with DevOps, continuous deployment, infrastructure as code, containers, GitOps, and other techniques that enable us to build automated processes, climb in abstraction layers, and spend less time on operational toil. In data engineering, we are currently at the peak of the 4GL phase, where "Is it easy to use?" trumps "What can we achieve?" We know what the industrial future looks like for data engineering, because the leaders are already there, but the future is so unevenly distributed that most teams don't know that it exists. Imagine travelling back 20 years in time to try to explain that developers should deploy to production multiple times per day using continuous deployment pipelines with container orchestration.

Ten years from now, there will probably be no need to explain the difference between data warehousing and industrialised data engineering, but right now, making clients see it and want to make progress is a necessary cornerstone for our company. The stone is loose, and the prospects of a sustainable business with this particular mission are therefore slim. So we are considering our options. I'll make another post about it, but to be brief for now: we are open to any kind of suggestion.

We have managed to reach data engineering productivity KPIs normally reserved for the technical giants and achieved > 10x speed and cost efficiency compared to other data efforts we encounter. It has been really valuable in a few contexts with the necessary prerequisites in place, but they are too far apart, and we are now looking to put our capabilities to better use than we have done in the last few years. All kinds of options are on the table.

My conference presentations have been a primary marketing channel. They have served us well in the past, but no longer do. Speaking takes a lot of energy for me, and if it doesn't give a return on investment, I will ramp it down. I might do occasional presentations that seem particularly fun, but not on a regular basis for the foreseeable future.

Daniel Nilsson-Cole

Sr. Data Engineer @ Urban SDK | Writing to help Jr. software engineers improve their skills and quality of life

1 month ago

Lars, I can only imagine your frustration. Especially with you being an engineer that has helped technical giants reach the highest levels of data productivity, and with you knowing how to build solutions to help companies reach those levels of productivity. It is sad to hear that you won't be doing presentations, but I understand. I first came across your presentations in 2020. I was a web developer at the time and didn't know what data engineering was. Learning from your presentations motivated and enabled me to transition into data engineering in 2021. I have continued to study your presentations over the years. Your presentations have helped me grow as a data engineer, and I reference them frequently for designing the projects I work on. Thank you for the valuable wisdom you have shared about building industrialized data factories. I firmly believe that eventually, the majority of companies will clearly understand the necessity and value of industrialized data engineering, although it may be 10+ years from now, as you stated. Unfortunately, many companies will first go through much pain and suffering from trying to jump into AI without establishing a solid data engineering foundation. I wish you continued success!

Rolv Seehuus

Helping to save one million lives with Laerdal Medical

2 months ago

"Since competence in turning data and AI into sustainable value is scarce, many buy the pixie dust." - I'm stealing that quote, Lars! Thank you :)

Christopher Bergh

CEO & Head Chef, DataKitchen: observe & automate every Data Journey so that data teams find problems fast and fix them forever! Author: DataOps Cookbook, DataOps Manifesto. Open Source Data Quality & Observability!

2 months ago

Lars. I empathize with your feelings 100%. You and I have similar views on the data engineer process and how Agile/Lean/DataOps is the solution. Yet the current dialog is more tech, more suffering, more failure. I'd love to connect live and talk more!

Stefan Farestam

Co-founder at Svipe

2 months ago

I empathize with you. The challenge with a value prop (any value prop) is that you need to find a customer for which the timing and organizational maturity matches. There will also always be the ghost of existing investments. Now, for the companies where there is no match you may (unfortunately) find a lot of interest, but it will never progress to an engagement. So, red herrings that will consume a lot of your energy. You may still achieve an impact with these audiences (which is of course nice), but will see no revenue upside on your end. So, what to do? Some options are: 1) continue your existing business, but improve/change your marketing and sales 2) productize your offering so that it becomes more easily consumable 3) create a new product where your knowledge is embedded and where your approach makes a 10x difference. Good luck!

Hannu Varjoranta

Co-Founder & VPoE @ Valo | Builder and doer: Infrastructure, applied AI, Data, Databases, SaaS, GCP

2 months ago

Thank you for the great article and insights, not only in this post but in general. We are building a new company that is data heavy. I believe that we are on a good track, but I am afraid there is a lot I am missing for setting up the data foundations in the right way. Can you point me to resources that would be helpful in a context where the data is not yet that big, people are still open to new ideas, and the company structure is not yet rigid? There are signs of possible hyper growth in the future. How do I set our data journey on the right track?
