Aevum Data Digitalis. Part 1 - operations, facts, and dimensions
Ivan Trusov
Enabling teams to build efficient Data Intelligence Platforms with Databricks. All views and opinions are my own.
Working in the data industry over the past several years has been a wild ride with all the new technologies, approaches, and concepts constantly emerging.
In this series of blog posts, I'd like to highlight some of the interesting turns in the history of data architectures as I've seen them evolve.
We’ll start back in the 2000s and early 2010s. At that time, people were already talking about “big data,” but it was mostly a way to describe the characteristics of incoming data flows. The 3Vs—volume, velocity, and variety—showed up in presentations, but not many companies fully grasped the importance of having a solid data strategy. Most, including the ones I worked for, were still DWH-driven, relying on weekly or monthly ETL processes with occasional analytical projects on the side.
Meanwhile, tech giants were already extremely advanced in processing huge datasets and building business-critical applications based on data and machine learning. Enterprise companies were just getting started. Back then, we didn’t call it "big data" or "ML" in most companies. Instead, it was referred to as "advanced analytics" or "statistical modeling."
The main data products in enterprises at that time were built for use cases like churn prediction, probability-of-default scoring, and BI reporting.
The typical data architecture of a mid-sized enterprise company back then looked like this:
A large monolithic application, typically written in Java or C#, often had numerous plugins, servlets, and moving parts. It was usually connected to another monolith - an on-premise OLTP database, which served as the primary source of information. To reduce latencies, this database was sometimes sharded across regions.
The app and the underlying DB had extremely strict SLAs and high-availability requirements, making it unthinkable to connect to them directly for analytical queries or to perform any operations on hot data. To meet analytical and reporting needs, a read-only replica of the OLTP database was typically established. A dedicated ETL tool would then connect to this replica and process its tables into a DWH model designed long in advance, using incremental loads and heavy aggregation (often at monthly granularity).
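To make that pattern concrete, here is a minimal sketch of the kind of incremental load such a tool performed. It is illustrative only: the table and column names are made up, and an in-memory SQLite database stands in for both the read replica and the DWH. The job picks up only the rows added since the previous run, rolls them up to monthly grain, and upserts the result into an aggregated mart table.

```python
import sqlite3

# In-memory database standing in for the replica (transactions)
# and the DWH mart (mart_monthly_revenue). Names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (
    txn_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    amount REAL,
    txn_date TEXT
);
CREATE TABLE mart_monthly_revenue (
    customer_id INTEGER,
    month TEXT,
    revenue REAL,
    PRIMARY KEY (customer_id, month)
);
""")

conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [(1, 101, 25.0, "2009-03-02"),
     (2, 101, 40.0, "2009-03-15"),
     (3, 202, 10.0, "2009-04-01")],
)

def incremental_load(last_loaded_date: str) -> None:
    # Pick up only rows newer than the previous run, aggregate them to
    # monthly grain, and upsert into the mart.
    # Note: ON CONFLICT ... DO UPDATE needs SQLite >= 3.24.
    conn.execute(
        """
        INSERT INTO mart_monthly_revenue (customer_id, month, revenue)
        SELECT customer_id, substr(txn_date, 1, 7) AS month, SUM(amount)
        FROM transactions
        WHERE txn_date > ?
        GROUP BY customer_id, month
        ON CONFLICT (customer_id, month) DO UPDATE
            SET revenue = revenue + excluded.revenue
        """,
        (last_loaded_date,),
    )
    conn.commit()

incremental_load(last_loaded_date="2009-01-01")
print(conn.execute(
    "SELECT * FROM mart_monthly_revenue ORDER BY month").fetchall())
# -> [(101, '2009-03', 65.0), (202, '2009-04', 10.0)]
```

In a real setup of that era, the "last loaded" watermark would come from a metadata table maintained by the ETL tool, and the target would be a fact table in the warehouse rather than SQLite, but the incremental-load-plus-aggregation shape was the same.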
Back then, it was not unusual to get an email from the DWH admin saying something like "Hey, I've just stopped your query because it's overloading the database, and the ELT procedures are struggling to keep up with their SLAs."
Due to the high cost of storage in such DWH systems, retaining detailed data over long periods was not feasible. Instead, aggregated data marts were created and updated incrementally. These marts were then linked to BI tools and advanced analytics systems.
Looking back now, it’s easy to see the downsides of this approach.
Regular re-training of machine learning models was also still pretty new. I remember churn models being rebuilt every three months, with a six-month prediction window, because the data marts with heavy aggregations only updated once a month.
That worked for applications like probability-of-default models, where a one-year prediction window meant data freshness wasn’t a big issue.
However, more advanced use cases, like weekly customer lifetime value estimation, were considered too costly or complicated to implement. The hardware and effort needed to speed up ETL processes made these scenarios hard to justify in terms of cost and benefits. The potential business value hidden in the data was yet to be discovered.
There was still a long way to go before data applications reached the level we see today.
In the next part, I’ll talk about how cloud scalability, microservices, and event-driven systems became key players in the Big Data era, and how they changed data architectures.
Stay tuned, and don’t forget to subscribe for more!