Aevum Data Digitalis. Part 1 - operations, facts, and dimensions
Ivan Trusov
Enabling teams to build efficient Data Intelligence Platforms with Databricks. All views and opinions are my own.
Working in the data industry over the past several years has been a wild ride with all the new technologies, approaches, and concepts constantly emerging.
In this series of blog posts, I'd like to highlight some of the interesting turns in the history of data architectures as I've seen them evolve.
We’ll start back in the 2000s and early 2010s. At that time, people were already talking about “big data,” but it was mostly a way to describe the characteristics of incoming data flows. The 3Vs—volume, velocity, and variety—showed up in presentations, but not many companies fully grasped the importance of having a solid data strategy. Most, including the ones I worked for, were still DWH-driven, relying on weekly or monthly ETL processes with occasional analytical projects on the side.
Meanwhile, tech giants were already extremely advanced in processing huge datasets and building business-critical applications based on data and machine learning. Enterprise companies were just getting started. Back then, we didn’t call it "big data" or "ML" in most companies. Instead, it was referred to as "advanced analytics" or "statistical modeling."
The main data products in enterprises at that time were built for use cases like churn prediction, probability-of-default scoring, and BI reporting.
The typical data architecture of a mid-sized enterprise company back then looked like this:
A large monolithic application, typically written in Java or C#, often had numerous plugins, servlets, and moving parts. It was usually connected to another monolith - an on-premise OLTP database, which served as the primary source of information. To reduce latencies, this database was sometimes sharded across regions.
The app and the underlying DB had extremely strict SLAs and high-availability requirements, making it unthinkable to connect to them directly for analytical queries or to perform any operations on hot data. To meet analytical and reporting needs, a read-only replica of the OLTP database was typically established. A dedicated ETL tool would then connect to this replica and process its tables into a DWH model designed long in advance, using incremental loads and heavy aggregation (often at monthly granularity).
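To make that pattern concrete, here is a minimal sketch of the kind of incremental load such a tool performed. It is illustrative only: the table and column names are made up, and an in-memory SQLite database stands in for both the read replica and the DWH. The job picks up only the rows added since the previous run, rolls them up to monthly grain, and upserts the result into an aggregated mart table.

```python
import sqlite3

# In-memory database standing in for the replica (transactions)
# and the DWH mart (mart_monthly_revenue). Names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (
    txn_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    amount REAL,
    txn_date TEXT
);
CREATE TABLE mart_monthly_revenue (
    customer_id INTEGER,
    month TEXT,
    revenue REAL,
    PRIMARY KEY (customer_id, month)
);
""")

conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [(1, 101, 25.0, "2009-03-02"),
     (2, 101, 40.0, "2009-03-15"),
     (3, 202, 10.0, "2009-04-01")],
)

def incremental_load(last_loaded_date: str) -> None:
    # Pick up only rows newer than the previous run, aggregate them to
    # monthly grain, and upsert into the mart.
    # Note: ON CONFLICT ... DO UPDATE needs SQLite >= 3.24.
    conn.execute(
        """
        INSERT INTO mart_monthly_revenue (customer_id, month, revenue)
        SELECT customer_id, substr(txn_date, 1, 7) AS month, SUM(amount)
        FROM transactions
        WHERE txn_date > ?
        GROUP BY customer_id, month
        ON CONFLICT (customer_id, month) DO UPDATE
            SET revenue = revenue + excluded.revenue
        """,
        (last_loaded_date,),
    )
    conn.commit()

incremental_load(last_loaded_date="2009-01-01")
print(conn.execute(
    "SELECT * FROM mart_monthly_revenue ORDER BY month").fetchall())
# -> [(101, '2009-03', 65.0), (202, '2009-04', 10.0)]
```

In a real setup of that era, the "last loaded" watermark would come from a metadata table maintained by the ETL tool, and the target would be a fact table in the warehouse rather than SQLite, but the incremental-load-plus-aggregation shape was the same.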
Back then, it was not unusual to get an email from the DWH admin saying something like "Hey, I've just stopped your query because it's overloading the database, and the ELT procedures are struggling to keep up with their SLAs."
Due to the high cost of storage in such DWH systems, retaining detailed data over long periods was not feasible. Instead, aggregated data marts were created and updated incrementally. These marts were then linked to BI tools and advanced analytics systems.
Looking back now, it’s easy to see the downsides of this approach.
Regular re-training of machine learning models was also still pretty new. I remember churn models being rebuilt every three months, with a six-month prediction window, because the data marts with heavy aggregations only updated once a month.
That worked for applications like probability-of-default models, where a one-year prediction window meant data freshness wasn’t a big issue.
However, more advanced use cases, like weekly customer lifetime value estimation, were considered too costly or complicated to implement. The hardware and effort needed to speed up ETL processes made these scenarios hard to justify in terms of cost and benefits. The potential business value hidden in the data was yet to be discovered.
There was still a long way to go before data applications reached the level we see today.
In the next part, I’ll talk about how cloud scalability, microservices, and event-driven systems became key players in the Big Data era, and how they changed data architectures.
Stay tuned, and don’t forget to subscribe for more!