Modern Data Stack – Unlocking the Analytics and Data Science Constraints

With Fabiane Bizinella Nardon and Lucas Demitroff Brandi

These meeting minutes are extracted from Data On Air Podcast Episode 24 and reflect the participants' thoughts and professional experience on the topics discussed.

The Modern Data Stack concept is built on a cloud database, a modern data ingestion process, and self-service analytics. These converge to load and transform data in at least three layers, simplifying and guaranteeing reliable access for Analytics, Business Intelligence (BI), and Data Science applications. It also applies a reverse ETL, making data actionable for everyone in the organization. Contrary to common sense, "simplification" is more important than "modern" in a Data Stack.

The main difference between the Traditional and the Modern Data Stack is the ability to simplify the integration between different technologies. In a Modern approach, it is common to use a dedicated tool to achieve a specific goal in each piece of the data stack. For example, you can combine a cloud database like Google BigQuery or Snowflake with dbt to transform the data into different layers, and you can have various integration tools depending on the use case. It is essential to select the right tools to facilitate user access to each part of the data in the stack. In the Traditional approach, we usually rely on a single platform to respond to all business needs.

Regarding cloud databases, it is critical to differentiate them from simply moving an existing database to the cloud. Cloud databases are maintained by the cloud provider and are scalable and flexible enough to guarantee performance in almost any situation. You pay only for what you use, a model often called DBaaS (Database as a Service).

Because no single database will address all business needs, you will probably need more than one, and you will frequently need to work with other data sources like IoT and streaming data. A Data Catalog is crucial to keep your data mapped, organized, and accessible for Analytics applications, including BI tools and Data Science projects.
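To make the Data Catalog idea concrete, here is a minimal sketch of a catalog as a simple registry keyed by dataset name. The dataset names, sources, and fields are illustrative assumptions, not from the episode; real catalogs add lineage, owners, and tags on top of this.

```python
# A toy data catalog: each entry records where a dataset lives,
# which layer it belongs to, and what columns it exposes.
catalog = {}

def register(name, source, layer, columns):
    """Record where a dataset lives and what it contains."""
    catalog[name] = {"source": source, "layer": layer, "columns": columns}

# Hypothetical datasets spread across two storage systems.
register("orders_raw", "postgres://sales", "bronze", ["id", "amount", "order_date"])
register("daily_revenue", "warehouse", "gold", ["order_date", "revenue"])

def find_by_layer(layer):
    """Let analysts discover every dataset published in a given layer."""
    return [name for name, meta in catalog.items() if meta["layer"] == layer]

print(find_by_layer("gold"))  # ['daily_revenue']
```

Even this small registry shows the point of the paragraph above: with the catalog as the single index, a BI tool or a Data Science project can find data without knowing which database actually stores it.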

Data Integration is another critical process: it extracts data using the Data Catalog, then transforms and loads it into at least three layers: bronze, raw data, or Extract and Load (EL); silver, lightly transformed data, or Extract, Load, and Transform (ELT); and finally gold, heavily transformed data, or Extract, Transform, and Load (ETL). Having appropriate tools to control the ingestion and curation process, using timestamps and other resources, enables Data Lineage, which is fundamental to implementing Data Governance. In a Traditional approach, it is easy to find hundreds of hand-written ETL jobs running in parallel that usually take hours to finish. This lack of documentation and control over the integration process leads to poor Data Governance.
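The bronze/silver/gold layering can be sketched with plain Python structures standing in for warehouse tables. The table name, fields, and values are illustrative assumptions; in practice each step would be a SQL model in a tool like dbt rather than Python code.

```python
from datetime import date

# Bronze: data exactly as extracted and loaded (EL), untouched.
bronze_orders = [
    {"id": "1", "amount": "120.50", "order_date": "2023-04-01"},
    {"id": "2", "amount": "80.00",  "order_date": "2023-04-01"},
    {"id": "3", "amount": "bad",    "order_date": "2023-04-02"},  # raw data may be dirty
]

def to_silver(rows):
    """Silver: lightly transformed (ELT) -- typed fields, invalid rows dropped."""
    silver = []
    for row in rows:
        try:
            silver.append({
                "id": int(row["id"]),
                "amount": float(row["amount"]),
                "order_date": date.fromisoformat(row["order_date"]),
            })
        except ValueError:
            continue  # discard rows that fail basic typing
    return silver

def to_gold(rows):
    """Gold: business-ready aggregates -- here, total revenue per day."""
    revenue = {}
    for row in rows:
        revenue[row["order_date"]] = revenue.get(row["order_date"], 0.0) + row["amount"]
    return revenue

silver_orders = to_silver(bronze_orders)       # 2 valid rows survive
gold_daily_revenue = to_gold(silver_orders)    # {date(2023, 4, 1): 200.5}
```

Keeping the raw bronze copy means a bug in the silver or gold logic can always be fixed by re-running the transformation, which is exactly the reliability the layered approach is meant to guarantee.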

A cloud database storing all critical data in different layers, combined with integration processes that make the data available promptly, enables Self-Service Analytics. Note that a self-service BI tool alone will not empower business decisions; you need a Modern Data Stack behind it to accelerate the business process. A Modern Data Stack makes data available whenever it is requested, and the Data Catalog is responsible for integrating all data layers, permitting businesses to unlock Analytics and Data Science.

In a Data Science project, some tools will connect straight to the data wherever it is stored. More than 70% of the time spent on Data Science projects is closely related to data curation and integration. Storing data in layers speeds up overall project development and avoids the cycle of moving data out of the database, putting it into a file, writing ad hoc code, and only then making it reachable for data scientists.

Most Analytics and Data Science projects can potentially implement Reverse ETL. Reverse ETL sends analyzed, actionable data back to transactional applications to improve decision making, fix issues, and prevent fraud.
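The Reverse ETL flow can be sketched as follows, with an in-memory dict standing in for the transactional system and a fraud-risk score as the analytic result being pushed back. The customer IDs, score values, and threshold are illustrative assumptions, not from the episode.

```python
# Analytic layer: scores computed over warehouse data (values are made up).
risk_scores = {"cust-001": 0.92, "cust-002": 0.10}

# Stand-in for the transactional system's customer records.
transactional_db = {
    "cust-001": {"name": "Alice", "flagged": False},
    "cust-002": {"name": "Bob", "flagged": False},
}

def reverse_etl(scores, db, threshold=0.8):
    """Write analytic scores back into the transactional records."""
    updated = []
    for customer_id, score in scores.items():
        record = db.get(customer_id)
        if record is None:
            continue  # score refers to a customer the app does not know
        record["risk_score"] = score
        record["flagged"] = score >= threshold  # actionable inside the app itself
        updated.append(customer_id)
    return updated

reverse_etl(risk_scores, transactional_db)
print(transactional_db["cust-001"]["flagged"])  # True: the app can now block or review
```

The key design point is the direction of the flow: instead of analysts pulling reports out of the warehouse, the warehouse writes its conclusions back to where operational decisions are made.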

It is generally easier for startups and new companies to implement a Modern Data Stack, but that does not exclude companies with Traditional Data Stack implementations from moving to a Modern approach. One of the main characteristics of a Modern Data Stack is its progressive adoption. A huge company with many structured processes will not get rid of them from one day to the next. The idea is to focus on specific business cases where it makes sense to use new tools and integrate them into the Data Stack. Moving one use case at a time lets the company build the maturity to implement the concept step by step.

Leave your comment or suggestion below. See you in our next Data On Air Episode - Meeting Minutes. Thanks for reading.
