Data Engineering

These days, "data engineering" is a hot topic. The term is often used to distinguish the work of database engineers, who build and manage data pipelines, from the work of data analysts and data scientists, who analyze the data for useful insights.

The growing focus on data engineering stems from the realization that many machine learning projects fall short because of data quality issues: the data isn't clean, standardized, or usable.

With the processing power of the cloud and the ability of a data lake to consolidate all corporate data in one location, the traditional data movement pipeline has shifted from extract-transform-load (ETL) to extract-load-transform (ELT). In ELT, raw data is loaded into the target platform first and transformed there afterwards.

In ELT, the work is done in the cloud (Azure, AWS, GCP, etc.), and all the required tools are available as services.
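To make the ETL versus ELT distinction concrete, here is a minimal sketch of the ELT order of operations. It is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for a cloud data lake or warehouse, and the sample records, table names (raw_orders, clean_orders), and function names are hypothetical. The point it tries to show is that the raw data lands untouched first, and the cleaning happens afterwards inside the target system.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from a source system.
RAW_ORDERS = [
    ("1001", " Alice ", "19.99"),
    ("1002", "BOB", "5.00"),
    ("1003", "carol", None),  # missing amount: a data quality problem
]


def extract():
    """Extract: return the source records unchanged."""
    return list(RAW_ORDERS)


def load(conn, rows):
    """Load: store the raw rows as-is, with no cleaning applied yet."""
    conn.execute(
        "CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)


def transform(conn):
    """Transform: clean and standardize inside the target system with SQL."""
    conn.execute(
        """
        CREATE TABLE clean_orders AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               LOWER(TRIM(customer))     AS customer,
               CAST(amount AS REAL)      AS amount
        FROM raw_orders
        WHERE amount IS NOT NULL
        """
    )


def run_elt():
    # sqlite3 in memory is an assumption standing in for a cloud warehouse.
    conn = sqlite3.connect(":memory:")
    load(conn, extract())   # E then L: raw data lands first
    transform(conn)         # T: runs where the data already lives
    return conn.execute(
        "SELECT order_id, customer, amount FROM clean_orders ORDER BY order_id"
    ).fetchall()


if __name__ == "__main__":
    for row in run_elt():
        print(row)
```

In a classic ETL pipeline, the cleaning in transform() would run before load(), on a separate processing server. In ELT the raw table is kept, so the transformation can be rerun or changed later without extracting from the source again.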

In this context, Databricks has become one of the most popular data engineering tools. It hides much of the complex work required in a data engineering workflow. If you want to learn and practice for free, you can use the Databricks Community Edition.

Below is a simple diagram highlighting the functions and capabilities of Databricks on Azure.

Picture credit: Ramesh Retnasamy


