
DataCamp - Data Engineering with Python

Filipe Balseiro

Data Engineer | Snowflake SnowPro Core & dbt Developer Certified | Python | GCP BigQuery | CI/CD Github Actions. Let's elevate your data strategy!

Published: April 24, 2022


Data Engineers

Data engineers deliver:

The correct data
In the right form
To the right people
As efficiently as possible

Responsibilities

Ingest data from different sources
Optimize databases for analysis
Remove corrupted data
Develop, construct, test and maintain data architectures

Big Data

You have to think about how to deal with its size
So large that traditional methods no longer work

The five Vs

Volume (how much?)
Variety (what kind?)
Velocity (how frequent?)
Veracity (how accurate?)
Value (how useful?)

Data Engineers vs Data Scientists

Data Pipeline

Automate the flow from one station to the next
Provide up-to-date, accurate, relevant data

Data pipelines ensure an efficient flow of the data

ETL

Popular framework for designing data pipelines
Extract data
Transform extracted data
Load transformed data to another database
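
The three ETL steps above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the CSV content, the column names and the `tracks` table are all made up for the example, and an in-memory SQLite database stands in for the target database.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """track,artist,duration_ms
Song A,Artist 1,215000
Song B,Artist 2,
Song C,Artist 1,187500
"""


def extract(raw_text):
    """Extract: parse the raw CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: drop rows with a missing duration, convert ms to seconds."""
    cleaned = []
    for row in rows:
        if not row["duration_ms"]:
            continue  # skip incomplete records
        cleaned.append(
            (row["track"], row["artist"], int(row["duration_ms"]) / 1000)
        )
    return cleaned


def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tracks "
        "(track TEXT, artist TEXT, duration_s REAL)"
    )
    conn.executemany("INSERT INTO tracks VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the target database
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT track, duration_s FROM tracks ORDER BY track"
).fetchall()
print(loaded)
```

Keeping each step in its own function makes it easy to test the steps separately or to skip the transform step for pipelines that load data as-is.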

Data Pipelines

Move data from one system to another
May follow ETL
Data may not be transformed
Data may be directly loaded in applications

Data Structures

Structured data

Easy to search and organize
Consistent model, rows and columns
Defined types
Can be grouped to form relations
Stored in relational databases
About 20% of the data is structured
Created and queried using SQL (Relational databases)
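
As a small illustration of a consistent model with defined columns, the sketch below creates a table in an in-memory SQLite database and shows that a row missing a required column is rejected. The `employees` table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Every row must fit the same model: three named columns with declared types.
conn.execute(
    """
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        office TEXT NOT NULL
    )
    """
)
conn.execute(
    "INSERT INTO employees (id, name, office) VALUES (1, 'Ana', 'Lisbon')"
)

# A row that leaves out a required column violates the model and is rejected.
try:
    conn.execute("INSERT INTO employees (id, name) VALUES (2, 'Rui')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

rows = conn.execute("SELECT id, name, office FROM employees").fetchall()
print(rejected, rows)
```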

Semi-structured data

Relatively easy to search and organize
Consistent model, less-rigid implementation: different observations have different sizes
Different types
Can be grouped, but needs more work
Stored in NoSQL databases; common file formats are JSON, XML and YAML
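
To make "different observations have different sizes" concrete, here is a minimal sketch using JSON. The records and field names are invented for the example: both share a loose model, but one has an extra field and the lists differ in length, so flattening them into rows takes some extra work.

```python
import json

# Two observations following the same loose model, with different sizes.
RAW = """[
  {"user": "ana", "favorite_artists": ["Artist 1", "Artist 2"]},
  {"user": "rui", "favorite_artists": ["Artist 3"], "country": "PT"}
]"""

records = json.loads(RAW)

# Grouping needs more work than with structured data: missing fields have to
# be handled (here with .get) and nested lists have to be flattened.
rows = [
    (record["user"], artist, record.get("country"))
    for record in records
    for artist in record["favorite_artists"]
]
print(rows)
```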

Unstructured data

Does not follow a model, can't be contained in rows and columns
Difficult to search and organize
Usually text, sound, pictures or videos
Usually stored in data lakes, can appear in data warehouses or databases
Most of the data is unstructured
Can be extremely valuable

Using Spotify as an example, unstructured data consists of:

Lyrics
Songs
Album pictures
Artist profile pictures

Adding some structure

Use AI to search and organize unstructured data
Add information to make it semi-structured

SQL

Structured Query Language
Industry standard for Relational Database Management Systems (RDBMS)
Allows you to access many records at once, and group, filter or aggregate them
Close to written English, easy to write and understand
Data engineers use SQL to create and maintain databases
Data scientists use SQL to query (request information from) databases
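
A short sketch of filtering, grouping and aggregating many records in one SQL query, run through Python's built-in `sqlite3` module. The `plays` table and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plays (listener TEXT, artist TEXT, seconds INTEGER)")
conn.executemany(
    "INSERT INTO plays VALUES (?, ?, ?)",
    [
        ("ana", "Artist 1", 200),
        ("ana", "Artist 2", 50),
        ("ana", "Artist 2", 180),
        ("rui", "Artist 1", 300),
        ("rui", "Artist 1", 100),
    ],
)

# Filter (WHERE), group (GROUP BY) and aggregate (COUNT, SUM) in one statement.
query = """
    SELECT artist, COUNT(*) AS n_plays, SUM(seconds) AS total_seconds
    FROM plays
    WHERE seconds >= 60
    GROUP BY artist
    ORDER BY artist
"""
result = conn.execute(query).fetchall()
print(result)
```

The 50-second play is filtered out before grouping, so it counts toward neither the number of plays nor the total.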

Database schema

Databases are made of tables
The database schema governs how tables are related
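
As a sketch of how a schema relates tables, the example below defines two hypothetical tables, `artists` and `albums`, linked by a foreign key. It assumes SQLite, where foreign key enforcement has to be switched on per connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce them by default

# The schema declares the tables and how they relate: each album row must
# point at an existing artist row.
conn.executescript(
    """
    CREATE TABLE artists (
        artist_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE albums (
        album_id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        artist_id INTEGER NOT NULL REFERENCES artists(artist_id)
    );
    """
)
conn.execute("INSERT INTO artists VALUES (1, 'Artist 1')")
conn.execute("INSERT INTO albums VALUES (10, 'Album X', 1)")

# An album that references a non-existent artist breaks the relation.
try:
    conn.execute("INSERT INTO albums VALUES (11, 'Album Y', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# The relation is what makes it possible to join the two tables.
joined = conn.execute(
    "SELECT albums.title, artists.name "
    "FROM albums JOIN artists USING (artist_id)"
).fetchall()
print(orphan_rejected, joined)
```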

Data lake vs Data warehouse

Data catalog for data lakes

What is the source of this data?
Where is this data used?
Who is the owner of the data?
How often is this data updated?
Good practice in terms of data governance
Ensures reproducibility
No catalog leads to a data swamp
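
One way to picture a catalog entry is as a small record that answers the four questions above. The sketch below is purely illustrative: the `CatalogEntry` class, the dataset path and all field values are hypothetical, and real catalogs are usually dedicated tools, not a Python dictionary.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    source: str            # What is the source of this data?
    used_by: list[str]     # Where is this data used?
    owner: str             # Who is the owner of the data?
    update_frequency: str  # How often is this data updated?


# Hypothetical catalog keyed by the dataset's location in the data lake.
catalog = {
    "raw/listening_events": CatalogEntry(
        source="mobile app event stream",
        used_by=["weekly listening report"],
        owner="data-platform team",
        update_frequency="hourly",
    ),
}


def describe(path):
    """Return a one-line summary, or flag datasets missing from the catalog."""
    entry = catalog.get(path)
    if entry is None:
        return f"{path}: not catalogued"
    return (
        f"{path}: from {entry.source}, owned by {entry.owner}, "
        f"updated {entry.update_frequency}"
    )


print(describe("raw/listening_events"))
print(describe("tmp/unknown_dump"))
```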

Good practice for any data storage solution

Reliability
Autonomy
Scalability
Speed

Database vs. Data warehouse

Database:

General term
Loosely defined as organized data stored and accessed on a computer

Data warehouse is a type of database
