DataCamp - Data Engineering with Python

Data Engineers

Data engineers deliver:

  • The correct data
  • In the right form
  • To the right people
  • As efficiently as possible

Responsibilities

  • Ingest data from different sources
  • Optimize databases for analysis
  • Remove corrupted data
  • Develop, construct, test and maintain data architectures

Big Data

  • Its size has to be dealt with explicitly
  • So large that traditional methods no longer work

The five Vs

  • Volume (how much?)
  • Variety (what kind?)
  • Velocity (how frequent?)
  • Veracity (how accurate?)
  • Value (how useful?)

Data Engineers vs Data Scientists

[Image: comparison of data engineers and data scientists; not included]

Data Pipeline

  • Automate flow from one station to the next
  • Provide up-to-date, accurate, relevant data

Data pipelines ensure an efficient flow of data.

ETL

  • Popular framework for designing data pipelines
  • Extract data
  • Transform extracted data
  • Load transformed data to another database
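
The three ETL steps can be sketched in a few lines of Python. This is only an illustration, not the course's own code: the CSV content, the `plays` table and the function names are all made up for the example, and an in-memory SQLite database stands in for the target database.

```python
import csv
import io
import sqlite3

# Hypothetical raw export standing in for a source system.
RAW_CSV = "track,plays\nSong A,10\nSong B,\nSong C,7\n"


def extract(text):
    """Extract: read rows from the CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing play count and cast to int."""
    return [(row["track"], int(row["plays"])) for row in rows if row["plays"]]


def load(records, conn):
    """Load: write the cleaned records into the target database."""
    conn.execute("CREATE TABLE IF NOT EXISTS plays (track TEXT, plays INTEGER)")
    conn.executemany("INSERT INTO plays VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # assumed target: in-memory SQLite
load(transform(extract(RAW_CSV)), conn)

loaded = conn.execute("SELECT track, plays FROM plays ORDER BY track").fetchall()
print(loaded)
```

The row for "Song B" has no play count, so the transform step drops it before loading.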

Data Pipelines

  • Move data from one system to another
  • May follow ETL
  • Data may not be transformed
  • Data may be directly loaded in applications

[Image not included]

Data Structures

Structured data

  • Easy to search and organize
  • Consistent model, rows and columns
  • Defined types
  • Can be grouped to form relations
  • Stored in relational databases
  • About 20% of the data is structured
  • Created and queried using SQL (relational databases)

Semi-structured data

  • Relatively easy to search and organize
  • Consistent model, less rigid implementation: different observations have different sizes
  • Different types
  • Can be grouped, but needs more work
  • Stored in NoSQL databases; common file formats are JSON, XML and YAML
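
A small JSON example makes the "different sizes" and "needs more work" points concrete. The records and field names below are invented for illustration: both follow the same general model, but the second one has no `country` field, so grouping has to handle the gap explicitly.

```python
import json

# Hypothetical records: same general model, but not every field is present.
raw = """
[
  {"artist": "Artist A", "genres": ["rock", "indie"], "country": "PT"},
  {"artist": "Artist B", "genres": ["jazz"]}
]
"""

records = json.loads(raw)

# Grouping takes extra work: a missing field needs an explicit default.
by_country = {}
for record in records:
    country = record.get("country", "unknown")
    by_country.setdefault(country, []).append(record["artist"])

print(by_country)
```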

Unstructured data

  • Does not follow a model, can't be contained in rows and columns
  • Difficult to search and organize
  • Usually text, sound, pictures or videos
  • Usually stored in data lakes, can appear in data warehouses or databases
  • Most of the data is unstructured
  • Can be extremely valuable

Using Spotify as an example, unstructured data includes:

  • Lyrics
  • Songs
  • Album pictures
  • Artist profile pictures

Adding some structure

  • Use AI to search and organize unstructured data
  • Add information to make it semi-structured

SQL

  • Structured Query Language
  • Industry standard for Relational Database Management Systems (RDBMS)
  • Allows you to access many records at once, and group, filter or aggregate them
  • Close to written English, easy to write and understand
  • Data engineers use SQL to create and maintain databases
  • Data scientists use SQL to query (request information from) databases
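
As a rough illustration of filtering, grouping and aggregating many records in one statement, the sketch below runs a query through Python's built-in `sqlite3` module. The `plays` table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Creating and populating the table (the data engineer's side).
conn.execute("CREATE TABLE plays (artist TEXT, track TEXT, plays INTEGER)")
conn.executemany(
    "INSERT INTO plays VALUES (?, ?, ?)",
    [
        ("Artist A", "Song 1", 120),
        ("Artist A", "Song 2", 30),
        ("Artist B", "Song 3", 80),
        ("Artist B", "Song 4", 5),
    ],
)

# Querying it (the data scientist's side): filter, group, aggregate.
query = """
SELECT artist, COUNT(*) AS tracks, SUM(plays) AS total_plays
FROM plays
WHERE plays >= 10
GROUP BY artist
ORDER BY total_plays DESC
"""
result = conn.execute(query).fetchall()
print(result)
```

The `WHERE` clause removes "Song 4" before grouping, so Artist B is counted with one track.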

Database schema

  • Databases are made of tables
  • The database schema governs how tables are related
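
One way to picture a schema relating tables is a foreign key. The sketch below uses hypothetical `artists` and `albums` tables in SQLite: each album must point at an existing artist, and a join follows that relation. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default

# Hypothetical schema: albums.artist_id references artists.artist_id.
conn.executescript("""
CREATE TABLE artists (
    artist_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE albums (
    album_id  INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    artist_id INTEGER NOT NULL REFERENCES artists(artist_id)
);
""")

conn.execute("INSERT INTO artists VALUES (1, 'Artist A')")
conn.execute("INSERT INTO albums VALUES (10, 'Album X', 1)")

# The relation lets the two tables be joined.
rows = conn.execute("""
    SELECT albums.title, artists.name
    FROM albums
    JOIN artists ON albums.artist_id = artists.artist_id
""").fetchall()

# An album pointing at a non-existent artist violates the schema.
try:
    conn.execute("INSERT INTO albums VALUES (11, 'Orphan Album', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rows, rejected)
```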

Data lake vs Data warehouse

[Image: data lake vs data warehouse comparison; not included]

Data catalog for data lakes

  • What is the source of this data?
  • Where is this data used?
  • Who is the owner of the data?
  • How often is this data updated?
  • Good practice in terms of data governance
  • Ensures reproducibility
  • No catalog --> data swamp
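
The four questions above map naturally onto the fields of a catalog record. The sketch below is a toy in-memory catalog, not a real catalog tool: `CatalogEntry`, `register`, the file paths and the owner name are all hypothetical. It also shows how files with no entry, the beginnings of a data swamp, can be spotted.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One record in a hypothetical data catalog."""
    name: str              # dataset path or identifier in the lake
    source: str            # What is the source of this data?
    owner: str             # Who is the owner of the data?
    update_frequency: str  # How often is this data updated?
    used_by: list = field(default_factory=list)  # Where is this data used?


catalog = {}


def register(entry):
    """Add or replace the catalog entry for a dataset."""
    catalog[entry.name] = entry


register(CatalogEntry(
    name="raw/plays.json",
    source="streaming app event log",
    owner="data-engineering",
    update_frequency="hourly",
    used_by=["weekly charts dashboard"],
))

# Files in the lake without a catalog entry are candidates for the swamp.
lake_files = ["raw/plays.json", "raw/lyrics.txt"]
uncataloged = [path for path in lake_files if path not in catalog]
print(uncataloged)
```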

Good practice for any data storage solution

  • Reliability
  • Autonomy
  • Scalability
  • Speed

Database vs. Data warehouse

Database:

  • General term
  • Loosely defined as organized data stored and accessed on a computer

A data warehouse is a type of database.
