6 Responsibilities of a Data Engineer


Introduction

Data engineering is a relatively new field, and as such, there is huge variance in actual job responsibilities across companies. If you are a student, analyst, or engineer who is new to the data space and you

  1. are unclear about a data engineer's job responsibilities, or
  2. believe that the current state of data engineering job descriptions is messy,

then this article is for you. In this post, we cover the six key responsibilities of a data engineer.

Nima Daneshmand

https://datasguide.com/communicate/


Responsibilities of a Data Engineer

1. Move data between systems

This represents the main responsibility of a data engineer. It usually involves

  1. Extract: Extracting data from any number of sources. The source can be an external API, cloud storage, a database, static files, etc.
  2. Transform: Transforming the data. Common transformations include mapping, filtering, enrichment, changing the structure of the data (e.g., denormalizing it), and aggregating.
  3. Load: Loading the data into the destination system. The destination can be a cloud storage file system, a data warehouse, a cache database, etc. A minimal sketch of this flow follows the tools list below.

Common tools/frameworks: Pandas, Spark, Dask, Flink, Beam, Debezium, Kafka, Docker, Kubernetes
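To make the flow concrete, here is a minimal ETL sketch in pandas. The file names and columns (raw_orders.csv, order_id, customer_id, amount, status) are hypothetical stand-ins for a real source and destination:

```python
import pandas as pd

# Extract: read raw order data from a source system. Here it is a local
# CSV file; in practice this could be an API, a database, cloud storage, etc.
orders = pd.read_csv("raw_orders.csv")  # hypothetical columns: order_id, customer_id, amount, status

# Transform: filter, enrich, and aggregate.
completed = orders[orders["status"] == "completed"]           # filtering
completed = completed.assign(amount_usd=completed["amount"])  # mapping/enrichment
revenue_per_customer = (
    completed.groupby("customer_id", as_index=False)["amount_usd"].sum()
)  # aggregation

# Load: write the result to the destination system. A Parquet file stands
# in here for cloud storage or a warehouse staging area.
revenue_per_customer.to_parquet("revenue_per_customer.parquet", index=False)
```

In a real pipeline, each of these three steps would typically be a separate, retryable task rather than one script.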


2. Manage data warehouse

More often than not, the bulk of a company's data lands in its data warehouse. The responsibilities of a data engineer in this context are

  1. Warehouse data modeling: Modeling the data for analytical queries, which are typically aggregations over large tables. Modeling here involves choosing appropriate partitions, designing fact and dimension tables, etc.
  2. Warehouse performance: Making sure queries are fast and the warehouse can scale as needed.
  3. Data quality: Ensuring data quality within the data warehouse (a rough sketch follows the lists below).

Common modeling techniques: Kimball dimensional modeling, Data Vault, and data lake approaches

Common frameworks: Great Expectations and dbt tests for data quality

Common warehouses: Snowflake, Redshift, BigQuery, ClickHouse, Postgres
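Tools like Great Expectations and dbt tests let you declare checks like these; purely as an illustration of what a warehouse data-quality check verifies, here is a hand-rolled sketch in pandas. The table and column names (fact_sales, order_id, amount_usd) are hypothetical:

```python
import pandas as pd

# Hypothetical fact table pulled from the warehouse for validation.
fact_sales = pd.read_parquet("fact_sales.parquet")

def check_quality(df: pd.DataFrame) -> list[str]:
    """Return human-readable descriptions of failed checks."""
    failures = []
    if df["order_id"].isna().any():
        failures.append("order_id contains NULLs")
    if df["order_id"].duplicated().any():
        failures.append("order_id is not unique")
    if (df["amount_usd"] < 0).any():
        failures.append("amount_usd has negative values")
    return failures

problems = check_quality(fact_sales)
if problems:
    # In a real pipeline this would fail the run and alert the on-call engineer.
    raise ValueError("Data quality checks failed: " + "; ".join(problems))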


3. Schedule, execute, and monitor data pipelines

Data engineers are also responsible for scheduling ETL pipelines, making sure they run without issues, and monitoring them.

  1. Scheduling data pipelines to run on a set schedule or in response to an event.
  2. Executing data pipelines and ensuring that they can scale, have the right permissions, etc.
  3. Monitoring data pipelines for failures, deadlocks, and long-running tasks.
  4. Managing metadata such as run time, end-to-end time taken, failure reasons, etc. A minimal scheduling sketch follows the tool lists below.

Common frameworks: Airflow, dbt, Prefect, Dagster, AWS Glue, AWS Lambda, and streaming pipelines using Flink/Spark/Beam

Common databases: MySQL, Postgres, Elasticsearch, and data warehouses

Common storage systems: AWS S3, Google Cloud Storage

Common monitoring systems: Datadog, New Relic
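As an example of scheduling, here is a minimal Airflow DAG sketch with three placeholder tasks. The dag_id and the bodies of the callables are hypothetical; a real pipeline would also configure retries and alerting hooks:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # pull data from the source system

def transform():
    ...  # clean and aggregate the extracted data

def load():
    ...  # write the results to the warehouse

with DAG(
    dag_id="daily_sales_pipeline",  # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",     # run once a day; cron strings also work
    catchup=False,                  # do not backfill past runs
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    t1 >> t2 >> t3  # execute the steps in order
```

The scheduler records each run's start time, duration, and failure reason, which covers most of the metadata management described above.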


4. Serve data to the end-users

Once the data is available in your data warehouse, it's time to serve it to the end-users. The end-users can be analysts, an application, external clients, etc. Depending on the end-user, you may have to set up

  1. Data visualization/dashboard tools: Tools used by humans to analyze the data and create pretty charts that can be shared easily.
  2. Permissions for the data: If the data lives in a table, granting the correct permissions to your applications or end-users; if it lives in cloud storage, granting cloud users the appropriate permissions, etc.
  3. Data endpoints (API): Some applications or external clients may need API-based access to your data. In such cases, a server to send data via an API endpoint will need to be set up, as sketched after the tools list below.
  4. Data dumps for clients: Some clients may require data dumps from your system. In such cases, you will have to set up a data pipeline to facilitate this.

Common tools/languages: Looker, Tableau, Metabase, Superset; role-based permissions (for your system); Python/Scala/Java/Go for API endpoints; pipeline tools for client data dumps
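For the API-endpoint case, a minimal sketch using Flask might look like the following. The route, database file, and table (serving.db, revenue_per_customer) are hypothetical, and SQLite stands in for whatever serving database you use:

```python
import sqlite3  # stand-in for a connection to your serving database

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/customers/<int:customer_id>/revenue")
def customer_revenue(customer_id: int):
    # Look up the customer's total revenue in a hypothetical serving table.
    conn = sqlite3.connect("serving.db")
    try:
        row = conn.execute(
            "SELECT SUM(amount_usd) FROM revenue_per_customer WHERE customer_id = ?",
            (customer_id,),
        ).fetchone()
    finally:
        conn.close()
    return jsonify({"customer_id": customer_id, "revenue_usd": row[0] or 0})

if __name__ == "__main__":
    app.run(port=8000)  # development server only; use gunicorn or similar in production
```

Authentication, rate limiting, and the permission checks described above would sit in front of an endpoint like this.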


5. Data strategy for the company

Data engineers are involved in coming up with the data strategy for the company. This involves

  1. Deciding what data to collect, how to collect it, and how to store it securely.
  2. Evolving data architecture for custom data needs.
  3. Educating end-users on how to use data effectively.
  4. Deciding what data(if any) to share with external clients.

Common tools/frameworks: Confluence, Google Docs, RFC documents, brainstorming sessions, meetings


6. Deploy ML models to production

Data scientists and analysts develop sophisticated models that closely mirror the workings of a specific business process. When it's time to deploy these models, data engineers are usually the ones who optimize them for use in a production environment.

  1. Optimizing training and inference: Setting up batch/online learning pipelines and ensuring the model is appropriately sized.
  2. Setting up monitoring: Setting up monitoring and logging systems for the ML model, as in the sketch below.

Common frameworks: Seldon Core, AWS MLOps services
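As a rough illustration of batch inference with simple logging, here is a sketch assuming a scikit-learn-style model. The file names, feature columns, and churn use case are all hypothetical; a production setup would pull the model from a registry and ship metrics to a monitoring system:

```python
import pickle
import time

import pandas as pd

# Load a trained model handed off by the data-science team. In production
# this would come from a model registry rather than a local pickle file.
with open("churn_model.pkl", "rb") as f:
    model = pickle.load(f)

features = pd.read_parquet("inference_features.parquet")  # hypothetical feature table
X = features.drop(columns=["customer_id"])                # model inputs only

# Score the batch and record how long it took.
start = time.monotonic()
features["churn_score"] = model.predict_proba(X)[:, 1]    # scikit-learn-style API
elapsed = time.monotonic() - start

# Persist predictions for downstream consumers and emit a simple metric;
# a real setup would ship metrics to a system such as Datadog.
features[["customer_id", "churn_score"]].to_parquet("churn_scores.parquet")
print(f"scored {len(features)} rows in {elapsed:.2f}s")
```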


https://datasguide.com


Carlos Fernando Chicata

Data Engineer | AWS User Group Perú - Arequipa | AWS x3

1y

I liked your article, congratulations! As an organization evolves, some of these responsibilities may split off from the role over time, but to some degree this reflects the "evolving" nature of the role.
