What Does a Data Engineer Do?

In short, a data engineer is the person responsible for collecting data from various sources and making it available for analysis by a data scientist. For example, a data engineer at YouTube needs to fetch information about the videos you watch and store it in a table so that a data scientist can analyze that data and recommend further videos. Quite simple, isn’t it? But no, satisfying the data requirements of a business problem is not a single-step task. Let’s look in a bit more detail at what exactly a data engineer does.

Suppose you are a data scientist at a company and your manager gives you the task of predicting Q3 sales for a product in India. What will you do? You turn to the data engineer and describe the business problem. The data engineer then searches for the data relevant to that scenario. That’s the very first task of a data engineer: “Extract”. So, the starting goal is to pull data out of various sources, and for that, the data engineer might need to set up an API or interface connection.
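
To make the extract step concrete, here is a minimal sketch in Python. It assumes two hypothetical sources: a CSV export and a JSON response from an API. The payloads, field names, and function names are all made up for illustration, and the "API response" is a hard-coded string so the example runs without a network connection.

```python
import csv
import io
import json

# Hypothetical raw payloads standing in for two different sources.
# In a real pipeline these would come from a file export and an HTTP call.
CSV_EXPORT = "order_id,region,amount\n1,India,120.50\n2,India,80.00\n"
API_RESPONSE = '{"orders": [{"id": 3, "region": "India", "amount": "45.25"}]}'


def extract_from_csv(text):
    """Read every row of a CSV export into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_from_api(payload):
    """Parse a JSON payload shaped the way our assumed API returns it."""
    return json.loads(payload)["orders"]


csv_rows = extract_from_csv(CSV_EXPORT)
api_rows = extract_from_api(API_RESPONSE)

# The two sources already disagree: different key names and value types.
print(csv_rows[0])
print(api_rows[0])
```

Notice that the two sources use different key names (`order_id` vs `id`) and different value types, which is exactly the situation the next step has to deal with.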

Now the problem is that not every platform provides data in the same format. Raw data may not make much sense to end users, because it is hard to analyze in that form.

So, to handle this problem, the data engineer needs to “Transform” the data into a usable format. Transformations aim at cleaning, structuring, and formatting data sets to make the data consumable for processing or analysis. This includes removing errors, changing formats, and mapping different representations of the same kind of data to a common one. After proper transformation comes the last step: “Load”.
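
As a rough illustration of what transformation can look like, the sketch below cleans a few hypothetical sales records in Python. The records, field names, region aliases, and date formats are assumptions chosen for the example. It shows the three ideas mentioned above: dropping rows with errors, converting formats, and mapping different spellings of the same value to one.

```python
from datetime import datetime

# Hypothetical raw records; field names and formats are assumptions.
raw_records = [
    {"order_id": "1", "region": " india ", "amount": "120.50", "date": "2023-07-01"},
    {"id": 2, "region": "IN", "amount": 80, "date": "01/07/2023"},
    {"order_id": "3", "region": "India", "amount": "n/a", "date": "2023-07-02"},
]

# Different spellings of the same region map to one canonical name.
REGION_ALIASES = {"india": "India", "in": "India"}
# Date formats we assume the sources may use.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(value):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def transform(record):
    """Clean one record; return None when it cannot be repaired."""
    try:
        amount = float(record["amount"])
    except (TypeError, ValueError):
        return None  # remove rows whose amount is not a number
    date = parse_date(record["date"])
    if date is None:
        return None
    region_key = str(record["region"]).strip().lower()
    return {
        "order_id": int(record.get("order_id", record.get("id"))),
        "region": REGION_ALIASES.get(region_key, region_key.title()),
        "amount": amount,
        "date": date,
    }


clean = [row for row in map(transform, raw_records) if row is not None]
for row in clean:
    print(row)
```

The third record is dropped because its amount cannot be parsed. Whether to drop, repair, or flag such rows is a design choice that depends on the business problem.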

Loading the data means inserting it into a database, say MySQL. But the problem here is that not all data extracted from the sources is in the same format, and the volume of data is huge. Standard transactional databases like MySQL are not designed for analytical processing at that scale. So, to solve this issue, a data engineer stores the data in a data warehouse. A data warehouse allows data with different schemas to be stored in one centralized place, where complex analytical queries can be run.
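
The load step can be sketched as follows. To keep the example runnable anywhere, it uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for a data warehouse. That is an assumption made purely for illustration: SQLite is not a warehouse, and a real project would connect to a warehouse product instead. The table name, columns, and rows are also hypothetical.

```python
import sqlite3

# Hypothetical cleaned rows: (order_id, region, amount, sale_date).
clean_rows = [
    (1, "India", 120.50, "2023-07-01"),
    (2, "India", 80.00, "2023-07-01"),
    (3, "India", 45.25, "2023-08-15"),
]

# In-memory SQLite database used only as a stand-in for a warehouse.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "order_id INTEGER PRIMARY KEY, region TEXT, amount REAL, sale_date TEXT)"
)

# Load: insert all cleaned rows in one batch.
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", clean_rows)
conn.commit()

# The kind of analytical query a data scientist might run afterwards:
# total sales per month for one region.
monthly_totals = conn.execute(
    "SELECT substr(sale_date, 1, 7) AS month, SUM(amount) "
    "FROM sales WHERE region = ? GROUP BY month ORDER BY month",
    ("India",),
).fetchall()

for month, total in monthly_totals:
    print(month, total)
```

Once the data sits in one place with a consistent schema, the data scientist can query it directly instead of dealing with each source separately.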

These three steps make up the ETL (Extract, Transform, Load) pipeline for a data analytics project. So, the data engineer is a person who loves to play with data and has domain knowledge about the business problem. We can say that a data engineer is a facilitator to the data scientist.
