Complex concepts made simple!

Today, I studied some concepts that novice Data Engineers often neglect while working on a data project. Here’s what I covered, and I’ll try my best to explain each concept in the simplest possible words.

  • Data Security
  • Data Management
  • Data Governance (Data Discovery and Accountability)
  • Metadata (Business, Technical, Operational, Reference)
  • Data Modelling
  • Data Lineage
  • Data Lifecycle

Let’s start by explaining Data Security:

I believe security is the most underrated consideration when it comes to building data solutions. The urge to skip security best practices and jump right into learning Python, PySpark, and SQL is very common, but no matter how good you are at core data engineering, one mistake, one breach, or one phishing attack can put your company in the news. And trust me, you wouldn’t want that kind of headline.

Since most data engineering happens in the cloud, let me share a few things you can do to improve your security.

  • Make sure you know what’s going on in your IAM console.
  • Make sure you’re applying “the principle of least privilege”.
  • Make sure you know how many users there are and what IAM roles they hold.
  • Make sure you have a strict password policy defined for all your IAM users.
  • Make sure you have an MFA device set up for every user, so the next time someone logs in, they have to enter a one-time passcode from their device in addition to their username and password.
  • Make sure your S3 buckets aren’t unnecessarily public.
  • Also, have a look at your Network Access Control Lists (NACLs) and Security Groups: these two act as firewalls for your AWS cloud environment.
  • Make sure you don’t have any ports publicly exposed.
  • Make sure your data is encrypted in your S3 buckets.

These are some general tips if you’re using AWS to host your data solution.
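
To make a couple of these checks concrete, here’s a minimal boto3 sketch that flags buckets without default encryption and IAM users without MFA. Treat it as an illustration, not a complete audit: it ignores pagination, and the output depends entirely on your account.

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
iam = boto3.client("iam")

# Flag buckets that have no default server-side encryption configured.
for bucket in s3.list_buckets()["Buckets"]:
    name = bucket["Name"]
    try:
        s3.get_bucket_encryption(Bucket=name)
    except ClientError as err:
        if err.response["Error"]["Code"] == "ServerSideEncryptionConfigurationNotFoundError":
            print(f"Bucket without default encryption: {name}")

# Flag IAM users that have no MFA device attached.
for user in iam.list_users()["Users"]:
    username = user["UserName"]
    if not iam.list_mfa_devices(UserName=username)["MFADevices"]:
        print(f"IAM user without MFA: {username}")
```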

Feel free to extend this list in the comments.

Next, let’s talk about Data Management:

It’s a huge concept; entire books have been written on Data Management alone.

But let me try to demystify some of the concepts.

The first concept under Data Management is Data Governance.

Sounds intimidating? Not at all. It’s very simple.

Data Governance says that if you’re storing your data somewhere, it should be “discoverable”. In simple words, it means that whenever a stakeholder or your CEO asks for a dataset, you should be able to find it.

What does that mean?

Imagine you have got a dataset about sales and your CEO wants some report from the sales data. Imagine you’ve 20,000 other datasets in the same data lake.?Now what I mean by “discoverable” is that If your CEO asks for a certain data, finding that dataset should be as easy as typing a few keywords in a search bar.
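
As a concrete sketch: if your lake is cataloged in AWS Glue, that keyword search can be a single API call. The search text here is hypothetical; the point is that you query a catalog, not file names.

```python
import boto3

glue = boto3.client("glue")

# Search the Glue Data Catalog for tables whose metadata mentions "sales".
response = glue.search_tables(SearchText="sales")
for table in response["TableList"]:
    print(table["DatabaseName"], table["Name"])
```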

Wait a second.

Can’t I just find an Excel spreadsheet with sales data and send it to my CEO?

No, you can’t.

The problem is, when you’re dealing with big data, your data is partitions and stored in multiple locations and it’s not as easy as searching based on file names.
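
For instance, a single “sales” dataset might physically be laid out something like this (a hypothetical layout, typical of Hive-style partitioning):

```
s3://company-lake/sales/year=2023/month=01/part-00000.parquet
s3://company-lake/sales/year=2023/month=01/part-00001.parquet
s3://company-lake/sales/year=2023/month=02/part-00000.parquet
...
```

None of those file names says “sales report”, so a plain file-name search gets you nowhere.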

So what should we do?

Add METADATA to the datasets.

Let me explain what metadata is:

Metadata is simply “data about data”. If you have a table in your database and a text file that explains the columns in that table, that file is metadata. Any text, number, or boolean that gives you additional information about the table qualifies as metadata. But as you might expect, it’s not quite that simple:

There are four kinds of metadata:

  • Business Metadata: information about business terms. For example, if your CEO asks for a dataset of all customers, what does “customer” mean here? Is someone who bought something in the last 90 days a customer? Or is anyone who bought $1 worth of items in the last 20 years a customer? That definition is Business Metadata.
  • Technical Metadata: facts about the dataset (table) itself: how many rows it has, how many null values it contains, how many erroneous rows it has, and so on.
  • Operational Metadata: any logs produced as a result of your data engineering. For example, logs generated by your Glue job, or even your Job IDs, count as operational metadata.
  • Reference Metadata: lookup tables. For example, if your datasets involve currency conversion, the table of latest conversion rates is reference metadata.
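
A minimal sketch of how these four kinds might be attached to one catalog entry. Every field name and value below is made up for illustration:

```python
# Hypothetical catalog entry for a "sales" dataset, one section per metadata kind.
sales_catalog_entry = {
    "business": {
        # The agreed business definition of the terms in this dataset.
        "customer": "Anyone who completed a purchase in the last 90 days",
    },
    "technical": {
        "row_count": 1_204_311,
        "null_values": 87,
        "erroneous_rows": 12,
    },
    "operational": {
        # Produced by the pipeline run that built this dataset.
        "glue_job_id": "jr_0123abc",
        "last_run_log": "s3://company-lake/logs/sales/2023-01-31.log",
    },
    "reference": {
        # Lookup data the dataset depends on.
        "currency_rates_table": "reference_db.usd_conversion_rates",
    },
}
```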


What's Data Accountability?

Data accountability simply means making a specific person responsible for ensuring that data management practices are not compromised.

What's Data Modeling?

If you’re creating a database table and defining the attributes of that table, that’s data modeling. If you’re a backend engineer defining the JSON your API returns as a response, that’s data modeling. If you’re a firmware engineer defining how data is going to be stored in memory, that’s data modeling.
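
Here’s the first case as a tiny sketch: a hypothetical orders table defined once as SQL DDL (in the comment) and once as the Python class a backend might serialize to JSON. Both are data models of the same thing.

```python
from dataclasses import dataclass, asdict
from datetime import date
import json

# The same model as SQL DDL would be:
#   CREATE TABLE orders (
#       order_id   INT PRIMARY KEY,
#       customer   VARCHAR(100),
#       amount_usd DECIMAL(10, 2),
#       placed_on  DATE
#   );

@dataclass
class Order:
    """Defines the shape of an order: this definition *is* the data model."""
    order_id: int
    customer: str
    amount_usd: float
    placed_on: date

# A backend returning this as an API response is doing data modeling too.
order = Order(1, "Alice", 49.99, date(2023, 1, 31))
print(json.dumps(asdict(order), default=str))
```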


Now let’s talk about Data Lineage:

Assume you have a dataset (C) that was made by combining some columns from two datasets (A and B).

C = A + B

Now let’s assume you have another dataset (D) from another organization.

You’ve now created a fifth dataset (E) by combining C and D.

E = D + C

Okay, now the most recent dataset is (E).

I just have one question.

If I give you the dataset (E), would you be able to identify all the ingredients?

Would you be able to tell me how many datasets were involved in creating dataset (E)?

If yes, then you have data lineage.
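
A toy way to picture it: record each dataset’s parents, and lineage is just walking that graph back to the sources. Real lineage tools do the same thing with far more detail; the graph below is simply the A/B/C/D/E example from above.

```python
# Each dataset maps to the datasets it was derived from.
parents = {
    "A": [],          # raw source
    "B": [],          # raw source
    "C": ["A", "B"],  # C = A + B
    "D": [],          # external dataset
    "E": ["D", "C"],  # E = D + C
}

def ingredients(dataset: str) -> set[str]:
    """Return every upstream dataset that went into `dataset`."""
    found = set()
    for parent in parents[dataset]:
        found.add(parent)
        found |= ingredients(parent)
    return found

print(ingredients("E"))  # {'A', 'B', 'C', 'D'}
```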

Data Lifecycle Management:

Before the era of cloud computing, it was okay to store all the data you ever created on-prem, if you had some extra space. It wouldn’t cost you a penny, because you’d already paid for the on-prem hardware years ago. But on cloud platforms, every single BIT can show up in the billing: you could trace every penny in your bill back to a 1 KB JSON file in an S3 bucket.

Data Engineers can use Data Lifecycle Management to manage their data better. Here’s what I mean: almost all cloud providers offer storage in multiple tiers, and most of them offer an archival tier that stores data at a much lower cost. Data Engineers should consider moving their stale data to the archival tier to save money.
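
On AWS, for example, that move to an archival tier can be automated with an S3 lifecycle rule. A minimal boto3 sketch follows; the bucket name, prefix, and 90-day threshold are all assumptions you’d tune for your own data.

```python
import boto3

s3 = boto3.client("s3")

# Transition objects under the "stale/" prefix to Glacier after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="company-lake",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-stale-data",
                "Filter": {"Prefix": "stale/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```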


Please help refine the blog in the comments section. Let's discuss it!

Muhammad Hamza Javed

Lead Data Engineer | AWS Certified | Databricks Certified | Azure | Spark | Airflow | AirByte | Snowflake
