Complex concepts made simple!
Muhammad Hamza Javed
Lead Data Engineer | AWS Certified | Databricks Certified | Azure | Spark | Airflow | AirByte | Snowflake
Today, I studied some concepts that novice data engineers often neglect while working on a data project. I'll try my best to explain each of them in the simplest possible words.
Let’s start by explaining Data Security:
I believe security is the most underrated consideration when building data solutions. The urge to skip security best practices and jump straight into learning Python, PySpark, and SQL is very common, but no matter how good you are at core data engineering, one mistake, one breach, or one phishing attack can put your company in the news. And trust me, you wouldn't want that kind of highlight.
Since most data engineering happens in the cloud, let me share a few things you can consider to improve your security.
These are some general tips if you’re using AWS to host your data solution.
Feel free to extend this list in the comments.
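To make this concrete, here is a minimal sketch of two quick wins on AWS: blocking public access on an S3 bucket and enforcing default encryption. It uses boto3, the bucket name is hypothetical, and this is a starting point, not a complete security setup.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-lake-bucket"  # hypothetical bucket name

# Block every form of public access on the bucket.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Enforce default server-side encryption with SSE-KMS.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}
        ]
    },
)
```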
Next, let’s talk about Data Management:
It's a huge topic, with entire books written on data management alone.
But let me try to demystify some of the concepts.
The first concept under Data Management is Data Governance.
Sounds intimidating? Not at all. It’s very simple.
Data Governance says that if you're storing your data somewhere, it should be "discoverable". In simple words, it means that whenever a stakeholder or your CEO asks for a dataset, you should be able to find it.
What does that mean?
Imagine you have a dataset about sales and your CEO wants a report from the sales data. Imagine you have 20,000 other datasets in the same data lake. Now, what I mean by "discoverable" is that if your CEO asks for a certain dataset, finding it should be as easy as typing a few keywords into a search bar.
Wait a second.
Can’t I just find an Excel spreadsheet with sales data and send it to my CEO?
No, you can’t.
The problem is, when you're dealing with big data, your data is partitioned and stored across multiple locations, and it's not as easy as searching by file name.
So what should we do?
Add METADATA to the data sets.
Let me explain what metadata is:
Metadata is simply "data about data". If you have a table in your database and a text file that explains the columns in that table, the file is metadata. Any text, number, or boolean that gives you additional information about the table qualifies as metadata. As you might expect, though, it's not quite that simple:
There are four kinds of metadata:
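Setting the taxonomy aside for a moment, here is a minimal sketch, in plain Python with made-up dataset names, of how even a tiny metadata catalog makes datasets discoverable by keyword instead of by file path.

```python
# A toy metadata catalog: each entry describes a dataset, not the data itself.
catalog = {
    "s3://lake/sales/2023/": {
        "name": "sales_transactions",
        "owner": "revenue-team",
        "description": "Daily sales transactions partitioned by date",
        "tags": ["sales", "revenue", "transactions"],
    },
    "s3://lake/marketing/leads/": {
        "name": "marketing_leads",
        "owner": "growth-team",
        "description": "Inbound leads captured from web forms",
        "tags": ["marketing", "leads"],
    },
}

def search(keyword: str):
    """Return dataset locations whose metadata mentions the keyword."""
    keyword = keyword.lower()
    return [
        path
        for path, meta in catalog.items()
        if keyword in meta["description"].lower() or keyword in meta["tags"]
    ]

print(search("sales"))  # ['s3://lake/sales/2023/']
```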
What's Data Accountability?
Data accountability is simply assigning a role to someone to ensure that data management practices are not compromised.
What's Data Modeling?
If you're creating a database table and defining the attributes of that table, it's data modeling. If you're a backend engineer and you're defining the JSON returned by your API as a response, that's data modeling. If you're a firmware engineer and you're defining how the data is going to be stored in memory, that's data modeling too.
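As a small illustration (a sketch with hypothetical field names), modeling an API response in Python could look like this: you decide which fields exist, what types they have, and what they mean, before any data is produced.

```python
from dataclasses import dataclass, asdict
from datetime import date

# Data modeling: fixing the shape and meaning of the data up front.
@dataclass
class SaleRecord:
    order_id: str       # unique identifier of the order
    customer_id: str    # who placed the order
    amount_usd: float   # order total in US dollars
    order_date: date    # when the order was placed

record = SaleRecord("ord-001", "cust-42", 199.99, date(2024, 1, 15))
print(asdict(record))  # the JSON-like shape your API would return
```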
Now let’s talk about Data Lineage:
Assume you have a dataset (C) that was made by combining some columns from two datasets (A and B).
C = A + B
Now let’s assume you have another dataset (D) from another organization.
You’ve now created a 5th dataset called dataset (E) by combining C and D.
E = D + C
Okay, now the most recent dataset is (E).
I just have one question.
If I give you the dataset (E), would you be able to identify all the ingredients?
Would you be able to tell me how many datasets were involved in Dataset E?
If yes, then you have data lineage.
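In code, lineage is often just a directed graph from each dataset to its inputs. A minimal sketch, reusing the dataset names from the example above:

```python
# Each dataset maps to the datasets it was derived from.
lineage = {
    "A": [],
    "B": [],
    "C": ["A", "B"],   # C = A + B
    "D": [],
    "E": ["C", "D"],   # E = D + C
}

def ingredients(dataset: str) -> set:
    """Recursively collect every upstream dataset involved in `dataset`."""
    upstream = set()
    for parent in lineage[dataset]:
        upstream.add(parent)
        upstream |= ingredients(parent)
    return upstream

print(ingredients("E"))  # {'A', 'B', 'C', 'D'}
```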
Data Lifecycle Management:
Before the era of cloud computing, it was okay to store all the data you ever created on-prem, if you had some extra space. It wouldn't cost you a penny, because you'd already paid for the on-prem hardware years ago. But now, on cloud platforms, every single bit is reflected in the billing. You can trace every penny in your bill back to a 1 KB JSON file sitting in an S3 bucket.

Data engineers can use data lifecycle management to handle their data better. Here's what I mean by that: almost all cloud providers offer storage in multiple tiers, and most of them include an archival tier that stores data at a much lower cost. Data engineers should consider moving their stale data to an archival tier to save money.
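For instance, on AWS you can automate this with an S3 lifecycle rule. A minimal sketch using boto3, where the bucket name and prefix are hypothetical and the retention periods are just examples:

```python
import boto3

s3 = boto3.client("s3")

# Move objects under the "raw/" prefix to Glacier after 90 days,
# and expire them entirely after 3 years.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-stale-raw-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 1095},
            }
        ]
    },
)
```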
Please help refine the blog in the comments section. Let's discuss it!