Cloud Data Security With Snowflake (Part-1: Data At Rest)
Nick Akincilar
Analytics, AI & Cloud Data Architect | Solutions Whisperer | Tech Writer
Data security is probably the number one topic on everyone's mind when it comes to moving data into the cloud. This is especially true if you are new to the cloud and don't have a seasoned technical team that understands cloud security.
Data security has four major puzzle pieces.
- Securing data at rest: how data is guarded against malicious access while it is stored on disk.
- Securing data in transit: how result sets are secured while being transmitted back & forth between the consumer & the data platform.
- User authentication: how the platform verifies that you are really you, and not an imposter, before allowing any access.
- User authorization: what you can see & do within the platform after you have passed authentication.
For those running on-prem data solutions, most of the security focus was on user authentication & authorization, since storage & data transmission were mostly contained within the walls of your organization & private network. It was much easier to defend against external threats, as they had no way to access your systems from outside unless they found a way in through a phishing attack.
This all changes when you move your workloads to the cloud. Securing data at rest, data in transit & user authentication takes on a much bigger role, as resources no longer sit within the confines of your organization but instead live in publicly accessible data centers that anyone can reach from outside. Making sure all your resources are secure within a shared cloud environment is a whole different ball game, and data security takes on a much higher level of importance.
This is especially true if you are new to the cloud game and confused by the sea of different options & ways to secure your data, where a single misconfiguration or oversight can leave your data exposed.
For most solutions, security has to be properly enabled, configured & maintained across all datasets, access methods & users, and that can get complex fast. This is one of the biggest challenges for on-prem companies moving their workloads to the cloud, and it slows or even stops their progress.
So how does Snowflake help? Let's look at this one puzzle piece at a time. In Part 1 of this article, I will cover the security of data at rest.
Part-1: Security of Data at Rest
When you store any piece of data, the best way to protect it from bad actors is to encrypt it while it is stored. For most solutions, encryption is an option that has to be turned on, and turning it on may impact the performance of the platform. It is therefore common to see only some of the data encrypted, with your IT team left to manage which datasets are encrypted and which are not. Encryption & performance don't always play well together, and the need for performance may outweigh the security risks.
For Snowflake, encryption is not optional. All data is encrypted, all the time, and there is no way to turn this off. Any piece of data managed by any Snowflake customer, whether stored in its raw semi-structured format as part of a data lake or as structured data in a data warehouse, is always encrypted at rest by default.
However, simply applying standard encryption to a file containing a table of data, as most other solutions do, is not nearly strong enough to put everyone at ease about storing their precious data in the cloud. This is why Snowflake takes data encryption to a whole new level.
As with most data platforms, Snowflake has the concept of a database, and within a database you store your data in tables organized into schemas. Because Snowflake is a multi-cloud platform provided as a service, it has multiple deployments across different cloud vendors & regions. Each deployment hosts many Snowflake accounts belonging to many customers, and a single customer can have multiple accounts spread across different cloud providers and regions.
So how is this hierarchy used for data encryption within Snowflake? I will try to simplify it by comparing the entire encryption process to the way users store an Excel file in a secured folder on a network drive share. The way data is stored in Snowflake has nothing to do with network folders & shares, but we have all used them to store sensitive files at one time or another at work, so the analogy should be familiar to most of us.
Let's replicate this hierarchy of cloud deployment, account, database & tables using network file share folders that hold your Excel files. You would have a shared network drive (the Snowflake deployment itself); within that share, a secure folder for each account (your Snowflake account); and within that secure folder, sub-folders to group similar data files together (the databases). If you had to store an Excel file with sales order data for the sales department, you would see something like this, with access only to the Customer_ABC folder within the main share and not to other customer folders:
First things first: Snowflake does NOT store tables as single files. A table in Snowflake is a virtual representation of many smaller files with random names in the backend. All tables are split into smaller chunks called micro-partitions and stored as a collection of randomized files. So what looks like a SALES table in a database is really a collection of micro-partition files in the storage layer, where the services layer (the brains of the system) keeps track of how they are pieced together. This is completely automated & transparent to users and not something they can see or manage.
If we visualize this hierarchy in a more familiar form, such as files in a shared drive, it would look something like this: users would see and work with what looks like a single table (depicted as an Excel file), but the data would actually be stored as a series of randomly named file chunks, each containing only a small portion of the entire dataset.
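To make the micro-partition idea concrete, here is a minimal Python sketch of the split-and-manifest pattern described above. The chunk size, file naming, and manifest shape are all made up for illustration; Snowflake's real partitioning is internal and fully automatic.

```python
import uuid

def split_into_micro_partitions(rows, rows_per_partition=3):
    """Toy sketch: split a table's rows into small chunks, give each
    chunk a random meaningless file name, and keep a manifest mapping
    the logical table back to its physical files (the role the services
    layer plays in Snowflake). All names and sizes here are invented."""
    manifest = {}
    for i in range(0, len(rows), rows_per_partition):
        file_name = uuid.uuid4().hex + ".part"   # random name, made-up extension
        manifest[file_name] = rows[i:i + rows_per_partition]
    return manifest

orders = [("ord-%d" % n, 100 + n) for n in range(8)]
manifest = split_into_micro_partitions(orders)
print(len(manifest))  # 8 rows at 3 rows per chunk -> 3 files
```

The point of the manifest is that without it, the storage layer is just a pile of randomly named fragments with no visible relationship to any table.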
Now let's look at the elaborate encryption process. All data stored within Snowflake is wrapped in 4 levels of encryption keys:
- The root key (Unique Key per Snowflake cloud deployment)
- Account master keys (Unique Key per Account)
- Table/Object master keys (Unique Key per Table)
- File keys (Unique Key per micro partition file)
Snowflake uses these smaller file chunks to apply high levels of encryption and keep your data safe from those pesky unauthorized users.
As data comes into Snowflake, it is broken down into smaller chunks called micro-partitions (MPs), and each file chunk is encrypted with its own unique encryption key. To assemble a table of data, you would need to figure out the encryption keys of every micro-partition file that makes up that table.
That sounds very secure, but it is not the end of it. The encryption key for each micro-partition file is itself encrypted by another key tied to the database table it belongs to. As if that were not enough, that key is then encrypted with a key tied to each unique Snowflake account, and that key in turn is encrypted with a key unique to each Snowflake cloud/region deployment.
The diagram below shows what this looks like in the end. Just to read the data in a Sales table with only 5 micro-partitions (production-size datasets can easily have hundreds or thousands), you would have to figure out 20 unique encryption keys and match the right key to the right file, assuming you could even figure out which set of randomly named files makes up that particular table. Complex enough?
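The four-level wrapping described above is a form of envelope encryption, and it can be sketched in a few lines of Python. A toy XOR keystream cipher stands in for the AES-256 that Snowflake actually uses, and the key names and sizes are illustrative, but the structure of the chain is the point: reading one micro-partition means unwrapping every level, top down.

```python
import os, hashlib

def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """Stand-in for AES: XOR the data with a hash-derived keystream.
    Illustrative only; a real system would use a real cipher."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

toy_decrypt = toy_encrypt  # XOR is its own inverse

# One key per level of the hierarchy (all randomly generated here).
root_key    = os.urandom(32)   # per cloud deployment
account_key = os.urandom(32)   # per Snowflake account
table_key   = os.urandom(32)   # per table
file_key    = os.urandom(32)   # per micro-partition file

# Encrypt the data with the file key, then wrap each key with its parent.
ciphertext          = toy_encrypt(file_key, b"row1,row2,row3")
wrapped_file_key    = toy_encrypt(table_key, file_key)
wrapped_table_key   = toy_encrypt(account_key, table_key)
wrapped_account_key = toy_encrypt(root_key, account_key)

# Reading the data means unwrapping the whole chain, top down.
acct  = toy_decrypt(root_key, wrapped_account_key)
tbl   = toy_decrypt(acct, wrapped_table_key)
fkey  = toy_decrypt(tbl, wrapped_file_key)
plain = toy_decrypt(fkey, ciphertext)
print(plain)  # b'row1,row2,row3'
```

Note that only the innermost key ever touches the data; the outer three levels exist purely to protect keys, which is why rotating them later is cheap.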
If you are a hard person to impress, there is more. You can also bring your own key into the mix to control how the data is encrypted. In this case, your custom key gets combined with the account-level key to create a composite account key, which is then used to encrypt all micro-partition files associated with all the data stored in your account, as shown below.
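A rough sketch of the bring-your-own-key idea follows. The derivation here (a hash over both halves) is an assumption for illustration; Snowflake's actual key-composition scheme is internal to the service. What matters is that neither party's key alone is enough to derive the composite account key.

```python
import hashlib, os

# Both halves of the composite key (randomly generated for the sketch).
snowflake_account_key = os.urandom(32)   # managed by Snowflake
customer_key          = os.urandom(32)   # managed by you (BYOK)

def composite_account_key(service_key: bytes, byok_key: bytes) -> bytes:
    """Derive one account key from both parties' keys, so neither key
    alone is sufficient to unwrap the table & file keys beneath it.
    (Hash-based derivation is an assumption, not Snowflake's scheme.)"""
    return hashlib.sha256(service_key + byok_key).digest()

key = composite_account_key(snowflake_account_key, customer_key)

# If you revoke your key (e.g. rotate it away in your own KMS), the
# composite key can no longer be derived and the data goes dark.
revoked = composite_account_key(snowflake_account_key, os.urandom(32))
print(key != revoked)  # True
```

The practical upshot is a kill switch: withdrawing your half of the key makes every micro-partition in the account unreadable, even to the service.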
If we take all of this and actually transpose it on to an actual snowflake data structure, here is what it would look like.
"Holy Molly, that is a whole lot of data encryption, Doesn't all this affect the system performance? Especially if we can't turn it off !"
The short answer is no! Encryption obviously has a cost, but it is all baked into the service, and all performance levels are achieved with full encryption on. If you don't believe me, read any of the many Snowflake reviews online, where users do nothing but rave about its blazing-fast performance and ease of use compared to anything else they have used before.
At this point, you would think this was the end of Snowflake's encryption story when it comes to data at rest. You would be wrong, we are not done yet.
So far we have learned that Snowflake splits datasets into smaller file chunks, encrypts each chunk with a totally unique encryption key, and then wraps that key in 3 additional encryption keys, plus an optional self-managed key (BYOK) if you choose.
As if this were not enough, Snowflake adds two more levels of protection on top.
Automated 30-day key rotation.
This is an automated feature baked into the service. Snowflake changes the active account & table level encryption keys every 30 days and retires the previous keys, which from then on are used only to decrypt the older data. This means any data you accumulate over time is encrypted with a different set of unique table & account level keys for each 30-day period.
As an example, say you had sales data being ingested into Snowflake every day. Snowflake would create a new set of micro-partition files (chunks) for any new or modified data each day. If you queried this table one year later, files created in the first month would be encrypted with a different set of account & table keys than files created in month 2, month 3, and so on. Snowflake would use the current keys for data being created now and the retired historical keys to read chunks created more than 30 days ago. Assembling 12 months of data would therefore involve 12 different sets of key combinations, on top of the unique encryption key for each file chunk and the root key at the deployment level. Again, this is automated and transparent to Snowflake users & admins: all they have to do is load & query the data and leave the encryption to Snowflake, with no downtime or performance hit.
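The rotation schedule above boils down to one question per micro-partition: which key generation was active on the day the file was written? A toy sketch, with illustrative dates and a made-up version-numbering scheme:

```python
from datetime import date, timedelta

ROTATION_DAYS = 30  # a new account/table key generation every 30 days

def key_version_for(created: date, first_key_date: date) -> int:
    """Which generation of table/account key wraps a micro-partition
    created on a given day. Purely illustrative; Snowflake tracks this
    internally and users never see key versions."""
    return (created - first_key_date).days // ROTATION_DAYS

start = date(2021, 1, 1)
# Micro-partitions written on day 0, 15, 45, 100 and 360 of the year.
partitions = [start + timedelta(days=d) for d in (0, 15, 45, 100, 360)]
versions = [key_version_for(p, start) for p in partitions]
print(versions)  # [0, 0, 1, 3, 12]
```

Two files written in the same 30-day window share a wrapping-key generation, while files a year apart share nothing above the deployment root key.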
Automated Yearly Re-Keying
Where key rotation uses a new set of account & table level keys every 30 days for new data while keeping previous keys intact for accessing older data, re-keying instead replaces all the keys, current & historical, for all micro-partition chunks with an entirely new set. This is an option you turn on within your account with a single line of SQL, and once enabled, Snowflake automatically re-keys all your data yearly, changing the encryption keys without any service disruption, as part of the easy-to-use platform-as-a-service offering Snowflake is known for.
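Conceptually, re-keying is a decrypt-with-the-retiring-key, re-encrypt-with-a-fresh-key pass over every chunk, after which the old key material is destroyed. The sketch below reuses the toy XOR cipher from earlier; everything about it (names, key sizes, the cipher itself) is illustrative, not Snowflake's implementation.

```python
import os, hashlib

def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """Toy XOR keystream cipher standing in for real encryption."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

toy_decrypt = toy_encrypt  # XOR is its own inverse

# Two micro-partitions encrypted under last year's keys.
old_keys = {"mp1": os.urandom(32), "mp2": os.urandom(32)}
files = {name: toy_encrypt(k, b"data-" + name.encode())
         for name, k in old_keys.items()}

# Annual re-keying: decrypt with the retiring key, re-encrypt with a
# brand-new one, then destroy the old keys. Data stays readable, but
# nothing is protected by year-old key material any more.
new_keys = {}
for name in files:
    plaintext = toy_decrypt(old_keys[name], files[name])
    new_keys[name] = os.urandom(32)
    files[name] = toy_encrypt(new_keys[name], plaintext)
old_keys.clear()  # retired keys are destroyed

print(toy_decrypt(new_keys["mp1"], files["mp1"]))  # b'data-mp1'
```

This is the difference from 30-day rotation: rotation adds new key generations while keeping old ones alive for reads, whereas re-keying retires old key material entirely.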
What about any temp cache data?
There is one more twist to storing data: the temporary data stored as part of the results cache for any user query. Any time a query executes, the resulting dataset is written to storage & kept for 24 hours. If the exact same query is executed again within 24 hours and the underlying table data has not changed, Snowflake simply returns the previous results from the cache instead of re-running the query, using no compute resources. This boosts performance when identical queries run over & over, as with BI dashboards where different users see the same data each time they view their dashboards. Even this cached data is subject to the same encryption rules: every cached query result set is encrypted using the same combination of 4 encryption keys, one of which is unique to each data chunk within each result set.
This also means that any time a query is executed and results are transmitted to the user, each resulting dataset is uniquely encrypted. And if the result set is large enough, Snowflake will break it down & send it in multiple chunks, using a different set of 4 keys for each chunk that makes up the query results.
That completes Snowflake's story for encrypting data at rest, and query result sets are a good segue into Part-2 of our encryption story: securing data in transit.
Please tune in for my next article on the encryption of data in transit.
As always, if you like this content and think it may benefit others in your network, please click on like or feel free to share.