AWS & Snowflake | Better Together | Highlights from Re:Invent 2022
Adam Morton
Empowering data leaders with tech-agnostic, ROI-driven data strategies, design and execution | Best-Selling Author | Founder of Mastering Snowflake Program
Thank you for reading my latest article AWS & Snowflake | Better Together | Highlights from Re:Invent 2022
Here at LinkedIn I regularly write about modern data platforms and technology trends. To read my future articles simply join my network here or click 'Follow'. Also feel free to connect with me via YouTube.
------------------------------------------------------------------------------------------------------------
The landscape of data, analytics and machine learning is constantly in flux. It’s a hugely competitive space, with vendors of all shapes and sizes competing for a slice of a market estimated to have been worth a staggering USD 41.39 billion last year. And, if that’s not a big enough number to get your head around, just wait until you hear how much it's predicted to be worth by 2030: USD 364.33 billion! Crazy.
AWS re:Invent was held towards the very end of 2022, and Amazon came to the Christmas party early with a flurry of interesting announcements. In this article I’ll pick out the key ones which stood out to me. AWS has such a large footprint that it holds a distinct level of influence over where the market may go next, so you can think of this as a barometer of sorts for how data and analytics capabilities may evolve over the next 12 months.
Amazon DataZone
Organizations are often aware of how important it is to leverage and secure their data assets, but, as they grow to petabyte scale and beyond, data invariably spreads across multiple departments, data lakes and silos. Bringing this all together to ensure it's available to the right people at the right time is hard enough; add security into the mix and striking the right balance becomes a real challenge.
Amazon DataZone attempts to wrap this up as a Data Management as a Service offering, designed to enable businesses to bring all of their data together for use anywhere within the organization, along with granular management of essential features such as permissioning, security, and data governance. DataZone aims to securely democratize data while making data silos visible, so users across the organization can share, search, and discover data at scale across departments. The diagram below shows how DataZone is positioned within your data landscape; the idea is that you can log into a web portal and manage all of your data assets from one place.
Redshift
Central to the AWS data landscape is its version of the modern data platform, Redshift. Redshift is a fully managed, petabyte-scale data warehouse used by tens of thousands of customers to easily, quickly, securely, and cost-effectively analyze all their data at any scale. At re:Invent, Amazon announced a number of Redshift features to help you simplify data ingestion and get to insights easily and quickly, within a secure, reliable environment.
Organizations are often faced with the challenge of attempting to consolidate data from multiple disparate data sources into an enterprise data warehouse such as Redshift in order to generate actionable insights. This requires them to build manual data pipelines spanning across their operational databases, data lakes, streaming data, and data within their warehouse - which takes a lot of time, effort and money.
Auto-Copy from Amazon S3
One feature which aims to alleviate this pain is auto-copy from Amazon S3. Redshift automatically loads the files that arrive in an Amazon Simple Storage Service (Amazon S3) location that you specify into your data warehouse. The files can use any of the formats supported by the Amazon Redshift copy command, such as CSV, JSON, Parquet, and Avro. The diagram below illustrates this process:
This capability brings with it several key benefits, chief among them removing the need to build and maintain manual ingestion pipelines for files landing in S3, saving time, effort and money.
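To make this concrete, here is a minimal sketch of how an auto-copy job could be set up from Python, assuming the COPY JOB syntax Redshift introduced in preview alongside this announcement and illustrative bucket, role, cluster and table names; check the Amazon Redshift documentation for the exact DDL available in your region.

```python
import boto3

# Redshift Data API client; all names and ARNs below are illustrative placeholders.
redshift_data = boto3.client("redshift-data", region_name="us-east-1")

# A COPY JOB tells Redshift to watch the S3 prefix and load new files as they
# arrive. The exact COPY JOB syntax may differ by release -- consult the
# Amazon Redshift documentation before relying on it.
create_copy_job_sql = """
COPY sales.orders
FROM 's3://my-landing-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
FORMAT AS PARQUET
JOB CREATE orders_auto_copy
AUTO ON;
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="my-redshift-cluster",  # assumed cluster name
    Database="analytics",                     # assumed database name
    DbUser="etl_user",                        # assumed database user
    Sql=create_copy_job_sql,
)
print("Submitted COPY JOB statement:", response["Id"])
```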
Amazon Aurora zero-ETL integration with Amazon Redshift
In a process similar to the Amazon S3 auto-copy feature described above, Redshift is now able to auto-ingest data from one or more Aurora MySQL databases into a Redshift cluster at low latency.
With this capability, you can choose the Amazon Aurora databases containing the data you want to analyze with Amazon Redshift. Data is then replicated into your data warehouse within seconds after data is written into Amazon Aurora, eliminating the need to build and maintain complex data pipelines. I’d expect in the future that AWS will introduce more granular controls, such as selecting individual schemas and tables rather than the entire database.
Many customers already leverage Aurora to store transactional data, so the ability to quickly and efficiently move this data to Redshift for near real-time analysis will be highly compelling.
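Once an integration is in place, the replicated tables can be queried like any other Redshift data. Below is a minimal sketch using the Redshift Data API; the cluster, database and table names are assumptions for illustration, since the integration itself is configured through the AWS console.

```python
import time
import boto3

redshift_data = boto3.client("redshift-data", region_name="us-east-1")

# Query a table replicated from Aurora MySQL by the zero-ETL integration.
# 'aurora_sales' and 'orders' are assumed names matching whatever database
# the integration materializes in your Redshift cluster.
resp = redshift_data.execute_statement(
    ClusterIdentifier="my-redshift-cluster",
    Database="aurora_sales",
    DbUser="analyst",
    Sql="SELECT order_status, COUNT(*) AS order_count FROM orders GROUP BY order_status;",
)

# Poll until the statement finishes, then fetch the result set.
while True:
    status = redshift_data.describe_statement(Id=resp["Id"])
    if status["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if status["Status"] == "FINISHED":
    records = redshift_data.get_statement_result(Id=resp["Id"])["Records"]
    print(records)
```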
Multi-AZ support
Another feature announced at re:Invent addresses resilience and high availability. Customers can now run Redshift across multiple Availability Zones (AZs) within an AWS Region, replicating data across physically separate data centres to mitigate against outages or failures. Essentially, this acts as an insurance policy to ensure business-critical data remains available in the event of a disaster. AWS takes care of the automatic recovery, while the Redshift data warehouse behaves like one centralized data warehouse to all users.
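As a rough sketch, provisioning a Multi-AZ cluster programmatically could look something like the following, assuming the MultiAZ flag exposed by the Redshift CreateCluster API (the feature was in preview at announcement) and illustrative cluster settings.

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Provision an RA3 cluster with Multi-AZ enabled. The MultiAZ flag requires a
# recent boto3 release and an RA3 node type; every value here is an assumed
# example, not a recommendation.
redshift.create_cluster(
    ClusterIdentifier="analytics-multi-az",
    NodeType="ra3.4xlarge",
    NumberOfNodes=2,
    MasterUsername="admin",
    MasterUserPassword="REPLACE_WITH_A_STRONG_PASSWORD",
    DBName="analytics",
    MultiAZ=True,
)
```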
Dynamic Data Masking
Dynamic data masking has also been introduced in preview for Redshift. This enables you to load data once into Redshift and automatically apply masking policies at execution time, so users and applications only see the values their role permits. This simplifies the management of sensitive data stored in your data warehouse. Previously, customers often maintained multiple copies of data, or layers upon layers of complex views, to cater for these requirements.
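A minimal sketch of what this looks like in practice is below. The policy, table, column and role names are assumptions, and the exact DDL may differ from the preview documentation, so treat it as illustrative rather than definitive.

```python
import boto3

redshift_data = boto3.client("redshift-data", region_name="us-east-1")

# Create a masking policy and attach it to a column for a given role.
# Policy, table, column and role names are assumptions for illustration.
create_policy_sql = """
CREATE MASKING POLICY mask_credit_card
WITH (credit_card VARCHAR(256))
USING ('XXXX-XXXX-XXXX-' || SUBSTRING(credit_card, 16, 4));
"""

attach_policy_sql = """
ATTACH MASKING POLICY mask_credit_card
ON customers(credit_card)
TO ROLE analyst_role;
"""

redshift_data.batch_execute_statement(
    ClusterIdentifier="my-redshift-cluster",
    Database="analytics",
    DbUser="admin",
    Sqls=[create_policy_sql, attach_policy_sql],
)
```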
For more information on Dynamic Data Masking and how it works on Snowflake check out my video here.
Athena for Apache Spark
Apache Spark is a popular, open-source, distributed processing system designed to run fast analytics workloads on data of any size. However, building the infrastructure to run Apache Spark for interactive applications is not easy: customers need to provision, configure, and maintain that infrastructure in addition to their applications. Despite this, the Spark framework is very popular; I know of several customers who run PySpark data pipelines across their AWS infrastructure.
Because Amazon Athena for Apache Spark is serverless, customers can perform interactive data exploration and gain insights without needing to provision and maintain the resources required to run Apache Spark. With this feature, customers can now build Apache Spark applications using the notebook experience directly from the Athena console, or programmatically using APIs.
Amazon Athena integrates with the AWS Glue Data Catalog, which inspects and creates a taxonomy of your data assets. This allows customers to work with any data source in AWS Glue Data Catalog, including data in Amazon S3. Data scientists can then leverage this information to analyze, visualize and explore data in order to prepare data sets for machine learning pipelines.
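As an illustration of the programmatic route, the sketch below starts a Spark session in a Spark-enabled Athena workgroup and submits a small PySpark calculation against a Glue Data Catalog table. The workgroup, database and table names, and the DPU setting, are assumptions.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start a Spark session in a Spark-enabled Athena workgroup. In practice you
# would poll get_session() until the session is idle before submitting work.
session = athena.start_session(
    WorkGroup="spark-workgroup",                  # assumed Spark-enabled workgroup
    EngineConfiguration={"MaxConcurrentDpus": 4},
)

# PySpark submitted to the session; 'spark' is provided by the Athena runtime,
# and the Glue database/table names are assumed examples.
pyspark_code = """
df = spark.sql("SELECT * FROM my_glue_db.web_events LIMIT 10")
df.show()
"""

calc = athena.start_calculation_execution(
    SessionId=session["SessionId"],
    CodeBlock=pyspark_code,
    Description="Quick exploration of a Glue Data Catalog table",
)
print("Calculation started:", calc["CalculationExecutionId"])
```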
AWS Glue Data Quality
When it comes to AI and machine learning use cases, data quality is certainly an important factor in planning for a successful outcome. A new feature announced for AWS Glue, the data integration service, is a data quality component which attempts to identify ‘pollution’ in your data lake.
It can analyze your tables and automatically recommend a set of rules based on what it finds. You can fine-tune those rules if necessary, and you can also write your own. Each rule relates to a table, or to columns within a table, and validates properties such as timeliness, accuracy and integrity. For example, a rule can require that a table has the expected number of columns, that column names match a desired pattern, and that a specific column is usable as a primary key.
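As a rough example, a ruleset can be created programmatically using Glue's Data Quality Definition Language (DQDL). The database, table and rule values below are assumptions for illustration; in practice you might start from the rules Glue recommends and fine-tune from there.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Rules are expressed in Glue's Data Quality Definition Language (DQDL).
# The database, table and thresholds below are assumed examples.
ruleset = """
Rules = [
    ColumnCount = 12,
    IsComplete "order_id",
    IsUnique "order_id",
    ColumnValues "order_status" in ["PENDING", "SHIPPED", "DELIVERED"]
]
"""

glue.create_data_quality_ruleset(
    Name="orders_quality_checks",
    Description="Baseline quality rules for the orders table",
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "sales_db", "TableName": "orders"},
)
```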
To stay up to date with the latest business and tech trends in data and analytics, make sure to subscribe to my newsletter, follow me on LinkedIn and YouTube, and, if you’re interested in taking a deeper dive into Snowflake, check out my books ‘Mastering Snowflake Solutions’ and ‘SnowPro Core Certification Study Guide’.
------------------------------------------------------------------------------------------------------------
About Adam Morton
Adam Morton is an experienced data leader and author in the field of data and analytics with a passion for delivering tangible business value. Over the past two decades Adam has accumulated a wealth of valuable, real-world experiences designing and implementing enterprise-wide data strategies, advanced data and analytics solutions as well as building high-performing data teams across the UK, Europe, and Australia.
Adam’s continued commitment to the data and analytics community has seen him formally recognised as an international leader in his field when he was awarded a Global Talent Visa by the Australian Government in 2019.
Today, Adam works in partnership with Intelligen Group, a Snowflake pureplay data and analytics consultancy based in Sydney, Australia. He is dedicated to helping his clients to overcome challenges with data while extracting the most value from their data and analytics implementations.
He has also developed a signature training program that includes an intensive online curriculum, weekly live consulting Q&A calls with Adam, and an exclusive mastermind of supportive data and analytics professionals helping you to become an expert in Snowflake. If you’re interested in finding out more, visit www.masteringsnowflake.com.