AWS & Snowflake | Better Together | Highlights from Re:Invent 2022
Adam Morton
Empowering data leaders with tech-agnostic, ROI-driven data strategies, design and execution | Best-Selling Author | Founder of Mastering Snowflake Program
Thank you for reading my latest article AWS & Snowflake | Better Together | Highlights from Re:Invent 2022
Here at LinkedIn I regularly write about modern data platforms and technology trends. To read my future articles simply join my network here or click 'Follow'. Also feel free to connect with me via YouTube.
------------------------------------------------------------------------------------------------------------
The landscape of data, analytics and machine learning is constantly in flux. It’s a hugely competitive space, with vendors of all shapes and sizes competing for a slice of a market estimated to have been worth a staggering USD 41.39 billion last year. And, if that’s not a big enough number to get your head around, just wait until you hear how much it's predicted to be worth by 2030: USD 364.33 billion! Crazy.
AWS re:Invent was held towards the very end of 2022, and Amazon came to the Christmas party early with a flurry of interesting announcements. In this article I’ll pick out the key ones which stood out to me. AWS has such a large footprint that it holds a distinct level of influence over where the market may go next, so you can think of this as a barometer of sorts for how data and analytics capabilities may evolve over the next 12 months.
Amazon DataZone
Organizations are often aware of how important it is to leverage and secure their data assets, but, as they grow to petabyte scale and beyond, data invariably spreads across multiple departments, data lakes and silos. Bringing this all together to ensure it's available to the right people at the right time is hard enough; add security into the mix and striking the right balance becomes a real challenge.
Amazon DataZone attempts to wrap this up as a Data Management as a Service offering, designed to enable businesses to bring all of their data together for use anywhere within the organization, along with granular management of essential features such as permissioning, security, and data governance. DataZone aims to securely democratize data while making data silos visible, so users across the organization can share, search, and discover data at scale across departments. The diagram below shows how DataZone is positioned within your data landscape; the idea is that you can log into a web portal and manage all of your data assets from one place.
Redshift
Central to the AWS data landscape is its version of the modern data platform, Redshift. Redshift is a fully managed, petabyte-scale data warehouse used by tens of thousands of customers to easily, quickly, securely, and cost-effectively analyze all their data at any scale. At re:Invent, Amazon announced a number of Redshift features to help you simplify data ingestion and get to insights easily and quickly, within a secure, reliable environment.
Organizations are often faced with the challenge of attempting to consolidate data from multiple disparate data sources into an enterprise data warehouse such as Redshift in order to generate actionable insights. This requires them to build manual data pipelines spanning across their operational databases, data lakes, streaming data, and data within their warehouse - which takes a lot of time, effort and money.
Auto-Copy from Amazon S3
One feature which aims to alleviate this pain is auto-copy from Amazon S3. Redshift automatically loads the files that arrive in an Amazon Simple Storage Service (Amazon S3) location that you specify into your data warehouse. The files can use any of the formats supported by the Amazon Redshift copy command, such as CSV, JSON, Parquet, and Avro. The diagram below illustrates this process:
This capability brings with it several key benefits, chief among them removing the need to build and maintain manual ingestion pipelines for files landing in S3, saving time, effort and money.
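To make this concrete, here is a minimal sketch of how an auto-copy job could be set up from Python, assuming the COPY JOB syntax Redshift introduced in preview alongside this announcement and illustrative bucket, role, cluster and table names; check the Amazon Redshift documentation for the exact DDL available in your region.

```python
import boto3

# Redshift Data API client; all names and ARNs below are illustrative placeholders.
redshift_data = boto3.client("redshift-data", region_name="us-east-1")

# A COPY JOB tells Redshift to watch the S3 prefix and load new files as they
# arrive. The exact COPY JOB syntax may differ by release -- consult the
# Amazon Redshift documentation before relying on it.
create_copy_job_sql = """
COPY sales.orders
FROM 's3://my-landing-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
FORMAT AS PARQUET
JOB CREATE orders_auto_copy
AUTO ON;
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="my-redshift-cluster",  # assumed cluster name
    Database="analytics",                     # assumed database name
    DbUser="etl_user",                        # assumed database user
    Sql=create_copy_job_sql,
)
print("Submitted COPY JOB statement:", response["Id"])
```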
Amazon Aurora zero-ETL integration with Amazon Redshift
In a process similar to the Amazon S3 auto-copy feature described above, Redshift is now able to auto-ingest data from one or more Aurora MySQL databases into a Redshift cluster at low latency.
With this capability, you can choose the Amazon Aurora databases containing the data you want to analyze with Amazon Redshift. Data is then replicated into your data warehouse within seconds after data is written into Amazon Aurora, eliminating the need to build and maintain complex data pipelines. I’d expect in the future that AWS will introduce more granular controls, such as selecting individual schemas and tables rather than the entire database.
Many customers already leverage Aurora to store transactional data, so the ability to quickly and efficiently move this data to Redshift for near real-time analysis will be highly compelling.
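Once an integration is in place, the replicated tables can be queried like any other Redshift data. Below is a minimal sketch using the Redshift Data API; the cluster, database and table names are assumptions for illustration, since the integration itself is configured through the AWS console.

```python
import time
import boto3

redshift_data = boto3.client("redshift-data", region_name="us-east-1")

# Query a table replicated from Aurora MySQL by the zero-ETL integration.
# 'aurora_sales' and 'orders' are assumed names matching whatever database
# the integration materializes in your Redshift cluster.
resp = redshift_data.execute_statement(
    ClusterIdentifier="my-redshift-cluster",
    Database="aurora_sales",
    DbUser="analyst",
    Sql="SELECT order_status, COUNT(*) AS order_count FROM orders GROUP BY order_status;",
)

# Poll until the statement finishes, then fetch the result set.
while True:
    status = redshift_data.describe_statement(Id=resp["Id"])
    if status["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if status["Status"] == "FINISHED":
    records = redshift_data.get_statement_result(Id=resp["Id"])["Records"]
    print(records)
```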
Multi-AZ support
Another feature announced at re:Invent addresses resilience and high availability. Customers can now run Redshift across multiple Availability Zones (AZs) within an AWS Region, replicating data across physically separate data centres to mitigate against outages or failures. Essentially, this acts as an insurance policy to ensure business-critical data remains available in the event of a disaster. AWS takes care of the automatic recovery, while the Redshift data warehouse behaves like one centralized data warehouse to all users.
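As a rough sketch, provisioning a Multi-AZ cluster programmatically could look something like the following, assuming the MultiAZ flag exposed by the Redshift CreateCluster API (the feature was in preview at announcement) and illustrative cluster settings.

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Provision an RA3 cluster with Multi-AZ enabled. The MultiAZ flag requires a
# recent boto3 release and an RA3 node type; every value here is an assumed
# example, not a recommendation.
redshift.create_cluster(
    ClusterIdentifier="analytics-multi-az",
    NodeType="ra3.4xlarge",
    NumberOfNodes=2,
    MasterUsername="admin",
    MasterUserPassword="REPLACE_WITH_A_STRONG_PASSWORD",
    DBName="analytics",
    MultiAZ=True,
)
```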
Dynamic Data Masking
Dynamic data masking has also been introduced in preview for Redshift. This enables you to load data once into Redshift and automatically apply masking policies at execution time, so users and applications only see the values their role permits. This simplifies the management of sensitive data stored in your data warehouse. Previously, customers often maintained multiple copies of data, or layers upon layers of complex views, to cater for these requirements.
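A minimal sketch of what this looks like in practice is below. The policy, table, column and role names are assumptions, and the exact DDL may differ from the preview documentation, so treat it as illustrative rather than definitive.

```python
import boto3

redshift_data = boto3.client("redshift-data", region_name="us-east-1")

# Create a masking policy and attach it to a column for a given role.
# Policy, table, column and role names are assumptions for illustration.
create_policy_sql = """
CREATE MASKING POLICY mask_credit_card
WITH (credit_card VARCHAR(256))
USING ('XXXX-XXXX-XXXX-' || SUBSTRING(credit_card, 16, 4));
"""

attach_policy_sql = """
ATTACH MASKING POLICY mask_credit_card
ON customers(credit_card)
TO ROLE analyst_role;
"""

redshift_data.batch_execute_statement(
    ClusterIdentifier="my-redshift-cluster",
    Database="analytics",
    DbUser="admin",
    Sqls=[create_policy_sql, attach_policy_sql],
)
```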
For more information on Dynamic Data Masking and how it works on Snowflake check out my video here.
Athena for Apache Spark
Apache Spark is a popular, open-source, distributed processing system designed to run fast analytics workloads on data of any size. However, building the infrastructure to run Apache Spark for interactive applications is not easy: customers need to provision, configure, and maintain that infrastructure in addition to their applications. Despite this, the Spark framework is very popular; I know of several customers who run PySpark data pipelines across their AWS infrastructure.
Because Amazon Athena for Apache Spark is serverless, customers can perform interactive data exploration and gain insights without needing to provision and maintain the resources required to run Apache Spark. With this feature, customers can now build Apache Spark applications using the notebook experience directly from the Athena console, or programmatically using APIs.
Amazon Athena integrates with the AWS Glue Data Catalog, which inspects and creates a taxonomy of your data assets. This allows customers to work with any data source in AWS Glue Data Catalog, including data in Amazon S3. Data scientists can then leverage this information to analyze, visualize and explore data in order to prepare data sets for machine learning pipelines.
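As an illustration of the programmatic route, the sketch below starts a Spark session in a Spark-enabled Athena workgroup and submits a small PySpark calculation against a Glue Data Catalog table. The workgroup, database and table names, and the DPU setting, are assumptions.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start a Spark session in a Spark-enabled Athena workgroup. In practice you
# would poll get_session() until the session is idle before submitting work.
session = athena.start_session(
    WorkGroup="spark-workgroup",                  # assumed Spark-enabled workgroup
    EngineConfiguration={"MaxConcurrentDpus": 4},
)

# PySpark submitted to the session; 'spark' is provided by the Athena runtime,
# and the Glue database/table names are assumed examples.
pyspark_code = """
df = spark.sql("SELECT * FROM my_glue_db.web_events LIMIT 10")
df.show()
"""

calc = athena.start_calculation_execution(
    SessionId=session["SessionId"],
    CodeBlock=pyspark_code,
    Description="Quick exploration of a Glue Data Catalog table",
)
print("Calculation started:", calc["CalculationExecutionId"])
```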
AWS Glue Data Quality
When it comes to AI and machine learning use cases, data quality is certainly an important factor in planning for a successful outcome. A new feature announced for AWS Glue, the data integration service, is a data quality component which attempts to identify ‘pollution’ in your data lake.
It can analyze your tables and automatically recommend a set of rules based on what it finds. You can fine-tune those rules if necessary, and you can also write your own. Each rule relates to a table, or to columns within a table, and validates properties such as timeliness, accuracy and integrity. For example, a rule can require that a table has the expected number of columns, that column names match a desired pattern, and that a specific column is usable as a primary key.
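As a rough example, a ruleset can be created programmatically using Glue's Data Quality Definition Language (DQDL). The database, table and rule values below are assumptions for illustration; in practice you might start from the rules Glue recommends and fine-tune from there.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Rules are expressed in Glue's Data Quality Definition Language (DQDL).
# The database, table and thresholds below are assumed examples.
ruleset = """
Rules = [
    ColumnCount = 12,
    IsComplete "order_id",
    IsUnique "order_id",
    ColumnValues "order_status" in ["PENDING", "SHIPPED", "DELIVERED"]
]
"""

glue.create_data_quality_ruleset(
    Name="orders_quality_checks",
    Description="Baseline quality rules for the orders table",
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "sales_db", "TableName": "orders"},
)
```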
To stay up to date with the latest business and tech trends in data and analytics, make sure to subscribe to my newsletter, follow me on LinkedIn and YouTube, and, if you’re interested in taking a deeper dive into Snowflake, check out my books ‘Mastering Snowflake Solutions’ and ‘SnowPro Core Certification Study Guide’.
------------------------------------------------------------------------------------------------------------
About Adam Morton
Adam Morton is an experienced data leader and author in the field of data and analytics with a passion for delivering tangible business value. Over the past two decades Adam has accumulated a wealth of valuable, real-world experiences designing and implementing enterprise-wide data strategies, advanced data and analytics solutions as well as building high-performing data teams across the UK, Europe, and Australia.
Adam’s continued commitment to the data and analytics community has seen him formally recognised as an international leader in his field when he was awarded a Global Talent Visa by the Australian Government in 2019.
Today, Adam works in partnership with Intelligen Group, a Snowflake pureplay data and analytics consultancy based in Sydney, Australia. He is dedicated to helping his clients to overcome challenges with data while extracting the most value from their data and analytics implementations.
He has also developed a signature training program that includes an intensive online curriculum, weekly live consulting Q&A calls with Adam, and an exclusive mastermind of supportive data and analytics professionals helping you to become an expert in Snowflake. If you’re interested in finding out more, visit www.masteringsnowflake.com.