Cloud & Data Metamorphosis, Part 3.4
Part of my book cover, "Artificial Intelligence & Analytics on AWS" by Kim Schmidt

Part 3.4 continues from where Video 3, Part 3 left off, which can be found here:

SECTION/VIDEO 3.4: “Cloud & Data Metamorphosis, Video 3, Part 4”

This 4th video in the 4-part series “Cloud & Data Metamorphosis, Video 3” is the last video in the Video 3 series. That means that by the end of this 3.4 video, you’ll have a complete foundation for the upcoming videos on specific AWS Services. Most of the upcoming videos will be on – but not limited to – AWS Glue, Amazon Athena, & the Amazon S3 Data Lake Storage Platform. I’m saving the last video in the overall series to give you a “wow” factor that will hopefully bring all the topics full-circle, with everything summed up in a frankly unexpected manner. As you look back after watching that last video, you’ll see that the end of the video is actually the beginning! That last video is entitled, “How to AI & ML Your Apps & Business Processes”.

There’s a ~7-minute video set to 3 songs that represents the journey I hope to take you on through my course & YouTube videos (its graphics are squares, the music stops abruptly, I lingered a bit long on John Lennon’s “Imagine”, & my grammar is atrocious!). You can find it on my YouTube channel, entitled “Learn to Use AWS AI, ML, & Analytics Fast!”. I’ll tempt you with that (I mean bad-quality!) video now…it can be viewed here. Keep in mind that I didn’t “polish” that quickly-created video, but nevertheless, it’s relevant (& FUN!)

Below you’ll find the embedded YouTube Video 3.4:

SECTION/VIDEO 3.4: “Cloud & Data Metamorphosis, Video 3, Part 4” in Text & Screenshots


This is the 4th & last video of the video series entitled “Cloud & Data Metamorphosis”, which augments my Pluralsight course, “Serverless Analytics on AWS”, highlighted with a blue rectangle & pointed to by a blue arrow. Under the blue arrow is the link to that Pluralsight course.


In Part 3 of “Cloud & Data Metamorphosis”, I covered the following:

  • Serverless Architectures
  • AWS Lambda
  • AWS’ Serverless Application Model, or SAM
  • All About AWS Containers
  • AWS Fargate
  • Amazon Elastic Container Registry (ECR)

In Part 4, I’ll cover:

  • The Evolution of Data Analysis Platform Technologies
  • Serverless Analytics
  • How to Give Red Bull to Your Data Transformations
  • AWS Glue & Amazon Athena
  • Clusterless Computing
  • An Introduction to Amazon S3 Data Lake Architectures

Continuing the theme of how technologies have evolved, in this section I’ll cover the Evolution of Data Analysis Platform Technologies.


The timeline on this slide shows the evolution of data analysis platform technologies.

Around the year 1985, Data Warehouse appliances were the platform of choice. These consisted of multi-core CPUs and networking components with improved storage devices such as Network Attached Storage, or NAS, appliances.

Around the year 2006, Hadoop clusters were the platform of choice. These consisted of a Hadoop master node and a network of many computers that provided a software framework for distributed storage and processing of big data using the MapReduce programming model. All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework. The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, which is MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code onto the nodes to process the data in parallel. You can see logos for Amazon Elastic MapReduce (EMR) that represent Hadoop frameworks, such as Spark, HBase, Presto, Hue, Hive, Pig, & Zeppelin.
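To make the MapReduce programming model concrete, here’s a minimal, self-contained Python sketch of its map, shuffle, & reduce phases, counting words across two “blocks” of text. This is just the shape of the computation, not Hadoop’s actual API; Hadoop runs these same phases distributed across the nodes of a cluster.

```python
from collections import defaultdict

# Two "blocks" of input, standing in for file blocks spread across HDFS nodes.
blocks = ["big data on aws", "big analytics on aws"]

# Map phase: each node emits (key, value) pairs from its local block.
mapped = [(word, 1) for block in blocks for word in block.split()]

# Shuffle phase: pairs are grouped by key across the cluster.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: each key's values are combined into a final result.
totals = {word: sum(counts) for word, counts in groups.items()}
print(totals)  # {'big': 2, 'data': 1, 'on': 2, 'aws': 2, 'analytics': 1}
```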


Superman is actually an animated gif in the video. Everything Superman is flying towards was created by AWS.


Around the year 2009, Decoupled EMR clusters were the platform of choice. The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), shown here using Amazon S3 via EMRFS (the EMR File System), and a processing part, which is the MapReduce programming model. This is the first occurrence of compute/memory being decoupled from storage. Hadoop again splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code onto the nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.

Around the year 2012, the Amazon Redshift Cloud Data Warehouse was introduced, which was a very transformative data analysis platform technology for oh so many awesome reasons! The diagram on the timeline underneath the year 2012 is difficult to read, so I’ll explain it to you. Directly under the written year 2012 is a set of 3 “sample” client apps that talk to the Leader node & receive information back from it via ODBC or JDBC connections. Under the Leader node are multiple Compute nodes with multiple node slices that have 2-way communication with the Leader node as well as with a variety of data sources, shown by the 4 icons on the bottom of that image.

Today, the data analysis platform technology of choice is serverless & clusterless computing via an Amazon S3 data lake, using AWS Glue for ETL & Amazon Athena for SQL, both having 2-way communication with the underlying S3 data store.


Data is changing, & so must the types of analytics services used to glean insights from all of the new types of data.

Today, data is captured & stored at PB, EB, & even ZB scale.

Some of the new types of analytics that can be used in a cost-efficient way include the following:

  • Machine Learning
  • Big Data Processing
  • Real-time Analytics
  • And, Full-text Search

Today, Amazon S3 Data Lakes deliver on-demand analytics across a wide variety of analytics types. The services on the right of the image above will be covered in depth in the next video, Video 4, which will cover Amazon S3 Data Lakes.

When you use Amazon S3 Data Lakes as your analytics platform storage, some of the benefits you get include the following:

  • There's zero infrastructure & zero administration because everything is serverless
  • You never pay for idle resources; only when you spin them up, use them, & then shut them down
  • Resources automatically scale to handle whatever usage needs you have at the time
  • And, high availability & fault tolerance is built in

At this point, I’m excited to share with you the two innovative, cutting-edge serverless analytics services provided by AWS: Amazon Athena & AWS Glue! These services are cutting-edge not only because they’re state-of-the-art technologies, but also because they’re serverless. Having a cloud-native serverless architecture enables you to build modern applications with increased agility & a lower cost of ownership. It enables you to shift most of your operational & infrastructure management responsibilities to AWS, so you can focus on developing great products that are highly reliable & scalable. Joining the AWS Services of Glue & Athena is the Amazon S3 Data Lake Platform. S3 Data Lakes will be covered in the next video.


Data preparation is by far the most difficult & time-consuming task when mapping disparate data types for data analytics. 60% of time is spent on cleaning & organizing data. 19% of time is spent collecting datasets. The third most time-consuming task is mining data for patterns. The fourth most time-consuming task is refining algorithms. The fifth most time-consuming task falls under the broad category of “Other”. The sixth most time-consuming task is building training sets for Machine Learning.

The moral of this story is that there HAS TO BE A SOLUTION TO DECREASE THE TIME SPENT ON ALL THESE TASKS! Well, there is, & I can’t wait to share it with you! I’ll begin in the next few slides, then elaborate more in the next video.


AWS Glue solves the business problems of heterogeneous data transformation and globally siloed data.

Let’s look at what AWS Glue “Is” (a short code sketch follows this list):

  • The AWS Glue Data Catalog provides 1 Location for All Globally Siloed Data – NO MATTER WHERE IN THE WORLD THE UNDERLYING DATA STORE IS!
  • AWS Glue Crawlers crawl global data sources, populate the Glue Data Catalog with enough metadata & statistics to recreate the dataset when needed for analytics, & keep the Data Catalog in sync with all changes to data located across the globe
  • AWS Glue automatically identifies data formats & data types
  • AWS Glue has built-in Error Handling
  • AWS Glue Jobs perform the data transformation, which can be automated via a Glue Job Scheduler, either event-based or time-based
  • ETL is one of the common data transformations AWS Glue performs, but there are many other transformations built in
  • AWS Glue has monitoring and alerting built in!
  • And, AWS Glue ELIMINATES DARK DATA
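To make the crawler workflow above concrete, here’s a minimal boto3 sketch. It’s a sketch under assumptions, not production code: the crawler name, IAM role, database, & S3 path are hypothetical placeholders you’d replace with your own.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Point a crawler at an S3 path; the names & IAM role are placeholders.
glue.create_crawler(
    Name="sales-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",
    DatabaseName="sales_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/sales/"}]},
    Schedule="cron(0 2 * * ? *)",  # re-crawl nightly to keep the catalog in sync
)
glue.start_crawler(Name="sales-crawler")

# Once the crawler finishes, the Data Catalog holds the inferred tables.
for table in glue.get_tables(DatabaseName="sales_catalog")["TableList"]:
    print(table["Name"], table["StorageDescriptor"]["Location"])
```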

Amazon Athena solves the business problems of heterogeneous data analytics & gives the ability to instantaneously query data without ETL.

Let’s look at what Amazon Athena “Is” (a short query sketch follows this list):

  • Amazon Athena is an interactive query service
  • You query data directly from S3 using ANSI SQL
  • You can analyze unstructured, semi-structured, & structured data
  • Athena scales automatically
  • Query execution is extremely fast because queries execute in parallel
  • You can query encrypted data in S3 & write encrypted data back to another S3 bucket
  • And, you only pay for the queries you run
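Here’s a minimal boto3 sketch of querying data directly from S3, as described in the list above. The database, table, & S3 result bucket are hypothetical placeholders; the calls themselves are the standard Athena API.

```python
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit standard ANSI SQL directly against files in S3.
query = athena.start_query_execution(
    QueryString="SELECT region, SUM(amount) AS total FROM sales GROUP BY region",
    QueryExecutionContext={"Database": "sales_catalog"},
    ResultConfiguration={"OutputLocation": "s3://my-query-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes; you pay per query, not for a cluster.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# The first row returned is the header row.
if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```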

I’ll now touch on the Data Architecture evolution regarding Amazon S3 Data Lakes.

Serverless Architectures remove most of the need for traditional “always on” server components. The term “CLUSTERLESS” means today’s architectures don’t need a dedicated cluster of 2 or more computers working at the same time, thus these services are both Serverless & Clusterless.

AWS Glue, Amazon Athena, & Amazon S3 are the 3 core services that make AWS Data Lake Architectures possible!!! These 3 AWS Services are pretty AMAZING!

Under Amazon Athena’s covers are both Presto & Apache Hive. Presto is an in-memory distributed SQL query engine used for DML (Data Manipulation Language) statements…like SELECT. It can query data where it is stored, without needing to move the data into a separate analytics system. Query execution runs in parallel over a pure memory-based architecture, with most results returning in seconds. It can perform interactive data analysis against GB to PB of data. And, it’s ANSI-SQL compatible with extensions.

Hive is used to execute DDL statements…“Data Definition Language”, a subset of SQL statements that change the structure of the database schema in some way, typically by creating, deleting, or modifying schema objects such as databases, tables, & views in Amazon Athena. It can work with complex data types & a plethora of data formats. It’s used by Amazon Athena to partition data. Hive also supports MSCK REPAIR TABLE (or, ALTER TABLE RECOVER PARTITIONS) to recover partitions & the data associated with those partitions.
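To tie the DDL & partitioning pieces together, here’s a hedged sketch of the Hive-style DDL Athena accepts, submitted through the same start_query_execution call shown earlier. The table definition, JSON format, & S3 partition layout are assumptions for illustration.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")
context = {"Database": "sales_catalog"}
results = {"OutputLocation": "s3://my-query-results/"}

# Hive-style DDL: define a partitioned external table over JSON files in S3.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS sales (
  order_id string,
  amount   double
)
PARTITIONED BY (year string, month string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://my-data-lake/sales/'
"""
athena.start_query_execution(
    QueryString=ddl, QueryExecutionContext=context, ResultConfiguration=results
)

# Discover partitions laid out as s3://my-data-lake/sales/year=2019/month=06/.
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE sales",
    QueryExecutionContext=context,
    ResultConfiguration=results,
)
```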

AWS Glue builds on Apache Spark to offer ETL-specific functionality. Apache Spark is a high-performance, in-memory data processing framework that can perform large-scale data processing. You can configure your AWS Glue jobs & development endpoints to use the Data Catalog as an external Apache Hive metastore, because the Data Catalog is Hive-metastore compatible. You can then directly run Apache Spark SQL queries against the tables stored in the Data Catalog. By using Spark SQL you can use existing Hive metastores, SerDes, & UDFs.
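Inside a Glue job script, that Spark-plus-Data-Catalog pairing looks roughly like the sketch below. It assumes a Glue job configured to use the Data Catalog as its Hive metastore, & a “sales_catalog” database already populated by a crawler; the names are placeholders.

```python
# Runs inside an AWS Glue job, where the awsglue libraries are provided.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read a catalog table as a DynamicFrame, Glue's schema-flexible frame.
sales = glue_context.create_dynamic_frame.from_catalog(
    database="sales_catalog", table_name="sales"
)

# Because the Data Catalog is Hive-metastore compatible, plain Spark SQL
# works against the same tables.
totals = spark.sql(
    "SELECT region, SUM(amount) AS total FROM sales_catalog.sales GROUP BY region"
)
totals.show()

# Write the transformed data back to the lake in a columnar format.
glue_context.write_dynamic_frame.from_options(
    frame=sales,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake/curated/sales/"},
    format="parquet",
)
```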

Either way, serverless architectures provide all the benefits of cloud computing, but with considerably less time spent creating, maintaining, & monitoring infrastructure, & at amazing cost savings!

I’m going to end this video with some awesome quotes…


Hopefully you remember from Video 2 in this series that we’re in the midst of the 4th Industrial Revolution. On the above slide is a quote from Miguel Milano, President, EMEA, Salesforce. It reads, “The technology driving the 4th Industrial Revolution gives companies an ever-greater influence on how we live, work, & function as a society. As a result, customers expect business leaders to go beyond shareholders & have a longer-term, broader strategic approach”. In other words, what worked yesterday will not work today, at least not for long. Businesses & we ourselves HAVE to keep up with the rapid pace of technological change. “It’s the end of the world as we know it!”


The quote above is from Jeff Bezos, Founder & CEO of Amazon.com. It reads, “In today’s era of volatility, there’s no other way but to re:invent. The only sustainable advantage you can have over others is agility, that’s it. Because, nothing else is sustainable, everything else you create, someone else will replicate.”

It’s interesting to watch how blazingly fast everything we do in life is recorded through the very technologies that also help us. People have always stood on the shoulders of giants to bootstrap their careers, but this time it’s different. If you don’t learn AI today, it’s a sobering fact that you’ll fall behind the pack. So, keep up with my course & these videos, stand on my “dwarf” (not giant! I’m REALLY SHORT!) shoulders, & be tenacious to thrive in the 4th Industrial Revolution. I know you can do it!


Keeping the last quote in mind, read the next quote above. It’s a quote from Brendan Witcher, Principal Analyst at Forrester. It reads, “You’re not behind your competitors; you’re behind your customers – beyond their expectations”. There are 2 concepts I’d like you to ponder here. First, although you need to keep up with & ideally surpass your competition, knowing what your customers want is always the way to approach a business or a job, & AI can tell you that in real-time. Secondly, as your competitors offer state-of-the-art AI solutions, your customers will come to expect that from anyone they choose to do business with.


The last quote I’ll leave you with is a bit of an extension of the last one. This one is from Theodore Levitt, former Harvard Business School Marketing Professor. It reads, “People don’t buy a quarter-inch drill. They want a quarter-inch hole”. Acquiring the talent to decipher customers’ requests into what they really mean but perhaps can’t articulate is a valuable characteristic indeed!


This is the end of multi-part Video 3 in the parent video series “AI, ML, & Advanced Analytics on AWS”. In the next video in this series, Video #4, I’ll go into depth on just how cool the Amazon S3 Data Lake Storage Platform is & why. I’ll also describe how AWS Glue & Amazon Athena fit into that platform. I think you’ll be amazed at how these technologies together, along with other AWS Services, provide a complete portfolio of data exploration, reporting, analytics, machine learning, AI, & visualization tools to use on all of your data.

By the way, every top-level video in this series will end with this slide, with this image on it & the URL at the top. The URL leads you to my book site, where you can download a 99-page chapter on how to create a predictive analytics workflow on AWS using Amazon SageMaker, Amazon DynamoDB, AWS Lambda, & some other really awesome AWS technologies. The reason the chapter is 99 pages long is that I leave no black boxes, so people who are advanced at analytics can get something out of it, as well as complete novices. I walk readers step-by-step through creating the workflow & describe why each service is used & what it’s doing. Note, however, that it’s not fully edited, but there’s a lot of content as I walk you through building the architecture, with a lot of screenshots so you can confirm you’re following the instructions correctly.

I’ll get Video 4 up asap!

#gottaluvAWS!

