Cloud & Data Metamorphosis, Part 3.4
Part of my book cover, "Artificial Intelligence & Analytics on AWS" by Kim Schmidt

Part 3.4 continues from where Video 3, Part 3 left off, which can be found here:

SECTION/VIDEO 3.4: “Cloud & Data Metamorphosis, Video 3, Part 4”

This 4th video in the 4-part series “Cloud & Data Metamorphosis, Video 3” is the last video in the Video 3 series. That means that by the end of this 3.4 video, you’ll have a complete foundation for the upcoming videos on specific AWS Services. Most of the upcoming videos will be on – but not limited to – AWS Glue, Amazon Athena, & the Amazon S3 Data Lake Storage Platform. I’m saving the last video in the overall series to give you a “wow” factor that will hopefully bring all the topics full-circle, with everything summed up in a frankly unexpected manner. As you look back after watching that last video, you’ll see that the end of the video is actually the beginning! That last video is entitled, “How to AI & ML Your Apps & Business Processes”.

There’s a ~7-minute video set to 3 songs that represents the journey I hope to take you on through my course & YouTube videos (its graphics are squares, the music stops abruptly, I lingered a bit long on John Lennon’s “Imagine”, & my grammar is atrocious!). You can find it on my YouTube channel, entitled “Learn to Use AWS AI, ML, & Analytics Fast!”. I’ll tempt you with that (I mean bad-quality!) video now…it can be viewed here. Keep in mind that I didn’t “polish” that quickly-created video, but nevertheless, it’s relevant (& FUN!)

Below you’ll find the embedded YouTube Video 3.4:

SECTION/VIDEO 3.4: “Cloud & Data Metamorphosis, Video 3, Part 4” in Text & Screenshots


This is the 4th & last video of the video series entitled “Cloud & Data Metamorphosis”, which augments my Pluralsight course, “Serverless Analytics on AWS”, highlighted with a blue rectangle & pointed to by a blue arrow. Under the blue arrow is the link to that Pluralsight course.


In Part 3 of “Cloud & Data Metamorphosis”, I covered the following:

  • Serverless Architectures
  • AWS Lambda
  • AWS’ Serverless Application Model, or SAM
  • All About AWS Containers
  • AWS Fargate
  • Amazon Elastic Container Registry (ECR)

In Part 4, I’ll cover:

  • The Evolution of Data Analysis Platform Technologies
  • Serverless Analytics
  • How to Give Red Bull to Your Data Transformations
  • AWS Glue & Amazon Athena
  • Clusterless Computing
  • An Introduction to Amazon S3 Data Lake Architectures

Continuing the theme of how technologies have evolved, in this section I’ll cover the Evolution of Data Analysis Platform Technologies.


The timeline on this slide shows the evolution of data analysis platform technologies.

Around the year 1985, Data Warehouse appliances were the platform of choice. These consisted of multi-core CPUs and networking components with improved storage devices such as Network Attached Storage, or NAS, appliances.

Around the year 2006, Hadoop clusters were the platform of choice. These consisted of a Hadoop master node and a network of many computers that provided a software framework for distributed storage and processing of big data using the MapReduce programming model. All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework. The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, which is MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code onto the nodes to process the data in parallel. You can see logos for Amazon Elastic MapReduce (EMR) that represent Hadoop frameworks, such as Spark, HBase, Presto, Hue, Hive, Pig, & Zeppelin.
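To make the MapReduce programming model concrete, here’s a minimal, self-contained Python sketch of its map, shuffle, & reduce phases, counting words across two “blocks” of text. This is just the shape of the computation, not Hadoop’s actual API; Hadoop runs these same phases distributed across the nodes of a cluster.

```python
from collections import defaultdict

# Two "blocks" of input, standing in for file blocks spread across HDFS nodes.
blocks = ["big data on aws", "big analytics on aws"]

# Map phase: each node emits (key, value) pairs from its local block.
mapped = [(word, 1) for block in blocks for word in block.split()]

# Shuffle phase: pairs are grouped by key across the cluster.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: each key's values are combined into a final result.
totals = {word: sum(counts) for word, counts in groups.items()}
print(totals)  # {'big': 2, 'data': 1, 'on': 2, 'aws': 2, 'analytics': 1}
```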


Superman is actually an animated gif in the video. Everything Superman is flying towards was created by AWS.


Around the year 2009, Decoupled EMR clusters were the platform of choice. The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), shown here using Amazon S3 via EMRFS (the EMR File System), and a processing part, which is the MapReduce programming model. This is the first occurrence of compute/memory being decoupled from storage. Hadoop again splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code onto the nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.

Around the year 2012, the Amazon Redshift Cloud Data Warehouse was introduced, which was a very transformative data analysis platform technology for oh so many awesome reasons! The diagram on the timeline underneath the year 2012 is difficult to read, so I’ll explain it to you. Directly under the written year 2012 is a set of 3 “sample” client apps that talk to the Leader node & receive information back from it via ODBC or JDBC connections. Under the Leader node are multiple Compute nodes with multiple node slices that have 2-way communication with the Leader node as well as with a variety of data sources, shown by the 4 icons on the bottom of that image.

Today, the data analysis platform technology of choice is serverless & clusterless computing via an Amazon S3 data lake, using AWS Glue for ETL & Amazon Athena for SQL, both having 2-way communication with the underlying S3 data store.


Data is changing, & so must the types of analytics services used to glean insights from all of the new types of data.

Today, data is captured & stored at PB, EB, & even ZB scale.

Some of the new types of analytics that can be used in a cost-efficient way include the following:

  • Machine Learning
  • Big Data Processing
  • Real-time Analytics
  • And, Full-text Search

Today, Amazon S3 Data Lakes deliver on-demand analytics across a wide variety of analytics types. The services on the right of the image above will be covered in depth in the next video, Video 4, which will cover Amazon S3 Data Lakes.

When you use Amazon S3 Data Lakes as your analytics platform storage, some of the benefits you get include the following:

  • There's zero infrastructure & zero administration because everything is serverless
  • You never pay for idle resources; only when you spin them up, use them, & then shut them down
  • Resources automatically scale to handle whatever usage needs you have at the time
  • And, high availability & fault tolerance is built in

At this point, I’m excited to share with you the two innovative, cutting-edge serverless analytics services provided by AWS: Amazon Athena & AWS Glue! These services are cutting-edge not only because they’re state-of-the-art technologies, but also because they’re serverless. Having a cloud-native serverless architecture enables you to build modern applications with increased agility & a lower cost of ownership. It enables you to shift most of your operational & infrastructure management responsibilities to AWS, so you can focus on developing great products that are highly reliable & scalable. Joining the AWS Services of Glue & Athena is the Amazon S3 Data Lake Platform. S3 Data Lakes will be covered in the next video.


Data preparation is by far the most difficult & time-consuming task when mapping disparate data types for data analytics. 60% of time is spent on cleaning & organizing data. 19% of time is spent collecting datasets. The third most time-consuming task is mining data for patterns. The fourth most time-consuming task is refining algorithms. The fifth most time-consuming task falls under the broad category of “Other”. The sixth most time-consuming task is building training sets for Machine Learning.

The moral of this story is that there HAS TO BE A SOLUTION TO DECREASE THE TIME SPENT ON ALL THESE TASKS! Well, there is, & I can’t wait to share it with you! I’ll begin in the next few slides, then elaborate more in the next video.


AWS Glue solves the business problems of heterogeneous data transformation and globally siloed data.

Let’s look at what AWS Glue “Is” (a short code sketch follows this list):

  • The AWS Glue Data Catalog provides 1 Location for All Globally Siloed Data – NO MATTER WHERE IN THE WORLD THE UNDERLYING DATA STORE IS!
  • AWS Glue Crawlers crawl global data sources, populate the Glue Data Catalog with enough metadata & statistics to recreate the dataset when needed for analytics, & keep the Data Catalog in sync with all changes to data located across the globe
  • AWS Glue automatically identifies data formats & data types
  • AWS Glue has built-in Error Handling
  • AWS Glue Jobs perform the data transformation, which can be automated via a Glue Job Scheduler, either event-based or time-based
  • ETL is one of the common data transformations AWS Glue performs, but there are many other transformations built in
  • AWS Glue has monitoring and alerting built in!
  • And, AWS Glue ELIMINATES DARK DATA
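To make the crawler workflow above concrete, here’s a minimal boto3 sketch. It’s a sketch under assumptions, not production code: the crawler name, IAM role, database, & S3 path are hypothetical placeholders you’d replace with your own.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Point a crawler at an S3 path; the names & IAM role are placeholders.
glue.create_crawler(
    Name="sales-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",
    DatabaseName="sales_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/sales/"}]},
    Schedule="cron(0 2 * * ? *)",  # re-crawl nightly to keep the catalog in sync
)
glue.start_crawler(Name="sales-crawler")

# Once the crawler finishes, the Data Catalog holds the inferred tables.
for table in glue.get_tables(DatabaseName="sales_catalog")["TableList"]:
    print(table["Name"], table["StorageDescriptor"]["Location"])
```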

Amazon Athena solves the business problems of heterogeneous data analytics & gives the ability to instantaneously query data without ETL.

Let’s look at what Amazon Athena “Is” (a short query sketch follows this list):

  • Amazon Athena is an interactive query service
  • You query data directly from S3 using ANSI SQL
  • You can analyze unstructured, semi-structured, & structured data
  • Athena scales automatically
  • Query execution is extremely fast because queries execute in parallel
  • You can query encrypted data in S3 & write encrypted data back to another S3 bucket
  • And, you only pay for the queries you run
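Here’s a minimal boto3 sketch of querying data directly from S3, as described in the list above. The database, table, & S3 result bucket are hypothetical placeholders; the calls themselves are the standard Athena API.

```python
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit standard ANSI SQL directly against files in S3.
query = athena.start_query_execution(
    QueryString="SELECT region, SUM(amount) AS total FROM sales GROUP BY region",
    QueryExecutionContext={"Database": "sales_catalog"},
    ResultConfiguration={"OutputLocation": "s3://my-query-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes; you pay per query, not for a cluster.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# The first row returned is the header row.
if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```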

I’ll now touch on the Data Architecture evolution regarding Amazon S3 Data Lakes.

Serverless Architectures remove most of the need for traditional “always on” server components. The term “CLUSTERLESS” means today’s architectures don’t need a dedicated cluster of 2 or more computers working at the same time, thus these services are both Serverless & Clusterless.

AWS Glue, Amazon Athena, & Amazon S3 are the 3 core services that make AWS Data Lake Architectures possible!!! These 3 AWS Services are pretty AMAZING!

Under Amazon Athena’s covers are both Presto & Apache Hive. Presto is an in-memory distributed SQL query engine used for DML (Data Manipulation Language) statements…like SELECT. It can query data where it is stored, without needing to move the data into a separate analytics system. Query execution runs in parallel over a pure memory-based architecture, with most results returning in seconds. It can perform interactive data analysis against GB to PB of data. And, it’s ANSI-SQL compatible with extensions.

Hive is used to execute DDL statements…“Data Definition Language”, a subset of SQL statements that change the structure of the database schema in some way, typically by creating, deleting, or modifying schema objects such as databases, tables, & views in Amazon Athena. It can work with complex data types & a plethora of data formats. It’s used by Amazon Athena to partition data. Hive also supports MSCK REPAIR TABLE (or, ALTER TABLE RECOVER PARTITIONS) to recover partitions & the data associated with those partitions.
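To tie the DDL & partitioning pieces together, here’s a hedged sketch of the Hive-style DDL Athena accepts, submitted through the same start_query_execution call shown earlier. The table definition, JSON format, & S3 partition layout are assumptions for illustration.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")
context = {"Database": "sales_catalog"}
results = {"OutputLocation": "s3://my-query-results/"}

# Hive-style DDL: define a partitioned external table over JSON files in S3.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS sales (
  order_id string,
  amount   double
)
PARTITIONED BY (year string, month string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://my-data-lake/sales/'
"""
athena.start_query_execution(
    QueryString=ddl, QueryExecutionContext=context, ResultConfiguration=results
)

# Discover partitions laid out as s3://my-data-lake/sales/year=2019/month=06/.
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE sales",
    QueryExecutionContext=context,
    ResultConfiguration=results,
)
```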

AWS Glue builds on Apache Spark to offer ETL-specific functionality. Apache Spark is a high-performance, in-memory data processing framework that can perform large-scale data processing. You can configure your AWS Glue jobs & development endpoints to use the Data Catalog as an external Apache Hive metastore, because the Data Catalog is Hive-metastore compatible. You can then directly run Apache Spark SQL queries against the tables stored in the Data Catalog. By using Spark SQL you can use existing Hive metastores, SerDes, & UDFs.
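Inside a Glue job script, that Spark-plus-Data-Catalog pairing looks roughly like the sketch below. It assumes a Glue job configured to use the Data Catalog as its Hive metastore, & a “sales_catalog” database already populated by a crawler; the names are placeholders.

```python
# Runs inside an AWS Glue job, where the awsglue libraries are provided.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read a catalog table as a DynamicFrame, Glue's schema-flexible frame.
sales = glue_context.create_dynamic_frame.from_catalog(
    database="sales_catalog", table_name="sales"
)

# Because the Data Catalog is Hive-metastore compatible, plain Spark SQL
# works against the same tables.
totals = spark.sql(
    "SELECT region, SUM(amount) AS total FROM sales_catalog.sales GROUP BY region"
)
totals.show()

# Write the transformed data back to the lake in a columnar format.
glue_context.write_dynamic_frame.from_options(
    frame=sales,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake/curated/sales/"},
    format="parquet",
)
```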

Either way, serverless architectures provide all the benefits of cloud computing, but with considerably less time spent creating, maintaining, & monitoring infrastructure, & at amazing cost savings!

I’m going to end this video with some awesome quotes…


Hopefully you remember from Video 2 in this series that we’re in the midst of the 4th Industrial Revolution. On the above slide is a quote from Miguel Milano, President, EMEA, Salesforce. It reads, “The technology driving the 4th Industrial Revolution gives companies an ever-greater influence on how we live, work, & function as a society. As a result, customers expect business leaders to go beyond shareholders & have a longer-term, broader strategic approach”. In other words, what worked yesterday will not work today, at least not for long. Businesses & we ourselves HAVE to keep up with the rapid pace of technological change. “It’s the end of the world as we know it!”


The quote above is from Jeff Bezos, Founder & CEO of Amazon.com. It reads, “In today’s era of volatility, there’s no other way but to re:invent. The only sustainable advantage you can have over others is agility, that’s it. Because, nothing else is sustainable, everything else you create, someone else will replicate.”

It’s interesting to watch how blazingly fast everything we do in life is recorded through the very technologies that also help us. People have always stood on the shoulders of giants to bootstrap their careers, but this time it’s different. If you don’t learn AI today, it’s a sobering fact that you’ll fall behind the pack. So, keep up with my course & these videos, stand on my “dwarf” (not giant! I’m REALLY SHORT!) shoulders, & be tenacious to thrive in the 4th Industrial Revolution. I know you can do it!


Keeping the last quote in mind, read the next quote above. It’s a quote from Brendan Witcher, Principal Analyst at Forrester. It reads, “You’re not behind your competitors; you’re behind your customers – beyond their expectations”. There are 2 concepts I’d like you to ponder here. First, although you need to keep up with & ideally surpass your competition, knowing what your customers want is always the way to approach a business or a job, & AI can tell you that in real-time. Secondly, as your competitors offer state-of-the-art AI solutions, your customers will come to expect that from anyone they choose to do business with.


The last quote I’ll leave you with is a bit of an extension of the last one. This one is from Theodore Levitt, former Harvard Business School Marketing Professor. It reads, “People don’t buy a quarter-inch drill. They want a quarter-inch hole”. Acquiring the talent to decipher customers’ requests into what they really mean but perhaps can’t articulate is a valuable characteristic indeed!


This is the end of multi-part Video 3 in the parent video series “AI, ML, & Advanced Analytics on AWS”. In the next video in this series, Video #4, I’ll go into depth on just how cool the Amazon S3 Data Lake Storage Platform is & why. I’ll also describe how AWS Glue & Amazon Athena fit into that platform. I think you’ll be amazed at how these technologies together, along with other AWS Services, provide a complete portfolio of data exploration, reporting, analytics, machine learning, AI, & visualization tools to use on all of your data.

By the way, every top-level video in this series will end with this slide, with this image on it & the URL at the top. The URL leads you to my book site, where you can download a 99-page chapter on how to create a predictive analytics workflow on AWS using Amazon SageMaker, Amazon DynamoDB, AWS Lambda, & some other really awesome AWS technologies. The reason the chapter is 99 pages long is that I leave no black boxes, so people who are advanced at analytics can get something out of it, as well as complete novices. I walk readers step-by-step through creating the workflow & describe why each service is used & what it’s doing. Note, however, that it’s not fully edited, but there’s a lot of content as I walk you through building the architecture, with a lot of screenshots so you can confirm you’re following the instructions correctly.

I’ll get Video 4 up asap!

#gottaluvAWS!

