DATA LAKES

Data Lakes & Serverless Computing

Data lake vs. data lakehouse vs. data warehouse

Types of data
• Data lake: all types (structured, semi-structured and unstructured/raw data)
• Data lakehouse: all types (structured, semi-structured and unstructured/raw data)
• Data warehouse: structured data only

Introduction

On August 19, 2020, more than 50 participants gathered virtually to hear Aditya “Adi” Challa, AWS Solutions Architect with Amazon Web Services (AWS), carefully walk through how to build and automate a modern serverless data lake on AWS as part of ASU UTO’s Innov8: A Speaker Series. Challa, who has more than 15 years of experience in architecting, designing, building and implementing IT solutions for various verticals including academic, financial and fundraising organizations, started the session by polling the audience on their understanding of data lakes and serverless computing.

Defining Data Lakes & Serverless Computing

A data lake, a system or repository of data stored in its natural/raw format, usually represents the single store of all data from an enterprise. It can be established “on premises” (within an organization’s data centres) or “in the cloud” (using cloud services from vendors like Amazon Web Services), and data lakes are essential to the maintenance of an organization’s crucial information. Serverless computing can be used in support of data lakes: a cloud provider runs the servers and dynamically manages resources. More than 10,000 serverless data lakes are currently being built and maintained on AWS.

What is a data lake?


A data lake is a central location that holds a large amount of data in its native, raw format. Compared to a hierarchical data warehouse, which stores data in files or folders, a data lake uses a flat architecture and object storage to store the data. Object storage stores data with metadata tags and a unique identifier, which makes it easier to locate and retrieve data across regions, and improves performance. By leveraging inexpensive object storage and open formats, data lakes enable many applications to take advantage of the data.


Data lakes were developed in response to the limitations of data warehouses. While data warehouses provide businesses with highly performant and scalable analytics, they are expensive and proprietary and can't handle the modern use cases most companies are looking to address. Data lakes are often used to consolidate all of an organization’s data in a single, central location, where it can be saved “as is,” without the need to impose a schema (i.e., a formal structure for how the data is organized) up front like a data warehouse does. Data in all stages of the refinement process can be stored in a data lake: raw data can be ingested and stored right alongside an organization’s structured, tabular data sources (like database tables), as well as intermediate data tables generated in the process of refining raw data. Unlike most databases and data warehouses, data lakes can process all data types — including unstructured and semi-structured data like images, video, audio and documents — which are critical for today’s machine learning and advanced analytics use cases.


Methods of Building a Data Lake & Common Misconceptions of Data Lakes

There are five typical steps in building a data lake (a minimal sketch of the first two steps follows the list):

1. Set up storage

2. Move data

3. Cleanse, prep, and catalogue data

4. Configure and enforce security and compliance policies

5. Make data available for analytics
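
Concretely, the first two steps amount to little more than provisioning object storage and landing raw files in it. Here is a minimal sketch using boto3; the bucket and file names are hypothetical, and it assumes AWS credentials configured for the us-east-1 region (other regions also require a CreateBucketConfiguration).

    # Minimal sketch of steps 1-2: set up storage, then move data into it.
    # Bucket and file names are hypothetical; assumes credentials for us-east-1.
    import boto3

    s3 = boto3.client("s3")

    # Step 1: set up storage. An S3 bucket serves as the lake's landing zone.
    s3.create_bucket(Bucket="example-data-lake-raw")

    # Step 2: move data. Land the raw file as-is, without imposing a schema.
    s3.upload_file("events.json", "example-data-lake-raw", "raw/events.json")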

“The whole purpose of the data lake is to democratize access to this data and to avoid silos,” Challa said. “This [data lake] brings everything together.”

While there are common misconceptions of what a data lake is, it is more flexible than the more traditional “data warehouse.” “In the old days when we had data in data warehouses, we had to, ahead of time, know the schema of the data that’s being stored and if there was any ETL (extracting, transforming and loading) that had to be done,” explained Challa. “When there was a change in the data we had to stop, change the schema of the tables in the data warehouse and then write it.” Data lakes, by contrast, are schema on read, he said. “Data warehouses only do structured data, whereas data lakes can take videos, text files, logs, JSONs, XMLs, you name it. It can take any kind of data as long as you have room for that data in your data lake.”
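
To make “schema on read” concrete: the structure of the data is discovered when the file is read, not declared before it is written. A minimal sketch, assuming a local newline-delimited JSON file (the filename is hypothetical):

    # Schema on read: pandas infers column names and types at read time;
    # nothing about the structure was declared when the file was written.
    import pandas as pd

    df = pd.read_json("raw/events.json", lines=True)  # schema inferred here
    print(df.dtypes)  # discovered columns and types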

Benefits of Using AWS for Big Data & Analytics and Featured Services & Products

AWS has introduced two services in the last two years that have become extremely popular; a short sketch of invoking the second follows the list:

  • AWS Transfer for SFTP: a fully managed service enabling transfer of data over SFTP directly into and out of Amazon S3
  • AWS DataSync: a transfer service that simplifies, automates and accelerates data movement
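
As a hedged illustration, starting a DataSync transfer from code is a single API call once a task exists. The task ARN below is hypothetical, and a task linking a source and a destination location must already be configured.

    # Kick off an AWS DataSync transfer with boto3. The task ARN is
    # hypothetical; the task's source/destination must already be set up.
    import boto3

    datasync = boto3.client("datasync")
    response = datasync.start_task_execution(
        TaskArn="arn:aws:datasync:us-east-1:123456789012:task/task-0example"
    )
    print(response["TaskExecutionArn"])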

About 80 percent of any work on the data lake is data preparation. “We want to make sure that we provide the best tools and most cost-effective tools to our customers,” said Challa. “And that’s why we have AWS Glue.”

AWS Glue is a serverless ETL service. Challa explained how you set up a data catalog, ETL and data prep with AWS Glue. Challa also presented Lambda, a productivity-focused computing platform for building powerful, dynamic and modular applications in the cloud.
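
One way Glue and Lambda fit together, sketched under assumptions (the crawler name and event wiring are hypothetical, not necessarily the exact architecture from the talk): an object landing in the raw bucket triggers a Lambda function, which starts a Glue crawler so the Data Catalog stays current.

    # Hypothetical event-driven pattern: S3 upload -> Lambda -> Glue crawler.
    # The crawler name is an assumption; it must already be defined in Glue.
    import boto3

    glue = boto3.client("glue")

    def handler(event, context):
        # Each record describes an object created in the raw zone.
        for record in event["Records"]:
            print("New object landed:", record["s3"]["object"]["key"])
        # Re-crawl so the Data Catalog reflects the newly landed data.
        glue.start_crawler(Name="raw-zone-crawler")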

Power data science and machine learning


Data lakes allow you to transform raw data into structured data that is ready for SQL analytics, data science and machine learning with low latency. Raw data can be retained indefinitely at low cost for future use in machine learning and analytics.
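
A short PySpark sketch of that refinement step, with hypothetical paths and column names (it assumes a Spark installation with S3 connectors configured):

    # Refine raw JSON in the lake into columnar Parquet ready for SQL analytics.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("refine").getOrCreate()

    raw = spark.read.json("s3a://example-data-lake-raw/raw/events/")
    (raw.select("user_id", "event_type", "event_date")   # hypothetical columns
        .write.mode("overwrite")
        .parquet("s3a://example-data-lake-raw/refined/events/"))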


Centralize, consolidate and catalogue your data

A centralized data lake eliminates problems with data silos (like data duplication, multiple security policies and difficulty with collaboration), offering downstream users a single place to look for all sources of data.


Quickly and seamlessly integrate diverse data sources and formats

Any and all data types can be collected and retained indefinitely in a data lake, including batch and streaming data, video, image, binary files and more. And since the data lake provides a landing zone for new data, it is always up to date.


Democratize your data by offering users self-service tools

Data lakes are incredibly flexible, enabling users with completely different skills, tools and languages to perform different analytics tasks all at once.


Data lake challenges


Despite their pros, many of the promises of data lakes have not been realized due to the lack of some critical features: no support for transactions, no enforcement of data quality or governance, and poor performance optimizations. As a result, most of the data lakes in the enterprise have become data swamps.


Reliability issues

Without the proper tools in place, data lakes can suffer from data reliability issues that make it difficult for data scientists and analysts to reason about the data. These issues can stem from difficulty combining batch and streaming data, data corruption and other factors.


Slow performance

As the size of the data in a data lake increases, the performance of traditional query engines gets slower. Bottlenecks include metadata management, improper data partitioning and others.
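
Partitioning is the most tractable of these bottlenecks. A minimal sketch, with hypothetical paths and partition column: writing data partitioned by a low-cardinality column lets query engines prune whole directories instead of scanning the entire lake.

    # Partition on a low-cardinality column so query engines can prune files.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-demo").getOrCreate()

    events = spark.read.parquet("s3a://example-data-lake-raw/refined/events/")
    (events.write.mode("overwrite")
           .partitionBy("event_date")  # one directory per date, pruned at query time
           .parquet("s3a://example-data-lake-raw/partitioned/events/"))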


Lack of security features

Data lakes are hard to properly secure and govern due to the lack of visibility and ability to delete or update data. These limitations make it very difficult to meet the requirements of regulatory bodies.

For these reasons, a traditional data lake on its own is not sufficient to meet the needs of businesses looking to innovate, which is why businesses often operate complex architectures, with data siloed away in different storage systems: data warehouses, databases and other systems across the enterprise. Simplifying that architecture by unifying all your data in a data lake is the first step for companies that aspire to harness the power of machine learning and data analytics to win in the next decade.


How a lakehouse solves those challenges


The answer to the challenges of data lakes is the lakehouse, which adds a transactional storage layer on top. A lakehouse uses data structures and data management features similar to those in a data warehouse, but runs them directly on cloud data lakes. Ultimately, a lakehouse allows traditional analytics, data science and machine learning to coexist in the same system, all in an open format. A lakehouse enables a wide range of new use cases for cross-functional enterprise-scale analytics, BI and machine learning projects that can unlock massive business value. Data analysts can harvest rich insights by querying the data lake using SQL, data scientists can join and enrich data sets to generate ML models with ever greater accuracy, data engineers can build automated ETL pipelines, and business intelligence analysts can create visual dashboards and reporting tools faster and more easily than before. These use cases can all be performed on the data lake simultaneously, without lifting and shifting the data, even while new data is streaming in.
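
As a minimal illustration of the SQL path, with hypothetical paths and columns: the refined files are registered as a view and queried with ordinary SQL, no copy into a separate warehouse required.

    # Query the lake directly with SQL: register refined files as a view.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-on-lake").getOrCreate()

    spark.read.parquet("s3a://example-data-lake-raw/refined/events/") \
         .createOrReplaceTempView("events")

    spark.sql("""
        SELECT event_type, COUNT(*) AS n
        FROM events
        GROUP BY event_type
        ORDER BY n DESC
    """).show()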


Building a lakehouse with Delta Lake


To build a successful lakehouse, organizations have turned to Delta Lake, an open format data management and governance layer that combines the best of both data lakes and data warehouses. Across industries, enterprises are leveraging Delta Lake to power collaboration by providing a reliable, single source of truth. By delivering quality, reliability, security and performance on your data lake — for both streaming and batch operations — Delta Lake eliminates data silos and makes analytics accessible across the enterprise. With Delta Lake, customers can build a cost-efficient, highly scalable lakehouse that eliminates data silos and provides self-service analytics to end users.
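
A minimal sketch of the transactional layer in action, assuming a local Spark session with the open-source delta-spark package on the classpath (paths are hypothetical):

    # Write and read a Delta table. Requires the delta-spark package; the
    # two configs below enable Delta's SQL extension and catalog.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("delta-demo")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    df = spark.range(5)  # stand-in for refined data
    df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

    # Reads see a consistent snapshot, even while writers are active.
    spark.read.format("delta").load("/tmp/delta/events").show()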

Data lakes can:

• Store data: Data lakes can store large amounts of data without size limits.

• Process data: Data lakes can process any variety of data.

• Run analytics: Data lakes can run analytics on machine-generated data.

Some steps for building a data lake include:

• Setting up storage

• Moving data

• Cleansing, preparing, and cataloguing data

• Configuring and enforcing security and compliance policies


Lakehouse best practices

Use the data lake as a landing zone for all of your data

Save all of your data into your data lake without transforming or aggregating it to preserve it for machine learning and data lineage purposes.

Mask data containing private information before it enters your data lake

Personally identifiable information (PII) must be pseudonymized in order to comply with GDPR and to ensure that it can be saved indefinitely.
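
A minimal sketch of one common pseudonymization approach, a keyed hash (the key below is a placeholder; in practice it would live in a secrets manager): the raw identifier is replaced before landing, yet equal inputs still map to equal outputs, so records remain joinable.

    # Pseudonymize a PII value with a keyed hash (HMAC-SHA256) before it
    # enters the lake. The key is a placeholder; keep real keys in a
    # secrets manager and rotate them.
    import hashlib
    import hmac

    SECRET_KEY = b"example-placeholder-key"

    def pseudonymize(value: str) -> str:
        return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

    print(pseudonymize("jane.doe@example.com"))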


Secure your data lake with role- and view-based access controls

Adding view-based ACLs (access control lists) enables more precise tuning and control over the security of your data lake than role-based controls alone, as in the sketch below.
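
A hedged sketch of the idea, with hypothetical table, view and group names: publish a view that omits sensitive columns and grant privileges on the view rather than the base table. The GRANT line is commented out because it requires an engine that enforces SQL object privileges (for example, Databricks SQL), not a bare local Spark session.

    # View-based access control: expose a column-restricted view of the data.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("acl-demo").getOrCreate()

    # Stand-in base table with one PII column (email) and two public fields.
    spark.createDataFrame(
        [("a@example.com", "click", "2020-08-19")],
        ["email", "event_type", "event_date"],
    ).createOrReplaceTempView("events")

    # Publish a view that omits the PII column; consumers query the view.
    spark.sql("""
        CREATE OR REPLACE TEMP VIEW events_public AS
        SELECT event_type, event_date FROM events
    """)

    # Granting on the view, not the base table, gives column-level control.
    # Requires an engine enforcing SQL privileges (e.g., Databricks SQL):
    # spark.sql("GRANT SELECT ON VIEW events_public TO `analysts`")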


Build reliability and performance into your data lake by using Delta Lake

The nature of big data has made it difficult to offer the same level of reliability and performance available with databases until now. Delta Lake brings these important features to data lakes.


Catalogue the data in your data lake


Use data catalogue and metadata management tools at the point of ingestion to enable self-service data science and analytics; a small sketch of querying such a catalogue follows the quote below.

“Shell has been undergoing a digital transformation as part of our ambition to deliver more and cleaner energy solutions. As part of this, we have been investing heavily in our data lake architecture. Our ambition has been to enable our data teams to rapidly query our massive data sets in the simplest possible way. The ability to execute rapid queries on petabyte scale data sets using standard BI tools is a game changer for us.” — Dan Jeavons, GM Data Science, Shell
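
For instance, with the AWS Glue Data Catalog discussed earlier, discovering what is in the lake becomes a metadata query rather than a bucket crawl (the database name is hypothetical):

    # List catalogued tables so analysts discover data via metadata.
    import boto3

    glue = boto3.client("glue")
    for table in glue.get_tables(DatabaseName="data_lake")["TableList"]:
        print(table["Name"], table.get("Description", ""))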


Summary

The main benefits of a serverless data lake are that it’s just that, serverless from start to finish, and you only pay when files come in and are transferred and processed. An AWS Big Data Blog post recaps how to build and automate a serverless data lake using an AWS Glue trigger for the Data Catalog and ETL jobs; it includes an AWS CloudFormation template that can set up the architecture for you, so you can follow the instructions and try your hand at how a serverless data lake works.

