EMC’s Federation Business Data Lake…Data Lake 2.0

Bill Schmarzo

Dean of Big Data, CDO Chief AI Officer Whisperer, recognized global innovator, educator, and practitioner in Big Data, Data Science, & Design Thinking

Published: March 25, 2015

EMC’s March 23rd announcement of the Federation Business Data Lake (FBDL) has been a long time in the making. Many smart people have been working hard to pull together EMC’s FBDL solution, and I am excited, if for no other reason than that I get to say I saw the data lake coming as early as May 2012, when I published my “Understanding the Role of Hadoop In Your BI Environment” blog. Okay, okay, I originally called it a Hadoop-based “Operational Data Store,” but even though I missed on the name, I got many of the key benefits right:

“Hadoop brings at least two significant advantages to your ETL and data staging processes. The first is the ability to ingest massive amounts of data as-is. That means that you do not need to pre-define the data schema before loading data into Hadoop. This includes not only traditional transactional data (e.g., point-of-sale transactions, call detail records, general ledger transactions, call center transactions), but also unstructured internal data (like consumer comments, doctor’s notes, insurance claims descriptions, and web logs) and external social media data (from social media sites such as LinkedIn, Pinterest, Facebook and Twitter). So regardless of the structure of your incoming data, you can rapidly load it all into Hadoop, as-is, where it then becomes available for your downstream ETL, DW, and analytic processes (see Stage 1 in Figure 1 below).

Figure 1: My Original "Data Lake" Graphic

The second advantage that Hadoop brings to your BI/DW architecture occurs once the data is in the Hadoop environment. Once it’s in your Hadoop ODS, you can leverage the inherently parallel nature of Hadoop to perform your traditional ETL work of cleansing, normalizing, aligning, and creating aggregates for your EDW at massive scale.”
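To make the two advantages in the quoted passage concrete, here is a minimal Python sketch. It is not Hadoop code: an in-memory list stands in for the raw landing zone, a thread pool stands in for a cluster, and the record layouts, field names, and function names (`ingest_as_is`, `parse_sale`, `aggregate_sales`) are all invented for illustration. It shows data being loaded without a pre-defined schema, with structure applied only when a downstream job reads it, followed by a parallel cleanse-and-aggregate step.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import json

# Hypothetical landing zone: raw records are stored untouched, with no schema
# declared up front (a stand-in for files landed in Hadoop).
def ingest_as_is(landing_zone, source, raw_record):
    landing_zone.append({"source": source, "raw": raw_record})

# Schema-on-read: structure is applied only when a downstream job reads the data.
def parse_sale(entry):
    if entry["source"] != "pos":
        return None  # e.g. free-text comments stay in the lake but are skipped here
    try:
        record = json.loads(entry["raw"])
        # Cleansing/normalizing: trim and upper-case the store id, coerce the amount.
        return record["store"].strip().upper(), float(record["amount"])
    except (ValueError, KeyError, TypeError, AttributeError):
        return None  # malformed records are dropped by this job, not at load time

def aggregate_sales(landing_zone, workers=4):
    # Map step runs in a worker pool; a real cluster would spread it across nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parsed = list(pool.map(parse_sale, landing_zone))
    # Reduce step: per-store totals, the kind of aggregate an EDW might load.
    totals = defaultdict(float)
    for item in parsed:
        if item is not None:
            store, amount = item
            totals[store] += amount
    return dict(totals)

lake = []
ingest_as_is(lake, "pos", '{"store": " sf-01 ", "amount": "19.50"}')
ingest_as_is(lake, "pos", '{"store": "SF-01", "amount": 5.5}')
ingest_as_is(lake, "pos", '{"store": "NY-02", "amount": 10}')
ingest_as_is(lake, "comments", "Loved the new checkout flow!")
ingest_as_is(lake, "pos", "not valid json")
store_totals = aggregate_sales(lake)
```

Note that all five records land in `lake`, including the free-text comment and the malformed line; only the aggregation job decides which ones it can use. That separation of loading from interpretation is the point of the "as-is" ingest described above.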

Data Lake Enabling Trends

A perfect storm of industry trends has made big data and the data lake a feasible data architecture option. Those industry trends are:

Data Growth: Web applications, social media, mobile apps, sensors, scanners, wearable computing and the Internet of Things are all generating an avalanche of new, more granular data about your customers, channels, products and operations that can now be captured, integrated, mined and acted upon.
Cheap Storage: The cost of storage is plummeting, which is enabling organizations to “think differently” about data. Leading organizations are transitioning from viewing data as a cost to be minimized to valuing data as an asset to be hoarded, even if they don’t yet know how they will use that data. They are transitioning to a “data abundance” mentality.
Limitless Computing: The ability to bring an almost limitless amount of computing power to bear on any business problem is enabling organizations to process, enrich and analyze this growing wealth of data to uncover actionable insights about their customers and their business operations.
Real-time Technologies: Low-latency data access and analysis are enabling organizations to identify and monetize “events in the moment” while there is still value in the freshness or recency of the event.
Open Source Software: Open source software is democratizing software tools like Hadoop, R, Shark, YARN, Mahout, MADlib, etc. by putting these tools within the reach of any organization. Open source software is fueling innovation from startups, Fortune 500 organizations, universities and digital media companies; it is liberating organizations from being held captive by the product development cycles of the traditional enterprise software vendors.
Data Science: For me, this is the most exciting industry trend. The convergence of analytic tools with the volume, variety and velocity of data, combined with training and education, business-centric methodologies (see my book “Big Data: Understanding How Data Powers Big Business”) and innovative thinking, is enabling organizations to “weave data hay into business gold” by uncovering customer, product and operational insights from the data lake that can be used to optimize key business processes and uncover new monetization opportunities.

Think “Interconnected Tissue,” Not Technology Stack

What I’ve found is that data lake early adopters have really been building Data Lake 1.0, and Data Lake 1.0 was really just a technology stack composed of storage, compute, and analytic “layers.” Data Lake 1.0 hinted at the potential of the data lake concept, but due to lack of experience, immature tools and unproven design methodologies, problems soon started to crop up. One classic Data Lake 1.0 problem is the proliferation of data lakes: we are starting to take data lakes down the same value-destructive path that we took the data warehouse, ending up with siloed data lakes just as we ended up with siloed data warehouses.

EMC’s FBDL is really the first of the Data Lake 2.0 solutions as the data lake matures from just a technology stack to a living “interconnected tissue” entity (see Figure 2).

Figure 2: EMC’s Federation Business Data Lake = Data Lake 2.0

Data Lake 2.0 will address issues such as silos in data lakes, and operational challenges like cataloging and data governance, within the very fabric that comprises it. Important capabilities that ensure that the storage, compute and analytics capabilities work together are part of the “interconnected tissue” within Data Lake 2.0. This includes key capabilities such as virtualization and data disciplines such as data governance, data quality, master data management, privacy, security and data lineage/audit/traceability. The EMC FBDL Data Governor will eventually provide this “interconnected tissue” support, enabling the data science teams to quickly find, provision, analyze and act upon the data in the data lake in a self-sufficient, “fail fast and learn quickly” manner.
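As a rough illustration of what cataloging and lineage can mean in practice, here is a small Python sketch of a dataset catalog. It is purely hypothetical and does not represent the EMC FBDL Data Governor or any EMC API; the `Catalog` and `DatasetEntry` classes, their methods, and the dataset names are all assumptions made up for this example. It shows the two things a data science team needs in order to be self-sufficient: finding datasets by tag, and tracing where a derived dataset came from.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    # Hypothetical catalog record; a real governance tool would track far more.
    name: str
    owner: str
    tags: set = field(default_factory=set)
    derived_from: list = field(default_factory=list)  # names of parent datasets

class Catalog:
    def __init__(self):
        self._entries = {}

    def register(self, name, owner, tags=(), derived_from=()):
        # Refuse lineage that points at datasets the catalog has never seen.
        missing = [p for p in derived_from if p not in self._entries]
        if missing:
            raise ValueError(f"unknown parent datasets: {missing}")
        self._entries[name] = DatasetEntry(name, owner, set(tags), list(derived_from))

    def find(self, tag):
        # Discovery: every dataset carrying the given tag, in name order.
        return sorted(e.name for e in self._entries.values() if tag in e.tags)

    def lineage(self, name):
        # Traceability: walk parents depth-first, listing each ancestor once.
        seen = []
        stack = list(self._entries[name].derived_from)
        while stack:
            parent = stack.pop()
            if parent not in seen:
                seen.append(parent)
                stack.extend(self._entries[parent].derived_from)
        return seen

catalog = Catalog()
catalog.register("raw_pos", "sales-ops", tags={"customer", "raw"})
catalog.register("raw_web_logs", "web-team", tags={"raw"})
catalog.register("customer_360", "analytics", tags={"customer"},
                 derived_from=["raw_pos", "raw_web_logs"])
catalog.register("churn_scores", "data-science", tags={"customer", "model"},
                 derived_from=["customer_360"])

customer_datasets = catalog.find("customer")
churn_ancestors = catalog.lineage("churn_scores")
```

Registering every dataset in one shared catalog, rather than one per lake, is one simple way to picture how governance woven through the lake can work against the silo problem described earlier.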

What Does The Future Hold?

EMC’s Federation Business Data Lake takes a big step in the maturation of the data lake by leveraging the big data industry trends to create a living “interconnected tissue” entity. And what does Data Lake 3.0 look like? I don’t know for certain, but I suspect that we’ll see more “smarts” built into the data lake that might even make data and analytic model recommendations based upon the business problem that the business stakeholders are trying to solve; think the “Smart” Data Lake!

The features outlined in the FBDL Data Lake 2.0 vision will fuel the business transformational processes that we are already seeing underway at many clients. But there is still a long way to go, as tools, training and methodologies will need to continue to evolve to help organizations “think differently” about the role of data and analytics in powering their value creation processes.

--------------------

Thanks for taking the time to read my post. I’m fortunate that I spend most of my time with very interesting clients, who fuel many of my topics. I hope that you will leave a comment or some thoughts about the blog. If you would like to read my regular posts, then please click 'Follow' and let’s also connect via Twitter.

In case you are interested, here are some of my favorite posts:

Big Data Senior Executive C.A.R.E. Package
To Achieve Big Data’s Potential, Get It Into The Boardroom
Vision Workshop
Big Data Business Model Maturity Index (animation)
Big Data For Competitive Differentiation
Developing a Business Strategy with Big Data
User Experience: the new king of the business
How I’ve Learned To Stop Worrying And Love The Data Lake

Bill Schmarzo is the author of the book “Big Data: Understanding How Data Powers Big Business”.

Gregory Papaleoni

Driving Business Decisions with Data

9 years ago

It's fun to be right, especially three years in the making. Looking forward to what's to come. Great stuff, Bill.

Mark Orth

Senior WW Alliance Marketing Manager at Hewlett Packard Enterprise

9 years ago

Great post Bill Schmarzo. I agree with your supposition on Data Lake 3.0 and potential data awareness capabilities. And since you were visionary about data lakes back in 2012 I'd bet on this next wave of transformation.
