EMC’s Federation Business Data Lake…Data Lake 2.0

The March 23rd announcement of EMC’s Federation Business Data Lake (FBDL) has been a long time in the making. Many smart people have been working hard to pull together EMC’s FBDL solution, and I am excited, if for no other reason than that I get to say I saw the data lake coming as early as May 2012, when I published my “Understanding the Role of Hadoop In Your BI Environment” blog. Okay, okay, I originally called it a Hadoop-based “Operational Data Store,” but despite missing on the name, I got many of the key benefits right:

“Hadoop brings at least two significant advantages to your ETL and data staging processes. The first is the ability to ingest massive amounts of data as-is. That means that you do not need to pre-define the data schema before loading data into Hadoop. This includes not only traditional transactional data (e.g., point-of-sale transactions, call detail records, general ledger transactions, call center transactions), but also unstructured internal data (like consumer comments, doctor’s notes, insurance claims descriptions, and web logs) and external social media data (from social media sites such as LinkedIn, Pinterest, Facebook and Twitter). So regardless of the structure of your incoming data, you can rapidly load it all into Hadoop, as-is, where it then becomes available for your downstream ETL, DW, and analytic processes (see Stage 1 in Figure 1 below).

Figure 1: My Original "Data Lake" Graphic

“The second advantage that Hadoop brings to your BI/DW architecture occurs once the data is in the Hadoop environment. Once it’s in your Hadoop ODS, you can leverage the inherently parallel nature of Hadoop to perform your traditional ETL work of cleansing, normalizing, aligning, and creating aggregates for your EDW at massive scale.”
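
To make those two stages a bit more concrete, here is a minimal Python sketch of the idea. It is not Hadoop code: a thread pool stands in for Hadoop's parallelism, and the sample records, the `cleanse` and `build_aggregates` functions, and the field names are all my own assumptions for illustration. Stage 1 lands the raw lines untouched with no schema declared; Stage 2 applies a schema at read time, cleanses and normalizes the records in parallel, and builds a per-store aggregate.

```python
import json
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Stage 1: land records as-is. No schema is declared up front;
# each raw line is stored untouched (hypothetical sample data).
RAW_LANDING_ZONE = [
    '{"type": "pos", "store": " sf-01 ", "amount": "19.99"}',
    '{"type": "pos", "store": "SF-01", "amount": "5.01"}',
    '{"type": "comment", "text": "Great service!"}',
    '{"type": "pos", "store": "ny-07", "amount": "12.50"}',
    'not even json: a raw web log line',
]


def cleanse(raw_line):
    """Apply a schema at read time; return a normalized sale or None."""
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError:
        return None  # unparseable data stays in the lake; this job skips it
    if not isinstance(record, dict) or record.get("type") != "pos":
        return None  # not a point-of-sale record, so not part of this job
    return {
        "store": record["store"].strip().upper(),
        "amount_cents": round(float(record["amount"]) * 100),
    }


def build_aggregates(raw_lines, workers=4):
    """Stage 2: cleanse records in parallel, then total sales per store."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        cleaned = list(pool.map(cleanse, raw_lines))
    totals = defaultdict(int)
    for sale in cleaned:
        if sale is not None:
            totals[sale["store"]] += sale["amount_cents"]
    return dict(totals)


sales_by_store = build_aggregates(RAW_LANDING_ZONE)
print(sales_by_store)
```

Notice that the customer comment and the raw web log line are never rejected at load time. They simply sit in the landing zone until some other job reads them with a schema of its own.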

Data Lake Enabling Trends

A perfect storm of industry trends is making big data and the data lake a feasible data architecture option. Those industry trends are:

  • Data Growth- Web applications, social media, mobile apps, sensors, scanners, wearable computing and the Internet of Things are all generating an avalanche of new, more granular data about your customers, channels, products and operations that now can be captured, integrated, mined and acted upon.
  • Cheap Storage- The cost of storage is plummeting, which is enabling organizations to “think differently” about data. Leading organizations are transitioning from viewing data as a cost to be minimized to valuing data as an asset to be hoarded, even if they don’t yet know how they will use that data; in other words, they are transitioning to a “data abundance” mentality.
  • Limitless Computing- The ability to bring to bear an almost limitless amount of computing power to any business problem is enabling organizations to process, enrich and analyze this growing wealth of data to uncover actionable insights about their customers and their business operations.
  • Real-time Technologies- Low-latency data access and analysis is enabling organizations to identify and monetize “events in the moment” while there is still value in the freshness or recency of the event.
  • Open Source Software- Open source software is democratizing software tools like Hadoop, R, Shark, YARN, Mahout, MADlib, etc. by putting these tools within the reach of any organization. Open source software is fueling innovation from startups, Fortune 500 organizations, universities and digital media companies; it is liberating organizations from being held captive by the product development cycles of the traditional enterprise software vendors.
  • Data Science- For me, this is the most exciting industry trend. The convergence of analytic tools, the volume, variety and velocity of data combined with training and education, business-centric methodologies (see my book “Big Data: Understanding How Data Powers Big Business”) and innovative thinking is enabling organizations to “weave data hay into business gold” by uncovering customer, product and operational insights from the data lake that can be used to optimize key business processes and uncover new monetization opportunities.

Think “Interconnected Tissue,” Not Technology Stack

What I’ve found is that data lake early adopters have really been building Data Lake 1.0, and Data Lake 1.0 was really just a technology stack composed of storage, compute, and analytic “layers.” Data Lake 1.0 hinted at the potential of the data lake concept, but due to a lack of experience, immature tools and unproven design methodologies, problems soon started to crop up. One classic Data Lake 1.0 problem is the proliferation of data lakes: we are starting to take data lakes down the same value-destructive path that we took the data warehouse, ending up with siloed data lakes just as we ended up with siloed data warehouses.

EMC’s FBDL is really the first of the Data Lake 2.0 solutions as the data lake matures from just a technology stack to a living “interconnected tissue” entity (see Figure 2).

Figure 2: EMC’s Federation Business Data Lake = Data Lake 2.0

Issues such as silos in data lakes and operational challenges like cataloging and data governance will be addressed within the very fabric that comprises Data Lake 2.0. Important capabilities that ensure that the storage, compute and analytics capabilities work together are part of the “interconnected tissue” within Data Lake 2.0. This includes key capabilities such as virtualization and data disciplines such as data governance, data quality, master data management, privacy, security and data lineage/audit/traceability. The EMC FBDL Data Governor will eventually provide this “interconnected tissue” support, enabling the data science teams to quickly find, provision, analyze and act upon the data in the data lake in a self-sufficient, “fail fast and learn quickly” manner.
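
For readers who think in code, here is a toy Python sketch of two of those “interconnected tissue” disciplines: cataloging (so a data scientist can find data by tag) and lineage (so anyone can trace a data set back to its sources). To be clear, this is not how the EMC FBDL Data Governor works. The `DatasetEntry` and `Catalog` classes, their fields, and the sample data set names are all hypothetical, chosen only to illustrate the concept.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One catalog record; every field name here is hypothetical."""
    name: str
    owner: str
    tags: set = field(default_factory=set)
    derived_from: list = field(default_factory=list)  # direct parent data sets


class Catalog:
    """A toy in-memory catalog with tag search and lineage tracing."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Governance rule: lineage may only point at data sets already known.
        for parent in entry.derived_from:
            if parent not in self._entries:
                raise KeyError(f"unknown parent dataset: {parent}")
        self._entries[entry.name] = entry

    def find(self, tag):
        """Return the names of all data sets carrying the tag, sorted."""
        return sorted(e.name for e in self._entries.values() if tag in e.tags)

    def lineage(self, name):
        """Return every upstream data set, nearest ancestors first."""
        seen = []
        queue = list(self._entries[name].derived_from)
        while queue:
            parent = queue.pop(0)
            if parent not in seen:
                seen.append(parent)
                queue.extend(self._entries[parent].derived_from)
        return seen


catalog = Catalog()
catalog.register(DatasetEntry("raw_pos", "retail-ops", {"sales", "raw"}))
catalog.register(DatasetEntry("clean_pos", "data-eng", {"sales"}, ["raw_pos"]))
catalog.register(
    DatasetEntry("store_totals", "analytics", {"sales", "aggregate"}, ["clean_pos"])
)

sales_datasets = catalog.find("sales")
upstream = catalog.lineage("store_totals")
print(sales_datasets)
print(upstream)
```

The point of the sketch is that lineage is enforced at registration time rather than bolted on afterwards, which is the difference between governance as part of the fabric and governance as an afterthought.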

What Does The Future Hold?

EMC’s Federation Business Data Lake takes a big step in the maturation of the data lake by leveraging the big data industry trends to create a living “interconnected tissue” entity. And what does Data Lake 3.0 look like? I don’t know for certain, but I suspect that we’ll see more “smarts” built into the data lake that might even make data and analytic model recommendations based upon the business problem that the business stakeholders are trying to solve; think the “Smart” Data Lake!

The features outlined in the FBDL Data Lake 2.0 vision will fuel the business transformational processes that we are already seeing underway at many clients. But there is still a long way to go, as tools, training and methodologies will need to continue to evolve to help organizations “think differently” about the role of data and analytics in powering their value creation processes.

--------------------

Thanks for taking the time to read my post. I’m fortunate that I spend most of my time with very interesting clients, who fuel many of my topics. I hope that you are able to leave a comment or some thoughts about the blog. If you would like to read my regular posts then please click 'Follow' and let’s also connect via Twitter.

Bill Schmarzo is the author of the book “Big Data: Understanding How Data Powers Big Business”.

Gregory Papaleoni

Driving Business Decisions with Data

9 years ago

It's fun to be right, especially three years in the making. Looking forward to what's to come. Great stuff, Bill.

Mark Orth

Senior WW Alliance Marketing Manager at Hewlett Packard Enterprise

9 years ago

Great post Bill Schmarzo. I agree with your supposition on Data Lake 3.0 and potential data awareness capabilities. And since you were visionary about data lakes back in 2012 I'd bet on this next wave of transformation.
