SQL has made a Big (Data) comeback!

And that makes Hadoop-based Data Lakes a thing of the past.

With the emergence of next-generation data platforms featuring Distributed SQL and NoSQL capabilities (for handling diverse workloads) and a unified storage engine (for handling structured, semi-structured and unstructured data) on a single platform, companies are rethinking their Data Lake strategy.

Hadoop-based Data Lake implementations in most companies (including many Banks) are struggling to deliver ROI, limiting those companies' ability to implement analytics use cases within an acceptable time frame.

We propose a design strategy that addresses this risk. The strategy was not technically feasible a few years ago; thanks to the emergence of Distributed SQL platforms, SQL is making a Big (Data) comeback!

Proposed Design in a Nutshell:

1. Implement a proven Distributed SQL Platform (e.g., SingleStore) as a primary component of the Data Lake.

  • ETL all source systems and manual/CSV data directly into this platform.
  • Design the physical data models and schemas specific to each business function and logically join them to create cross-functional views (see the SQL sketch after this list).

2. Use Object Storage (traditionally used for building Data Lakes) to store all unstructured data, as well as less-used or less-important data (which can be either structured or semi-structured).

  • Attach the Object Storage to the primary Platform and access/query the semi-structured and structured data directly from within the Platform.
  • For unstructured data, access depends on where and how we plan to use it. For example, a Data Scientist can read PDF files stored in Object Storage through an MLOps tool and build a document classification model.
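
To make step 1 concrete, here is a minimal SQL sketch of function-specific schemas joined into a cross-functional view. Every database, table and column name below is hypothetical, and exact DDL varies by platform; treat it as an illustration of the pattern, not a definitive implementation.

    -- Hypothetical function-specific schemas (names are illustrative only)
    CREATE DATABASE finance;
    CREATE DATABASE risk;
    CREATE DATABASE analytics;

    -- Physical models live with the business function that owns them
    CREATE TABLE finance.loans (
        loan_id       BIGINT PRIMARY KEY,
        customer_id   BIGINT,
        principal     DECIMAL(18, 2),
        originated_on DATE
    );

    CREATE TABLE risk.credit_scores (
        customer_id BIGINT PRIMARY KEY,
        score       SMALLINT,
        scored_on   DATE
    );

    -- A logical view joins the schemas into a cross-functional picture
    CREATE VIEW analytics.loan_risk AS
    SELECT l.loan_id,
           l.principal,
           s.score
    FROM finance.loans AS l
    JOIN risk.credit_scores AS s
      ON s.customer_id = l.customer_id;

Because the view is logical, each business function keeps ownership of its physical model while analysts query a single, joined surface.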


Insights from a Gartner Analyst

“Remember that data lakes do not have to be on a non-relational Hadoop environment. You can build your data lake on a relational database,” he said. “Many of the organizations we talk to, 90% of the data they’re putting on their data lake is structured relational data. So why not put it into a relational database?”

— Donald Feinberg, VP and distinguished analyst in the Gartner Data and Analytics group


Reference Architecture

Please refer to the hero image at the top of this page.

Each component in the stack is explained below. The proposed architecture:

1. Includes a primary data storage and processing engine to handle the majority of the data types the organization deals with (which is structured relational data). The same engine should also be able to handle semi-structured NoSQL data (JSON) as well as streaming/IoT data (see the SQL sketch after the feature list below).

Potential vendors to choose from: Distributed SQL platforms like SingleStore, Teradata, Snowflake, Apache Druid, VMware Greenplum, etc.

Key technical features to look for in the platform:

  • A database engine optimized for both OLAP and OLTP workloads
  • An intelligent storage engine that can leverage in-memory, disk and object storage, and handle both column and row type data structures
  • Fast INSERTs, Big JOINs and Fast DELETEs
  • Distributed, multi-node cluster architecture with no single point of failure, data replication for redundancy, and horizontal scalability (add nodes as you grow)
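
As a rough illustration of the feature list above, the sketch below shows one table serving relational, JSON and streaming-style workloads. It uses SingleStore-flavored syntax as a concrete example (SHARD KEY, SORT KEY and the ::$ JSON accessor are SingleStore-specific); the table and JSON field names are hypothetical, and other platforms will differ.

    -- Hypothetical events table mixing relational columns and JSON payloads
    CREATE TABLE app_events (
        event_id    BIGINT,
        customer_id BIGINT,
        event_time  DATETIME(6),
        payload     JSON,
        SHARD KEY (customer_id),  -- distributes rows across cluster nodes
        SORT KEY (event_time)     -- columnstore ordering for fast OLAP scans
    );

    -- OLTP-style point insert, e.g., from a streaming ingest path
    INSERT INTO app_events VALUES
        (1, 42, NOW(6), '{"type": "login", "device": "mobile"}');

    -- OLAP-style aggregate over the same table, reaching into the JSON
    SELECT payload::$type AS event_type, COUNT(*) AS events
    FROM app_events
    WHERE event_time >= NOW() - INTERVAL 1 DAY
    GROUP BY event_type;

The point is that inserts, JSON access and analytical scans all run against one engine, instead of being split across an OLTP store, a document store and a separate warehouse.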

2. A secondary data layer, Object Storage, for storing all unstructured data as well as data that is of no immediate use. Here, the integration between the two layers is key: we should be able to query the Object Storage from the primary layer and work with it on the same platform (see the sketch below the vendor list).

Potential Object Storage vendors: Any S3-compatible storage like Ceph, GlusterFS, HDFS (Cloudera)
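
Below is a minimal sketch of that attachment using SingleStore's pipeline syntax as one concrete example; other platforms expose similar external-table or pipeline mechanisms. The bucket path, endpoint, table name and credentials are hypothetical placeholders.

    -- Hypothetical pipeline pulling archived CSV files from
    -- S3-compatible object storage into the primary platform
    CREATE PIPELINE load_archived_txns AS
    LOAD DATA S3 'datalake-bucket/archive/transactions/'
    CONFIG '{"region": "us-east-1", "endpoint_url": "https://objectstore.internal:9000"}'
    CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}'
    INTO TABLE archived_transactions
    FIELDS TERMINATED BY ',';

    START PIPELINE load_archived_txns;

    -- Once loaded, the archived data is queryable like any other table
    SELECT COUNT(*) FROM archived_transactions;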

3. A versatile ETL tool that can read data from traditional as well as SaaS systems. As database engines have become more powerful over the last decade, the practice has shifted from ETL to ELT, which leverages the database engine's distributed processing capacity for all data transformation needs (see the sketch after the tool list).

Potential ETL tools: Talend, Informatica PowerCenter, Oracle Data Integrator, IBM InfoSphere DataStage
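
As a minimal ELT sketch (the staging and warehouse schema names are hypothetical, and the ::$ / ::% JSON accessors are SingleStore-specific), the tool only extracts and loads; the transformation runs as set-based SQL inside the distributed engine:

    -- Step 1 (Extract + Load): the tool lands raw records untouched
    CREATE TABLE staging.raw_orders (
        order_id   BIGINT,
        order_json JSON,
        loaded_at  DATETIME DEFAULT CURRENT_TIMESTAMP
    );

    -- Step 2 (Transform): plain SQL pushed down to the engine's
    -- distributed processing, not run row-by-row in the ETL tool
    INSERT INTO warehouse.orders (order_id, customer_id, amount)
    SELECT order_id,
           order_json::$customer_id,  -- JSON field extracted as a string
           order_json::%amount        -- JSON field extracted as a number
    FROM staging.raw_orders
    WHERE loaded_at >= CURRENT_DATE;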

4. An optional data virtualization layer that connects to sources where they are, without having to move data to a central location. Benefits include rapid implementation timelines and lower overhead (as there is no data warehouse to maintain). But this cannot be an all-in strategy; an organization can use it selectively, for example, to quickly integrate small SaaS vendors.

Potential DV tools: Denodo, Tibco

5. As the organization matures its data practice and moves up the analytics value chain, there comes a need for an MLOps layer to help build, deploy and maintain Machine Learning pipelines and predictive models across Bank-wide functions. AI/ML workloads are different from traditional reporting and BI workloads. While the MLOps tool helps manage a predictive model's life cycle, the EDW is where all the data heavy lifting happens. Together they enhance a Data Scientist's productivity.

Potential MLOps tools: Dataiku, RapidMiner, KNIME, H2O.ai, AWS SageMaker, Azure ML Studio, Iguazio, etc.

Below are some Research References (in addition to Everlytics' own experience) favoring the proposed architecture.

Through 2018, 90% of deployed data lakes will be useless

— Gartner Predictions 2014


Insights from a CTO

The insightful CTO of a Fortune 500 enterprise once said, “We love our Hadoop data lake because we can store all data forever at a very low cost per terabyte. At the same time, we hate our Hadoop data lake because the only people who can get the data out are the people who put the data in.”


How to Avoid Data Lake Failures

“Business and IT leaders are overestimating the effectiveness and usefulness of data lakes in their data and analytics strategies.”

— Gartner Research, Published on 10 August 2018


2017 Big Data Project Failure Study

“The number and importance of Big Data projects is increasing, but unfortunately, a large proportion of Big Data projects are failing.”

— David K. Becker, Systems Analysis and Consulting Beavercreek, OH


The Data Lake is Dead

“With many organizations having invested tens and even hundreds of millions of dollars in data lakes that deliver little or no business value, it’s way past time for some brutal self-assessment in the technology industry.”

— Martin Willcox at Teradata Corporation


2014 Big Data Failure Study

“Only 13% of organizations have achieved full-scale production for their Big Data implementations.”

— Capgemini Consulting, “Big Data Survey”, November 2014


Do you need a data lake or staging layer?

“A staging layer is more tightly controlled and requires longer development time, but has the benefit of increased accuracy and trust in the data warehouse.”
