SQL has made a Big (Data) comeback!
Does that make Hadoop-based Data Lakes a thing of the past?
With the emergence of next-generation data platforms featuring Distributed SQL and NoSQL capabilities (for handling diverse workloads) and a unified storage engine (for structured, semi-structured, and unstructured data), all on a single platform, companies are rethinking their Data Lake strategy.
Hadoop-based Data Lake implementations in most companies (including many banks) are struggling to deliver ROI, limiting those companies' ability to implement analytics use cases within an acceptable time frame.
We propose a design strategy that addresses this risk, one that was not technically feasible a few years ago. Thanks to the emergence of Distributed SQL platforms, SQL is making a Big (Data) comeback!
Proposed design in a Nutshell:
1. Implement a proven Distributed SQL platform (e.g., SingleStore) as the primary component of the Data Lake.
2. Use Object Storage (traditionally used for building Data Lakes) to store all unstructured data, as well as less-used or less-important data (which can be structured or semi-structured).
“Remember that data lakes do not have to be on a non-relational Hadoop environment. You can build your data lake on a relational database,” he said. “Many of the organizations we talk to, 90% of the data they’re putting on their data lake is structured relational data. So why not put it into a relational database?”
— Donald Feinberg, VP and distinguished analyst in the Gartner Data and Analytics group
Reference Architecture
Please refer to the hero image at the top of this page.
Each component in the stack is explained below. The proposed architecture:
1. Includes a primary data storage and processing engine to handle the majority of the data the organization deals with, which is structured relational data. The same engine should also handle semi-structured NoSQL data (JSON) as well as streaming/IoT data (a minimal sketch follows the feature list below).
Potential vendors to choose from: Distributed SQL platforms like SingleStore, Teradata, Snowflake, Apache Druid, VMware Greenplum, etc.
Key technical features to look for in the platform:
- ANSI SQL on a distributed, horizontally scalable engine
- Native support for semi-structured (JSON) data alongside relational tables
- Built-in ingestion for streaming/IoT data
- The ability to query data residing in external Object Storage (see component 2)
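Here is a minimal sketch of what "one engine for relational and JSON workloads" looks like in practice. It assumes a SingleStore (MySQL wire-compatible) instance; the host, credentials, and table/field names are illustrative, and JSON_EXTRACT_STRING is SingleStore's JSON accessor (other Distributed SQL engines offer equivalents).

```python
# Minimal sketch: one distributed SQL engine serving relational and
# JSON (NoSQL-style) workloads side by side. Host, credentials, and
# table/field names are illustrative.
import pymysql  # pip install pymysql; SingleStore speaks the MySQL protocol

conn = pymysql.connect(host="singlestore.internal", port=3306,
                       user="etl_user", password="...", database="datalake")

with conn.cursor() as cur:
    # One table mixing structured columns with a semi-structured JSON payload.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS customer_events (
            event_id    BIGINT,
            customer_id BIGINT,
            event_time  DATETIME,
            payload     JSON
        )
    """)
    # Relational filtering and JSON extraction in the same SQL statement.
    cur.execute("""
        SELECT customer_id,
               JSON_EXTRACT_STRING(payload, 'channel') AS channel
        FROM customer_events
        WHERE event_time >= NOW() - INTERVAL 1 DAY
    """)
    for row in cur.fetchall():
        print(row)
```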
2. A secondary data layer, Object Storage, for storing all unstructured data as well as data that is of no immediate use. The integration between the two layers is the key: we should be able to query the Object Storage from the primary layer and work with it on the same platform (see the sketch after the vendor list below).
Potential Object Storage vendors: any S3-compatible store such as Ceph (via its S3 gateway), or a distributed file system such as GlusterFS or HDFS (Cloudera).
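To make the primary-layer/Object Storage integration concrete, below is a hedged sketch using SingleStore's PIPELINE syntax to pull archived files from an S3-compatible bucket into a lake table. The bucket, endpoint, and credentials are placeholders; other platforms (e.g., Snowflake external stages) achieve the same thing with their own syntax.

```python
# Hedged sketch: ingesting archived files from S3-compatible Object Storage
# into the primary layer with a SingleStore PIPELINE. Bucket, endpoint, and
# credentials are placeholders.
import pymysql

conn = pymysql.connect(host="singlestore.internal", user="etl_user",
                       password="...", database="datalake")

with conn.cursor() as cur:
    cur.execute("""
        CREATE PIPELINE archive_events
        AS LOAD DATA S3 'datalake-archive/events/*.csv'
        CONFIG '{"region": "us-east-1",
                 "endpoint_url": "https://objectstore.internal"}'
        CREDENTIALS '{"aws_access_key_id": "...",
                      "aws_secret_access_key": "..."}'
        INTO TABLE customer_events
        FIELDS TERMINATED BY ','
    """)
    # Pipelines run continuously: new files landing in the bucket
    # become queryable rows in the primary layer.
    cur.execute("START PIPELINE archive_events")
```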
3. A versatile ETL tool that can read data from traditional as well as SaaS systems. As database engines have grown more powerful over the last decade, practice has shifted from ETL to ELT, which leverages the database engine's distributed processing capacity for all data transformation needs (a minimal sketch follows the tool list below).
Potential ETL tools: Talend, Informatica PowerCenter, Oracle Data Integrator, IBM InfoSphere DataStage.
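The sketch below illustrates the ELT pattern in miniature: Extract from a SaaS API, Load the raw payload into a staging table, then Transform inside the engine with plain SQL. The API endpoint and both table names are invented for illustration, and the tables are assumed to exist.

```python
# Minimal ELT sketch: Extract from a hypothetical SaaS API, Load the raw
# payload into a staging table, then Transform inside the database engine.
# API endpoint and table names are invented; both tables are assumed to exist.
import json
import requests  # pip install requests
import pymysql

orders = requests.get("https://api.example-saas.com/v1/orders",
                      timeout=30).json()["orders"]

conn = pymysql.connect(host="singlestore.internal", user="etl_user",
                       password="...", database="datalake")
with conn.cursor() as cur:
    # L: land the raw records first, unchanged.
    cur.executemany(
        "INSERT INTO stg_orders_raw (order_id, payload) VALUES (%s, %s)",
        [(o["id"], json.dumps(o)) for o in orders])
    # T: transform where the distributed compute lives, in plain SQL.
    cur.execute("""
        INSERT INTO fact_orders (order_id, customer_id, amount)
        SELECT order_id,
               JSON_EXTRACT_STRING(payload, 'customer_id'),
               JSON_EXTRACT_DOUBLE(payload, 'amount')
        FROM stg_orders_raw
    """)
conn.commit()
```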
4. An optional data virtualization layer that connects to sources where they are, without having to move data to a central location. Benefits include rapid implementation timelines and lower overhead (there is no extra data warehouse to maintain). But this cannot be an all-in strategy; SBM can use it selectively, for example, to quickly integrate small SaaS vendors (see the sketch after the tool list below).
Potential DV tools: Denodo, TIBCO.
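As a rough illustration of the selective-use idea, the following sketch issues one federated query against a Denodo virtual view over ODBC; no data is copied to a central store beforehand. The DSN and view names are hypothetical and assume a configured Denodo ODBC driver.

```python
# Rough sketch of selective data virtualization: one federated query against
# a virtual view, with no data copied to a central store first. The DSN and
# view names are hypothetical.
import pyodbc  # pip install pyodbc

dv = pyodbc.connect("DSN=denodo_vdp", autocommit=True)
cur = dv.cursor()
# The DV layer resolves each view against its live source and pushes
# filters and joins down to the sources where it can.
cur.execute("""
    SELECT c.customer_id, c.segment, t.open_tickets
    FROM v_crm_customers c
    JOIN v_helpdesk_tickets t ON t.customer_id = c.customer_id
    WHERE c.segment = 'SME'
""")
for row in cur.fetchall():
    print(row)
```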
5. As the organization matures its data practice and moves up the analytics value chain, an MLOps layer becomes necessary to build, deploy, and maintain Machine Learning pipelines and predictive models across Bank-wide functions. AI/ML workloads differ from traditional reporting and BI workloads: the MLOps tool manages a predictive model's life cycle, while the EDW is where all the data heavy lifting happens. Together they enhance a Data Scientist's productivity (see the sketch after the tool list below).
Potential MLOps tools: Dataiku, RapidMiner, KNIME, H2O.ai, AWS SageMaker, Azure ML Studio, Iguazio, etc.
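A minimal sketch of that division of labor follows, with MLflow standing in for the vendor tools listed above (its tracking API illustrates the life-cycle-management idea). The connection string, table, and feature names are hypothetical, and pandas' read_sql with a URI string requires SQLAlchemy to be installed.

```python
# Sketch of the division of labor: the EDW does the data heavy lifting via
# SQL, while an MLOps tool (MLflow here, as a stand-in) tracks the model
# life cycle. Connection string, table, and features are hypothetical.
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Heavy lifting pushed down to the warehouse: features arrive pre-aggregated.
features = pd.read_sql(
    "SELECT total_spend, txn_count, churned FROM ml_customer_features",
    con="mysql+pymysql://etl_user:...@singlestore.internal/datalake")

X = features[["total_spend", "txn_count"]]
y = features["churned"]

with mlflow.start_run(run_name="churn-baseline"):
    model = LogisticRegression().fit(X, y)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Logged models become versioned artifacts that can be registered
    # and deployed later in the pipeline.
    mlflow.sklearn.log_model(model, "model")
```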
Below are some Research References (in addition to Everlytics' own experience) favoring the proposed architecture.
— Gartner Predictions 2014
The insightful CTO of a Fortune 500 enterprise once said, “We love our Hadoop data lake because we can store all data forever at a very low cost per terabyte. At the same time, we hate our Hadoop data lake because the only people who can get the data out are the people who put the data in.”
“Business and IT leaders are overestimating the effectiveness and usefulness of data lakes in their data and analytics strategies.”
— Gartner Research, Published on 10 August 2018
“The number and importance of Big Data projects is increasing, but unfortunately, a large proportion of Big Data projects are failing.”
— David K. Becker, Systems Analysis and Consulting, Beavercreek, OH
“With many organizations having invested tens and even hundreds of millions of dollars in data lakes that deliver little or no business value, it’s way past time for some brutal self-assessment in the technology industry.”
— Martin Willcox at Teradata Corporation
“Only 13% of organizations have achieved full-scale production for their Big Data implementations.”
— Capgemini Consulting, “Big Data Survey”, November 2014
“A staging layer is more tightly controlled and requires longer development time, but has the benefit of increased accuracy and trust in the data warehouse.”