Are Big Data Compromises Inevitable?
Eran Vanounou
CTO at Forter | Tech executive (LivePerson, Oracle, NICE, Sun) | Connecting people, technology and business
The Big Data Evolution
I’ve held leadership positions in tech companies for over 20 years. Like many others, I’ve watched data rise dramatically in scale and importance. Unfortunately, data management has grown more complex in direct proportion to that rise.
In the past, there really wasn’t much to consider; you had a central SQL database storing your data.
At the turn of the century, with Web 2.0 taking over, databases were required to serve numerous clients concurrently, while handling greater volume and throughput than before. NoSQL surged in popularity to address these needs but ultimately fell out of favor as the one-stop solution; as many of us learned the hard way, reporting and running BI over NoSQL is no fun.
In recent years, the central importance of data has risen dramatically. It is now common practice to store raw data as well as processed data. At Big Data scale, a single SQL or NoSQL database is far from sufficient. This drove many companies to adopt a strategy of separating data capture from data consumption.
Data Capture is the practice of storing raw data (often from various data sources) in an affordable and robust silo in the cloud.
Data Consumption is the layer of clients that read the data; this includes applications, services, BI tools, data scientists, machine learning programs, etc.
This separation supports the single-source-of-truth paradigm, makes data available in near real time, and allows the raw data to be stored in full, without losing any potentially important dimensions.
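To make the separation concrete, here is a minimal sketch in Python using boto3 (the bucket name, object keys, and event shape are hypothetical): the capture path persists raw events to a cloud data lake, and any number of consumers read the same records independently.

```python
import json
import boto3

s3 = boto3.client("s3")
LAKE_BUCKET = "acme-data-lake"  # hypothetical bucket holding the raw data


def capture_event(event: dict, key: str) -> None:
    """Data capture: persist the raw event as-is, keeping every dimension."""
    s3.put_object(Bucket=LAKE_BUCKET, Key=key,
                  Body=json.dumps(event).encode("utf-8"))


def consume_event(key: str) -> dict:
    """Data consumption: any client (BI tool, ML job, service) reads the same raw record."""
    obj = s3.get_object(Bucket=LAKE_BUCKET, Key=key)
    return json.loads(obj["Body"].read())


capture_event({"user_id": 42, "action": "checkout", "amount": 99.9},
              key="events/2020/01/15/evt-0001.json")
print(consume_event("events/2020/01/15/evt-0001.json"))
```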
Between the data capture and data consumption layers sit the data servicing tools. Generally speaking, these tools can be divided into two groups:
- The ones that optimize Cost and Performance
- The ones that strive for maximum Agility and Flexibility
Unfortunately, tools that emphasize Cost and Performance are rigid; run a query the system wasn’t designed for and you’ll be waiting far too long for results. Tools that emphasize Agility free you from the need to prepare the data in advance, but their performance-to-cost ratio is not viable for operational workloads.
Cost- and Performance-Focused Solutions
The rise of distributed SQL solutions such as AWS Redshift, Google BigQuery, and Azure Synapse, along with a slew of independent offerings, signaled the return to SQL databases for most applications. The distributed aspect brought great improvements in speed and concurrency.
These solutions work wonderfully once they are up and running. However, getting there is no small feat.
To get these solutions working properly, you’ll need to model the data in an optimal way. The model should be designed around the queries you’ll be running, so you need a reliable way of anticipating what those queries will be. Once the design is ready, you’ll need to build the ETL process that brings the data from your data lake into your solution of choice.
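As an illustration only (the table layout, column names, connection details, and S3 path below are all hypothetical), modeling for a warehouse such as Redshift typically means choosing distribution and sort keys around the queries you anticipate, and the ETL step then loads prepared files from the lake, for example with a COPY statement:

```python
import psycopg2  # standard PostgreSQL driver; Redshift speaks the same wire protocol

# Connection details are placeholders for a real cluster.
conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="etl", password="...")
cur = conn.cursor()

# Model the table around the queries we expect: joins on customer_id,
# filters and sorts on event_time.
cur.execute("""
    CREATE TABLE IF NOT EXISTS page_views (
        customer_id BIGINT,
        event_time  TIMESTAMP,
        url         VARCHAR(2048)
    )
    DISTKEY (customer_id)
    SORTKEY (event_time);
""")

# The ETL step: copy prepared files from the data lake into the warehouse.
cur.execute("""
    COPY page_views
    FROM 's3://acme-data-lake/prepared/page_views/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS PARQUET;
""")
conn.commit()
```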
All this requires planning, resources (tools and a dedicated team) and time.
If you are lucky, business requirements will remain static for at least as long as it took you to set all of this up. Unfortunately, that is often not the case, and being able to respond quickly to changing business requirements is a necessity, not a convenience.
Agile and Flexible Solutions
There’s another family of data solutions that emphasize agility and offer high flexibility. Solutions such as AWS Athena and Presto on EMR allow you to run queries directly on your data lake, with no need to prepare the data.
AWS Athena is a great tool that allows anyone with basic SQL knowledge to run powerful queries directly on the data lake. The freedom to run any query and get live results without any data preparation whatsoever is a game-changer. Many companies make Athena available to their data engineers and analysts for preliminary data research. Unfortunately, the pricing model and performance usually make this tool impractical for most operational and customer-facing solutions.
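For a sense of what that looks like in practice, here is a minimal sketch using boto3’s Athena client (the database, table, and S3 output location are hypothetical); Athena queries run asynchronously, so you poll for completion before fetching results.

```python
import time
import boto3

athena = boto3.client("athena")

# Submit a query directly against files in the data lake -- no loading or modeling step.
run = athena.start_query_execution(
    QueryString="SELECT action, COUNT(*) AS cnt FROM events GROUP BY action",
    QueryExecutionContext={"Database": "raw_lake"},                 # hypothetical Glue database
    ResultConfiguration={"OutputLocation": "s3://acme-athena-results/"},
)
query_id = run["QueryExecutionId"]

# Poll until the query finishes (simplified; real code should handle failures and timeouts).
while True:
    state = athena.get_query_execution(
        QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```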
Presto is an open-source distributed SQL engine designed for the world of big data. Presto integrates with a wide range of data sources such as RDBMSs, NoSQL solutions and Hadoop data warehouses. One of Presto’s unique advantages is that it allows you to run SQL queries across different data sources.
Presto is quickly becoming one of the most common solutions used by data-driven companies; companies such as Salesforce, Uber, and Netflix run it in production.
Presto on Amazon EMR is a popular choice because it gives you a very flexible solution. It can leverage data partitioning where applicable, but it can also handle queries the partitioning wasn’t designed for. The distributed architecture lets you improve performance by scaling out the Presto cluster; however, this translates to higher costs.
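To illustrate the cross-source capability mentioned above (the coordinator address, catalogs, schemas, and table names are hypothetical, and the sketch assumes the presto-python-client package), a single Presto query can join files in the data lake, exposed through the Hive catalog, with a table sitting in an operational MySQL database:

```python
import prestodb  # pip install presto-python-client

# Coordinator address and user are placeholders for a real cluster (e.g. Presto on EMR).
conn = prestodb.dbapi.connect(
    host="presto-coordinator.internal", port=8080, user="analyst",
    catalog="hive", schema="default",
)
cur = conn.cursor()

# One SQL statement spanning two catalogs: Parquet files in the lake (hive.*)
# joined against a customers table in MySQL (mysql.*).
cur.execute("""
    SELECT c.country, COUNT(*) AS views
    FROM hive.web.page_views AS pv
    JOIN mysql.crm.customers AS c ON pv.customer_id = c.id
    WHERE pv.dt = '2020-01-15'   -- partition column, so Presto can prune to one partition
    GROUP BY c.country
    ORDER BY views DESC
""")
for country, views in cur.fetchall():
    print(country, views)
```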
Are we doomed to cycle between these two compromises? Must we choose between the freedom and agility to query on any dimension and a fast, cost-efficient system? Is it really impossible to have both?
Can’t we do better? We definitely need to. Looking to the future, the challenges will only grow: the overall volume, the number of data sources and the variety of data types will continue to increase. Moreover, the demand for insights will keep rising as the number of data consumers grows. Today, a company’s success is directly related to how effectively it utilizes its data. In the future, this correlation will only be stronger.
This is exactly why we started Varada!
After being in stealth mode for a couple of years, we can finally share that we’ve built a big data infrastructure platform that is not only fast and flexible, but also 100x faster than any database we compared it against.
Check out my new post with all the exciting details.