BigQuery: Chapter 1A - An Introduction
Kunal Mehta
Hello everyone! Let's start our discussion on BigQuery today: the platform, where it comes from, its architecture, why it is as fast as it is, and, most importantly, why you, as a marketer, would want to use this technology. Before you go any further, my request would be to go through the two earlier articles below, to understand what we are trying to do here:
As explained in the earlier post, this is the first part of the series, where we will talk about the underlying technology and its architecture. In this part I will stop short of giving you the use cases, the answer to "why BigQuery", because I think that would be an overdose at this point; in subsequent articles I will go into more detail, from a marketer's perspective, about how and why we should use this platform. I hope that giving you bite-sized information will make it a little easier for everyone to scale up.
Introducing Dremel
Google has been dealing with Big Data since before it was cool. For Google, even a simple question like "how many impressions did a specific keyword get in one region?" is a Big Data query. The volume of data is just too big for any standard database to handle, or for any DBMS language to answer quickly. This is why Google came up with its in-house core technology, 'Dremel'. The idea was to create a query service that could be used to write SQL queries, with the one difference that they run on very large datasets. The reason for focusing on SQL was that it is simple, and many folks, including those without any technical background, have some exposure to it; that would make it easy to adopt across different operations within Google, such as tech support, technical account management, analysts, etc.
BigQuery is essentially a public implementation of the same Dremel technology, which Google has made available to third-party developers. So, much like many well-known cloud technologies out there, BigQuery is a Google product that was developed for Google, by Google, and probably one they never expected to one day offer to external developers.
BigQuery is made available to developers through REST APIs, a command-line interface and a web GUI (both the traditional interface and the newly revamped one). To give you an idea of the power of the underlying Dremel engine: it can scan more than 30 billion rows, without an index, in tens of seconds. Now, this point might be confusing to traditional data warehouse experts: how are queries managed when the database is not indexed in the first place? And how can it be so fast without any query management? Google solves this problem by 1) building a cloud-powered infrastructure where each query can be parallelized and run on tens of thousands of servers simultaneously, and 2) using a much-improved storage technology that compresses data by column, rather than by row, with each column stored on its own storage volume. We will devote more space and time to the second point in later sections.
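To make this concrete, here is a minimal sketch of running a BigQuery query from Python with the official google-cloud-bigquery client, against one of Google's public sample datasets. It assumes you have a Google Cloud project with billing enabled and default credentials configured; the query itself is just illustrative.

```python
from google.cloud import bigquery

# Assumes default credentials and a billing-enabled project are configured.
client = bigquery.Client()

# A simple aggregation over a public sample dataset; no index is needed.
sql = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 10
"""

for row in client.query(sql).result():  # blocks until the job finishes
    print(row.word, row.total)
```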
Thanks to Google, users bear no cost for maintaining an infrastructure of this size; BigQuery's pricing model is robust and effective enough that it, by itself, makes the platform one of the most attractive options on the market.
In addition to SQL, Dremel also lets its users apply regular expressions for string matching within the same SQL syntax, so that string data can be handled much more efficiently. In fact, if your query is taking longer than a few tens of seconds, you really need to check what's going on: Dremel has been shown to process and return results for queries scanning 20 TB of data within tens of seconds.
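As a hedged illustration of regex inside SQL, here is how this looks in today's BigQuery standard SQL using REGEXP_CONTAINS (the modern equivalent of the regex matching Dremel's dialect exposed), run through the same Python client:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials are configured

# REGEXP_CONTAINS filters rows with a regular expression inside plain SQL.
sql = r"""
    SELECT word, word_count
    FROM `bigquery-public-data.samples.shakespeare`
    WHERE REGEXP_CONTAINS(word, r'^[A-Z][a-z]+ing$')
    LIMIT 10
"""

for row in client.query(sql).result():
    print(row.word, row.word_count)
```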
A view of the underlying architecture
As promised earlier, let's understand the architecture that makes BigQuery as blazingly fast as it is. There are two key aspects of Dremel's technology that make it so fast:
1) Columnar Storage
2) Tree Architecture
Think about how data is stored. To compress data and make querying faster, redundancies are first removed from the data, and the de-duplicated rows are then stored on the same storage volumes. This is what most traditional databases do. An even better solution is to remove redundancies at the column level. Take a Country column, for example: there will always be many cases where a country name appears more than once, so these redundancies can be removed per column rather than per row. On top of that, each column can be stored on its own storage volume. This, again, is something that someone like Google can do, given the immense infrastructure they already have.
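A toy sketch of the idea (purely illustrative; this is not BigQuery's actual storage format): dictionary-encoding a repetitive Country column so that each distinct value is stored only once, and each row is reduced to a small integer code.

```python
# Toy illustration of column-level redundancy removal (dictionary encoding);
# BigQuery's real columnar format is far more sophisticated.
countries = ["India", "USA", "India", "UK", "USA", "India", "UK", "India"]

dictionary = sorted(set(countries))            # each distinct value stored once
index = {value: code for code, value in enumerate(dictionary)}
codes = [index[c] for c in countries]          # one small integer per row

print(dictionary)  # ['India', 'UK', 'USA']
print(codes)       # [0, 2, 0, 1, 2, 0, 1, 0]
```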
This kind of storage gives you two immediate advantages:
1) Traffic minimization: the moment you issue your SELECT statement with the column names you want, the query is applied only to the relevant storage volumes. So even when you are querying a dataset that might be 40 GB in total, your query might run over only 10 GB of data, because you selected only 5 columns in your SELECT statement (see the dry-run sketch after this list).
2) Compression ratio: this is no rocket science; since the cardinality (the variation in possible values) within a single column is much lower than across whole rows, data can be compressed at ratios as high as 1:10, rather than the roughly 1:3 typical of traditional row-oriented databases.
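You can observe the traffic-minimization effect yourself with a dry-run query, which asks BigQuery how many bytes a query would scan without actually running it. A minimal sketch with the Python client (the table and columns are illustrative); selecting fewer columns reduces the reported bytes:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials are configured

# A dry run estimates the bytes scanned without executing the query;
# selecting fewer columns touches fewer of the column storage volumes.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
sql = """
    SELECT word, word_count
    FROM `bigquery-public-data.samples.shakespeare`
"""

job = client.query(sql, job_config=job_config)
print(f"This query would process {job.total_bytes_processed} bytes.")
```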
At this point, Dremel's second challenge was how to dispatch a query across many storage volumes and then combine the results from tens of thousands of machines in seconds. This was addressed by developing a massively distributed tree architecture: the query is first pushed down the tree, and the results are then aggregated back up from the leaves.
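A toy sketch of the scatter-gather idea behind that tree (purely illustrative, not Dremel's actual protocol): a root splits an aggregation across leaf shards, each leaf computes a partial result locally, and the root folds the partials back together.

```python
from concurrent.futures import ThreadPoolExecutor

# Data spread across "leaf" servers; each leaf aggregates its own shard.
shards = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

def leaf_partial_sum(shard):
    return sum(shard)  # local aggregation at the leaf

# The "root" dispatches the work to all leaves in parallel...
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(leaf_partial_sum, shards))

# ...and then combines the partial results on the way back up the tree.
print(partials, sum(partials))  # [6, 9, 30] 45
```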
How does BigQuery tie in with Dremel?
A few years ago, Google made BigQuery publicly available as a service, so that developers could use the power of Dremel for their own Big Data requirements.
Some of you might have heard of terms like MapReduce or Hadoop, and at this point might be wondering what additional advantage BigQuery has over that system. For the uninitiated: MapReduce is a Big Data processing programming model, implemented by open-source projects such as Hadoop. In fact, MapReduce was invented by Google, and was used by them until 2014 to serve their Big Data needs. But Google's operations required a faster framework, and MapReduce wasn't really cutting it for them, so they retired MapReduce (and their internal tooling for it) and developed Dremel as a faster alternative, though at a high level both utilize the power of massively parallel computing infrastructure.
BigQuery as a Data Warehouse
Now, this is a slightly difficult question to answer. The fact is that BigQuery is not really a data warehouse; but it is also a fact that, most of the time, data warehouses are used for simple use cases like BI, backed by familiar mechanisms: relational databases, cubes or data marts, in-memory databases, parallel I/O. Some of these mechanisms can be very expensive (as with in-memory databases), or complicated (as with building indices for relational databases), or ill-suited to ad hoc requests (as when you must build data cubes before you can run any query against them). For these reasons, it is easier and better to use the power of Dremel and take advantage of the thousands of disk drives, processors and servers maintained by Google to run your own queries.
Can you use your own Big Data with BigQuery?
Absolutely you can, though you will first have to import your Big Data into Google Cloud Storage and then link BigQuery to it. Neither operation is difficult or very expensive, so this is highly recommended. I still would not recommend replacing your entire Big Data solution with BigQuery: whereas your own MapReduce implementation can serve data mining operations and long-running batch jobs, BigQuery is definitely the better alternative for BI queries that need fast results.
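For reference, here is roughly what that import looks like with the Python client: loading a CSV file from a Cloud Storage bucket into a BigQuery table. The bucket, file and table names here are hypothetical placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials are configured

# Hypothetical names; substitute your own bucket, dataset and table.
uri = "gs://your-bucket/exports/events.csv"
table_id = "your-project.your_dataset.events"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the CSV header row
    autodetect=True,       # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish
print(client.get_table(table_id).num_rows, "rows loaded")
```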