BigQuery: Chapter 1A - An Introduction
Kunal Mehta
Hello everyone! Let's start our discussion on BigQuery today: the platform, where it comes from, its architecture, why it is as fast as it is, and, most importantly, why you, as a marketer, would want to use this technology. Before you go any further, my request would be to go through the two earlier articles below, to understand what we are trying to do here:
As explained in the earlier post, this is the first part of the series, where we will talk about the underlying technology and its architecture. In this part I will stop short of giving you the use cases, the answer to "why BigQuery", because I think that would be an overdose at this point; in subsequent articles I will go into more detail, from a marketer's perspective, about how and why we should use this platform. I hope that giving you bite-sized information will make it a little easier for everyone to scale up.
Introducing Dremel
Google has been dealing with Big Data since before it was cool. For Google, even a simple question like "how many impressions did a specific keyword get in one region?" is a Big Data query. The volume of data is just too big for any standard database to handle, or for any DBMS language to answer quickly. This is why Google came up with its in-house core technology, 'Dremel'. The idea was to create a query service that could be used to write SQL queries, with the one difference that they run on very large datasets. The reason for focusing on SQL was that it is simple, and many folks, including those without any technical background, have some exposure to it; that would make it easy to adopt across different operations within Google, such as tech support, technical account management, analysts, etc.
BigQuery is essentially a public implementation of the same Dremel technology, which Google has made available to third-party developers. So, much like many well-known cloud technologies out there, BigQuery is a Google product that was developed for Google, by Google, and probably one they never expected to one day offer to external developers.
BigQuery is made available to developers through REST APIs, a command-line interface and a web GUI (both the traditional interface and the newly revamped one). To give you an idea of the power of the underlying Dremel engine: it can scan more than 30 billion rows, without an index, in tens of seconds. Now, this point might be confusing to traditional data warehouse experts: how are queries managed when the database is not indexed in the first place? And how can it be so fast without any query management? Google solves this problem by 1) building a cloud-powered infrastructure where each query can be parallelized and run on tens of thousands of servers simultaneously, and 2) using a much-improved storage technology that compresses data by column, rather than by row, with each column stored on its own storage volume. We will devote more space and time to the second point in later sections.
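To make this concrete, here is a minimal sketch of running a BigQuery query from Python with the official google-cloud-bigquery client, against one of Google's public sample datasets. It assumes you have a Google Cloud project with billing enabled and default credentials configured; the query itself is just illustrative.

```python
from google.cloud import bigquery

# Assumes default credentials and a billing-enabled project are configured.
client = bigquery.Client()

# A simple aggregation over a public sample dataset; no index is needed.
sql = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 10
"""

for row in client.query(sql).result():  # blocks until the job finishes
    print(row.word, row.total)
```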
Thanks to Google, users bear no cost for maintaining an infrastructure of this size; BigQuery's pricing model is robust and effective enough that it, by itself, makes the platform one of the most attractive options on the market.
In addition to SQL, Dremel also lets its users apply regular expressions for string matching within the same SQL syntax, so that string data can be handled much more efficiently. In fact, if your query is taking longer than a few tens of seconds, you really need to check what's going on: Dremel has been shown to process and return results for queries scanning 20 TB of data within tens of seconds.
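As a hedged illustration of regex inside SQL, here is how this looks in today's BigQuery standard SQL using REGEXP_CONTAINS (the modern equivalent of the regex matching Dremel's dialect exposed), run through the same Python client:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials are configured

# REGEXP_CONTAINS filters rows with a regular expression inside plain SQL.
sql = r"""
    SELECT word, word_count
    FROM `bigquery-public-data.samples.shakespeare`
    WHERE REGEXP_CONTAINS(word, r'^[A-Z][a-z]+ing$')
    LIMIT 10
"""

for row in client.query(sql).result():
    print(row.word, row.word_count)
```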
A view of the underlying architecture
As promised earlier, let's understand the architecture that makes BigQuery as blazingly fast as it is. There are two key aspects of Dremel's technology that make it so fast:
1) Columnar Storage
2) Tree Architecture
Think about how data is stored. To compress data and make querying faster, redundancies are first removed from the data, and the de-duplicated rows are then stored on the same storage volumes. This is what most traditional databases do. An even better solution is to remove redundancies at the column level. Take a Country column, for example: there will always be many cases where a country name appears more than once, so these redundancies can be removed per column rather than per row. On top of that, each column can be stored on its own storage volume. This, again, is something that someone like Google can do, given the immense infrastructure they already have.
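A toy sketch of the idea (purely illustrative; this is not BigQuery's actual storage format): dictionary-encoding a repetitive Country column so that each distinct value is stored only once, and each row is reduced to a small integer code.

```python
# Toy illustration of column-level redundancy removal (dictionary encoding);
# BigQuery's real columnar format is far more sophisticated.
countries = ["India", "USA", "India", "UK", "USA", "India", "UK", "India"]

dictionary = sorted(set(countries))            # each distinct value stored once
index = {value: code for code, value in enumerate(dictionary)}
codes = [index[c] for c in countries]          # one small integer per row

print(dictionary)  # ['India', 'UK', 'USA']
print(codes)       # [0, 2, 0, 1, 2, 0, 1, 0]
```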
This kind of storage gives you two immediate advantages:
1) Traffic minimization: the moment you issue your SELECT statement with the column names you want, the query is applied only to the relevant storage volumes. So even when you are querying a dataset that might be 40 GB in total, your query might run over only 10 GB of data, because you selected only 5 columns in your SELECT statement (see the dry-run sketch after this list).
2) Compression ratio: this is no rocket science; since the cardinality (the variation in possible values) within a single column is much lower than across whole rows, data can be compressed at ratios as high as 1:10, rather than the roughly 1:3 typical of traditional row-oriented databases.
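You can observe the traffic-minimization effect yourself with a dry-run query, which asks BigQuery how many bytes a query would scan without actually running it. A minimal sketch with the Python client (the table and columns are illustrative); selecting fewer columns reduces the reported bytes:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials are configured

# A dry run estimates the bytes scanned without executing the query;
# selecting fewer columns touches fewer of the column storage volumes.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
sql = """
    SELECT word, word_count
    FROM `bigquery-public-data.samples.shakespeare`
"""

job = client.query(sql, job_config=job_config)
print(f"This query would process {job.total_bytes_processed} bytes.")
```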
At this point, Dremel's second challenge was how to dispatch a query across many storage volumes and then combine the results from tens of thousands of machines in seconds. This was addressed by developing a massively distributed tree architecture: the query is first pushed down the tree, and the results are then aggregated back up from the leaves.
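A toy sketch of the scatter-gather idea behind that tree (purely illustrative, not Dremel's actual protocol): a root splits an aggregation across leaf shards, each leaf computes a partial result locally, and the root folds the partials back together.

```python
from concurrent.futures import ThreadPoolExecutor

# Data spread across "leaf" servers; each leaf aggregates its own shard.
shards = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

def leaf_partial_sum(shard):
    return sum(shard)  # local aggregation at the leaf

# The "root" dispatches the work to all leaves in parallel...
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(leaf_partial_sum, shards))

# ...and then combines the partial results on the way back up the tree.
print(partials, sum(partials))  # [6, 9, 30] 45
```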
How does BigQuery tie in with Dremel?
A few years ago, Google made BigQuery publicly available as a service, so that developers could use the power of Dremel for their own Big Data requirements.
Some of you might have heard of terms like MapReduce or Hadoop, and at this point might be wondering what additional advantage BigQuery has over that system. For the uninitiated: MapReduce is a Big Data processing programming model, implemented by open-source projects such as Hadoop. In fact, MapReduce was invented by Google, and was used by them until 2014 to serve their Big Data needs. But Google's operations required a faster framework, and MapReduce wasn't really cutting it for them, so they retired MapReduce (and their internal tooling for it) and developed Dremel as a faster alternative, though at a high level both utilize the power of massively parallel computing infrastructure.
BigQuery as a Data Warehouse
Now, this is a slightly difficult question to answer. The fact is that BigQuery is not really a data warehouse; but it is also a fact that, most of the time, data warehouses are used for simple use cases like BI, backed by familiar mechanisms: relational databases, cubes or data marts, in-memory databases, parallel I/O. Some of these mechanisms can be very expensive (as with in-memory databases), or complicated (as with building indices for relational databases), or ill-suited to ad hoc requests (as when you must build data cubes before you can run any query against them). For these reasons, it is easier and better to use the power of Dremel and take advantage of the thousands of disk drives, processors and servers maintained by Google to run your own queries.
Can you use your own Big Data with BigQuery?
Absolutely you can, though you will first have to import your Big Data into Google Cloud Storage and then link BigQuery to it. Neither operation is difficult or very expensive, so this is highly recommended. I still would not recommend replacing your entire Big Data solution with BigQuery: whereas your own MapReduce implementation can serve data mining operations and long-running batch jobs, BigQuery is definitely the better alternative for BI queries that need fast results.
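For reference, here is roughly what that import looks like with the Python client: loading a CSV file from a Cloud Storage bucket into a BigQuery table. The bucket, file and table names here are hypothetical placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials are configured

# Hypothetical names; substitute your own bucket, dataset and table.
uri = "gs://your-bucket/exports/events.csv"
table_id = "your-project.your_dataset.events"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the CSV header row
    autodetect=True,       # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish
print(client.get_table(table_id).num_rows, "rows loaded")
```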