Is cloud replacing Hadoop?
Gourav Sikka
GCP Data Engineer | BigData | GCP | Google BigQuery | SAS | Dataproc | GCS | Tableau | Sqoop | Hadoop | Hive | DBT | Dremio | Power BI | DataLake | Datawarehouse | Databricks | Airflow | Cloud Composer | Fivetran | Miro | Control-M
What is Hadoop?
Hadoop is an open-source framework for storing and processing Big Data across a cluster of machines.
How Hadoop came into the picture:-
In 2003:- Google released a paper describing how to store large datasets.
This paper was called GFS (Google File System).
In 2004:- Google released another paper describing how to process large datasets.
This paper was called MapReduce.
In 2006:- Yahoo took these papers and implemented them.
The implementation of GFS was named HDFS (Hadoop Distributed File System).
What is Google BigQuery?
BigQuery is a query service that allows us to run SQL-like queries against multiple terabytes of data in a matter of seconds.
BigQuery uses Dremel technology.
Dremel:- Dremel is a distributed system developed at Google for interactively querying large datasets. Dremel is the query engine used in Google's BigQuery service.
· BigQuery is the public implementation of Dremel.
· Dremel can scan 35 billion rows without an index in tens of seconds.
· Dremel, the cloud-powered, massively parallel query service, shares Google's infrastructure, so it can parallelize each query and run it on tens of thousands of servers simultaneously.
· Dremel is highly scalable.
· Dremel is very fast.
· As far as replication is concerned, each tablet is usually three-way replicated.
Architecture of BigQuery:-
Data is stored in a proprietary columnar format.
Why is BigQuery so fast?
· Columnar storage: The data is stored by columns, and this makes it possible to achieve a very high compression ratio and scan throughput.
· Tree architecture: A tree execution architecture is used to dispatch queries and aggregate results across thousands of machines.
Columnar storage:- Data is stored column by column, so a query reads only the columns it references and touches fewer storage volumes. Because different columns sit on different volumes, they can also be read in parallel, which is faster still.
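To make the columnar idea concrete, here is a minimal Python sketch. It is not BigQuery's storage format (which is proprietary); the table, the column names and the "cells read" counter are all made up for illustration. It only shows why summing one column touches less data when the table is laid out by column.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# The table, column names and values are hypothetical.

rows = [
    {"user": "a", "country": "IN", "amount": 10},
    {"user": "b", "country": "US", "amount": 25},
    {"user": "c", "country": "IN", "amount": 5},
]

# Column-oriented layout of the same table: one list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def sum_amount_row_store(table):
    """Row store: each whole row is fetched even though one field is needed."""
    cells_read = 0
    total = 0
    for row in table:
        cells_read += len(row)  # every field of the row is read
        total += row["amount"]
    return total, cells_read


def sum_amount_column_store(table):
    """Column store: only the 'amount' column is scanned."""
    values = table["amount"]
    return sum(values), len(values)


row_total, row_cells = sum_amount_row_store(rows)
col_total, col_cells = sum_amount_column_store(columns)
print("row store   :", row_total, "cells read:", row_cells)
print("column store:", col_total, "cells read:", col_cells)
```

Both layouts give the same answer, but the column layout reads one third of the cells here. On a wide table with hundreds of columns the gap is much larger.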
Tree architecture for query execution:-
The root node receives the query, reads the table metadata and routes the query to the next level. At the bottom, the leaf nodes are the ones interacting with the distributed file system, retrieving the actual data, and propagating it back up the tree.
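The flow through the tree can be sketched in a few lines of Python. This is a toy model and not Dremel's implementation: the `Node` class, the two-level layout and the sample data are assumptions made for the example. Leaves scan their own shard, and each level above only adds up the partial results of its children.

```python
# Toy serving tree: the root fans the query out, leaves scan data,
# and partial results are merged on the way back up.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    children: List["Node"] = field(default_factory=list)
    shard: List[int] = field(default_factory=list)  # only leaves hold data

    def count_where(self, predicate: Callable[[int], bool]) -> int:
        if not self.children:
            # Leaf: scan the local shard (stands in for reading the file system).
            return sum(1 for value in self.shard if predicate(value))
        # Inner node or root: forward the query and aggregate partial counts.
        return sum(child.count_where(predicate) for child in self.children)


leaves = [
    Node(shard=[1, 8, 12]),
    Node(shard=[20, 3]),
    Node(shard=[7, 15, 30]),
    Node(shard=[2]),
]
mixers = [Node(children=leaves[:2]), Node(children=leaves[2:])]
root = Node(children=mixers)

# Roughly: SELECT COUNT(*) FROM t WHERE value > 10
result = root.count_where(lambda v: v > 10)
print(result)
```

In a real system the children would run in parallel on different machines; the recursion here is sequential only to keep the sketch short.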
Query dispatcher:-
Usually, several queries are executed at the same time, so a query dispatcher schedules the queries and balances the load.
The amount of work in each query is usually larger than the number of processing units (slots) available for execution, so work is assigned to slots as they become free.
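A small Python sketch of the slot idea follows. It is only an illustration under simple assumptions: the `schedule` function, the greedy "first free slot" policy and the durations are invented for the example, and BigQuery's real dispatcher is not public in this detail.

```python
# Toy dispatcher: more work units than slots, so each unit goes to
# whichever slot becomes free first.
import heapq


def schedule(work_units, num_slots):
    """Return (assignments, finish_time).

    assignments is a list of (unit_id, slot_id, start_time) tuples.
    """
    # Heap of (time at which the slot becomes free, slot id).
    slots = [(0, slot_id) for slot_id in range(num_slots)]
    heapq.heapify(slots)
    assignments = []
    for unit_id, duration in enumerate(work_units):
        free_at, slot_id = heapq.heappop(slots)  # earliest available slot
        assignments.append((unit_id, slot_id, free_at))
        heapq.heappush(slots, (free_at + duration, slot_id))
    finish_time = max(free_at for free_at, _ in slots)
    return assignments, finish_time


units = [4, 2, 3, 1, 2]  # hypothetical durations of five work units
assignments, finish_time = schedule(units, num_slots=2)
for unit_id, slot_id, start in assignments:
    print(f"unit {unit_id} -> slot {slot_id} at t={start}")
print("all work done at t =", finish_time)
```

With five units and two slots, three of the units have to wait for a slot to free up, which is the situation described above.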
BigQuery can also target external data sources with its queries. The supported sources are Bigtable, Google Cloud Storage, and Google Drive. The data is loaded on-the-fly into the Dremel engine.
HDFS:- Hadoop Distributed File System:-
Hadoop 1.0:-
HDFS:- HDFS for distributed storage.
MapReduce:- MapReduce for distributed processing.
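The MapReduce model can be shown with the classic word count, written here in plain Python. This is a single-process sketch and not Hadoop code: the function names and the three strings standing in for HDFS blocks are assumptions made for the example. In Hadoop, the map and reduce steps would run on many machines, close to where the blocks are stored.

```python
# Word count in the MapReduce style: map -> shuffle -> reduce.
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


# Stand-ins for three blocks of a file stored in HDFS.
blocks = ["big data big", "data hadoop", "Hadoop big"]

mapped = [pair for block in blocks for pair in map_phase(block)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```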
2008:- Hadoop became a top-level Apache Software Foundation project, developed as open source.
2013:- Apache released Hadoop 2.0, which brought major performance enhancements.
Hadoop 2.0:-
YARN:- Yet Another Resource Negotiator.
YARN is similar to an operating system, but for the whole cluster.
YARN manages all the machines in the cluster, for example 100 computers.
YARN acts like a store manager.
YARN is a resource manager.
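What "resource manager" means can be sketched as follows. This is a toy first-fit allocator and not YARN's scheduler (real YARN uses pluggable schedulers such as the Capacity and Fair schedulers); the `allocate` function, the node names and the memory figures are all hypothetical.

```python
# Toy resource manager: place each application's container on the first
# node that still has enough free memory.


def allocate(node_free_mb, requests_mb):
    """Return (placements, remaining free memory per node)."""
    free = dict(node_free_mb)  # copy so the caller's dict is not modified
    placements = {}
    for app, need in requests_mb.items():
        for node, available in free.items():
            if available >= need:
                free[node] = available - need
                placements[app] = node
                break
        else:
            # No node can host the container right now; the app would wait.
            placements[app] = None
    return placements, free


nodes = {"node1": 4096, "node2": 2048}  # free memory in MB
requests = {"app1": 3072, "app2": 2048, "app3": 1024, "app4": 2048}

placements, remaining = allocate(nodes, requests)
print(placements)
print(remaining)
```

Here app4 gets no node because the cluster is full, so it has to wait until another application releases its memory.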
Hadoop Ecosystem:-
HIVE:-
SQOOP:-
HBASE:-
OOZIE:-
APACHE SPARK:-
Spark Cluster:-
Google Cloud Platform (GCP):-
The Google Cloud Platform (GCP) offers a wide range of analytics tools, all built with unique capabilities for data analytics and management. Google's artificial intelligence (AI) and machine learning (ML) solutions, for example, can be integrated into existing tooling to provide real-time intelligence.
Bigtable:-
Google BigQuery:-
Dataproc:-
Dataflow:-
Pub/Sub:-