Is cloud replacing Hadoop?

What is Hadoop?

Hadoop is an open-source framework for solving Big Data problems: storing and processing very large datasets across clusters of machines.

How Hadoop came into the picture:-

In 2003, Google released a paper describing how to store and manage large datasets.

This paper was called GFS (Google File System).

In 2004:- Google released another paper describing how to process large datasets.

This paper was called MapReduce.

In 2006:- Yahoo took these papers and implemented them.

The implementation of GFS was named HDFS (Hadoop Distributed File System).

What is Google BigQuery?

BigQuery is a query service that allows us to run SQL-like queries against multiple terabytes of data in a matter of seconds.
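
For example, here is a minimal sketch of running such a query from Python with the google-cloud-bigquery client, against one of Google's public sample tables (GCP credentials and project setup are assumed):

    # Assumes `pip install google-cloud-bigquery` and configured GCP credentials.
    from google.cloud import bigquery

    client = bigquery.Client()

    # Standard SQL over a public sample table; the scan runs server-side.
    sql = """
        SELECT word, SUM(word_count) AS total
        FROM `bigquery-public-data.samples.shakespeare`
        GROUP BY word
        ORDER BY total DESC
        LIMIT 5
    """
    for row in client.query(sql).result():
        print(row.word, row.total)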

BigQuery is built on Dremel technology.

Dremel:- Dremel is a distributed system developed at Google for interactively querying large datasets. Dremel is the query engine used in Google's BigQuery service.

  • BigQuery is the public implementation of Dremel.
  • Dremel can scan 35 billion rows without an index in tens of seconds.
  • Dremel, the cloud-powered massively parallel query service, shares Google's infrastructure, so it can parallelize each query and run it on tens of thousands of servers simultaneously.
  • Dremel has high scalability.
  • Dremel is very fast.
  • As far as replication is concerned, each tablet is usually three-way replicated.

Architecture of BigQuery:-

Data is stored in a proprietary columnar format.

Why is BigQuery so fast?

  • Columnar storage: Data is stored by columns, which makes it possible to achieve a very high compression ratio and scan throughput.
  • Tree architecture: A tree execution architecture is used to dispatch queries and aggregate results across thousands of machines.

Columnar storage:- Data is stored by column, so a query reads only the columns it needs, touching fewer storage volumes (which is even faster since you can access them in parallel).
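
A toy sketch of the idea in Python (an illustration, not BigQuery's actual format): storing values column by column means a query that touches one column never reads the others, and the repetitive values within a column compress well:

    # Toy illustration of columnar layout (not BigQuery's real storage format).
    rows = [
        {"country": "US", "amount": 10},
        {"country": "US", "amount": 25},
        {"country": "DE", "amount": 7},
    ]

    # Columnar layout: one contiguous list per column.
    columns = {
        "country": [r["country"] for r in rows],  # ["US", "US", "DE"]: repetitive, compresses well
        "amount": [r["amount"] for r in rows],
    }

    # SELECT SUM(amount): touch only the "amount" column; "country" is never read.
    print(sum(columns["amount"]))  # 42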

Tree architecture for query execution:-

The root node receives the query, reads the table metadata, and reroutes the query to the next level. At the bottom, the leaf nodes are the ones interacting with the distributed file system, retrieving the actual data and propagating it back up the tree.
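
A toy sketch of that tree in Python, assuming each leaf node holds one shard of the data and the upper levels only merge partial results (all names here are illustrative):

    # Toy illustration of tree execution: leaves scan shards, parents merge partials.
    shards = [[3, 1, 4], [1, 5, 9], [2, 6, 5]]  # data held by three leaf nodes

    def leaf(shard):
        # Each leaf scans its own shard and computes a partial aggregate.
        return sum(shard)

    def root(partials):
        # The root (and any intermediate level) only merges partial results.
        return sum(partials)

    # The query is dispatched down the tree; results are aggregated back up.
    print(root([leaf(s) for s in shards]))  # 36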

Query dispatcher:-

Usually, several queries are executed at the same time, so a query dispatcher schedules queries and balances the load.

The amount of data processed in each query is usually larger than what the available processing units (called slots) can handle at once.

BigQuery can also target external data sources with its queries. The supported sources are Bigtable, Google Cloud Storage, and Google Drive. The data is loaded on-the-fly into the Dremel engine.

HDFS:- Hadoop Distributed File System

Hadoop 1.0:-

HDFS:- HDFS is used for distributed storage.

MapReduce:- MapReduce is used for distributed processing.

2009:- Hadoop came under the Apache Software Foundation and became open source.

2013:- Apache released Hadoop 2.0, which provided major performance enhancements.

Hadoop 2.0:-

  • MapReduce
  • YARN
  • HDFS
  • In Hadoop 2.0, MapReduce is divided into two parts: MapReduce (which now handles only processing) and YARN (which handles resource management).

YARN:- Yet Another Resource Negotiator.

YARN is similar to an operating system for the cluster: just as an OS manages the resources of one machine, YARN manages the resources of all the machines (say, all 100 computers) in the cluster.

YARN acts like a store manager: applications ask it for resources (CPU, memory), and it hands them out from the cluster's inventory.

YARN is Hadoop's resource manager.

Hadoop Ecosystem:-

  1. HIVE
  2. APACHE SPARK
  3. SQOOP
  4. HBASE
  5. YARN
  6. MAPREDUCE
  7. HDFS
  8. OOZIE

HIVE:-

  • A data warehouse tool.
  • It is built on top of Apache Hadoop for providing data query and analysis.
  • Hive converts SQL programs to MapReduce jobs, so the underlying complexity is hidden (see the sketch below).
  • Hive was developed by Facebook.
  • Hive uses HQL (Hive Query Language).
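
A minimal sketch of running HQL from Python through the PyHive library, assuming a reachable HiveServer2; the host, port, username, and table name are hypothetical placeholders:

    # Assumes `pip install pyhive` and a running HiveServer2.
    # Host, port, username, and table name are hypothetical placeholders.
    from pyhive import hive

    conn = hive.Connection(host="localhost", port=10000, username="hadoop")
    cursor = conn.cursor()

    # Plain HQL; Hive compiles it into MapReduce jobs behind the scenes.
    cursor.execute("SELECT category, COUNT(*) FROM sales GROUP BY category")
    for row in cursor.fetchall():
        print(row)

    conn.close()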

SQOOP:-

  • A command-line interface application for transferring data between relational databases and Hadoop.
  • It is used to transfer data in both directions: we have sqoop import and sqoop export commands (see the sketch below).
  • Sqoop is a data migration tool.
  • Under the hood, a Sqoop transfer runs as a MapReduce job.
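
A minimal sketch of driving a sqoop import from Python, assuming Sqoop is installed on the PATH; the JDBC URL, credentials, table, and HDFS directory are hypothetical placeholders:

    # Wraps the real `sqoop import` CLI; all connection details are hypothetical.
    import subprocess

    subprocess.run(
        [
            "sqoop", "import",
            "--connect", "jdbc:mysql://db.example.com/shop",  # source RDBMS
            "--username", "etl_user",
            "--table", "orders",                 # table to copy
            "--target-dir", "/user/etl/orders",  # destination directory in HDFS
            "--num-mappers", "4",                # run 4 parallel map tasks
        ],
        check=True,  # raise if the underlying MapReduce job fails
    )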

HBASE:-

  • A column-oriented NoSQL database that runs on top of HDFS.

OOZIE:-

  • It's a scheduler.
  • A workflow scheduler system to manage Apache Hadoop jobs.

APACHE SPARK:-

  • Spark is an alternative to MapReduce.
  • Spark is generally faster than MapReduce because it keeps intermediate data in memory.
  • Spark is a general-purpose in-memory compute engine.
  • We use Spark as the compute engine.
  • Spark is written in Scala; however, it officially supports Java, Scala, Python, and R.
  • Spark is a plug-and-play compute engine.
  • Spark can be plugged into any storage system, such as HDFS, local storage, or Amazon S3.
  • Spark can be plugged into any resource manager, such as YARN, Mesos, or Kubernetes (see the sketch after the next list).

Spark Cluster:-

  • For compute:- Spark
  • For storage:- HDFS
  • For resource management:- YARN
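
A minimal PySpark sketch of this exact layout, with Spark for compute, YARN as the resource manager, and HDFS for storage (the HDFS path is a hypothetical placeholder):

    # Assumes `pip install pyspark` on a machine configured to talk to a YARN cluster.
    # The HDFS path is a hypothetical placeholder.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("word-count")
        .master("yarn")  # ask YARN for executors; use "local[*]" to test on one machine
        .getOrCreate()
    )

    # Read a text file from HDFS, count words, show the top 10.
    lines = spark.read.text("hdfs:///user/etl/input.txt")
    words = lines.selectExpr("explode(split(value, ' ')) AS word")
    words.groupBy("word").count().orderBy("count", ascending=False).show(10)

    spark.stop()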

Google Cloud Platform (GCP):-

The Google Cloud Platform (GCP) offers a wide range of analytics tools, all built with unique capabilities for data analytics and management. Google's artificial intelligence (AI) and machine learning (ML) solutions, for example, can be integrated into existing tooling to provide real-time intelligence.

Bigtable:-

  • It's similar to HBase.
  • It's a NoSQL database.
  • Bigtable is a managed NoSQL database service designed to handle massive workloads while maintaining high performance.
  • It is used to power core Google services, such as Search, Analytics, Maps, and Gmail.
  • Bigtable uses a low-latency storage stack and is globally available (see the sketch below).
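
A minimal sketch of writing and reading one row with the google-cloud-bigtable Python client; the project, instance, table, and column-family names are hypothetical placeholders:

    # Assumes `pip install google-cloud-bigtable` and an existing instance and table.
    # Project, instance, table, and column-family names are hypothetical.
    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("user-events")

    # Write one cell: rows are keyed by byte strings, cells live in column families.
    row = table.direct_row(b"user#42")
    row.set_cell("events", "last_login", "2024-01-01T00:00:00Z")
    row.commit()

    # Read the row back by key.
    result = table.read_row(b"user#42")
    cell = result.cells["events"][b"last_login"][0]
    print(cell.value.decode())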

Google BigQuery:-

  • BigQuery is similar to Hive.
  • BigQuery is a data warehouse.
  • BigQuery is a fully managed enterprise data warehouse that helps you manage and analyze your data with built-in features like machine learning, geospatial analysis, and business intelligence.
  • BigQuery interfaces include the Google Cloud Console and the BigQuery command-line tool.
  • In BigQuery you write SQL-style statements to fetch the data; it mainly handles structured data (see the sketch below).
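
A minimal sketch of loading a local CSV into a table with the BigQuery Python client (a query example appears earlier in this article); the dataset, table, and file names are hypothetical placeholders:

    # Assumes `pip install google-cloud-bigquery` and configured GCP credentials.
    # Dataset, table, and file names are hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the CSV header row
        autodetect=True,      # let BigQuery infer the schema
    )

    with open("sales.csv", "rb") as f:
        load_job = client.load_table_from_file(f, "my_dataset.sales", job_config=job_config)
    load_job.result()  # block until the load job finishes
    print(client.get_table("my_dataset.sales").num_rows, "rows loaded")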

Dataproc:-

  • Dataproc provides you with a Hadoop cluster on GCP and access to Hadoop-ecosystem tools (e.g. Apache Pig, Hive, and Spark).
  • Like Dataflow (below), Cloud Dataproc can be used to implement ETL and data warehousing solutions.
  • Dataproc should be used if the processing has any dependencies on tools in the Hadoop ecosystem.
  • Dataproc is a managed service for processing large datasets, such as those used in big data initiatives (see the sketch below).
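
A minimal sketch of submitting a PySpark job to an existing Dataproc cluster with the google-cloud-dataproc Python client; the project, region, cluster name, and script URI are hypothetical placeholders:

    # Assumes `pip install google-cloud-dataproc` and an existing cluster.
    # Project, region, cluster name, and script URI are hypothetical.
    from google.cloud import dataproc_v1

    region = "us-central1"
    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    job = {
        "placement": {"cluster_name": "my-cluster"},
        "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/word_count.py"},
    }
    operation = job_client.submit_job_as_operation(
        request={"project_id": "my-project", "region": region, "job": job}
    )
    print(operation.result().driver_output_resource_uri)  # where the driver logs land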

Dataflow:-

  • Cloud Dataflow provides you with a place to run Apache Beam based jobs on GCP.
  • It can be used for both batch processing and streaming.
  • With Beam, we don't need to worry about how the "runner" works; in comparison, when authoring a Spark job, your code is bound to the runner (Spark) and to how that runner works.
  • It creates jobs based on "templates," which can help simplify common tasks where the differences are parameter values.
  • It enables developers to set up processing pipelines for integrating, preparing, and analyzing large data sets, such as those found in web analytics or big data applications (see the sketch below).
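
A minimal Apache Beam sketch in Python: run as-is it uses the local DirectRunner, and the same pipeline runs on Dataflow by switching the runner option (the file names are hypothetical placeholders):

    # Assumes `pip install apache-beam`; install the `gcp` extra and pass
    # --runner=DataflowRunner (plus project/region options) to run on Dataflow.
    # File names are hypothetical.
    import apache_beam as beam

    with beam.Pipeline() as p:  # DirectRunner by default: executes locally
        (
            p
            | "Read" >> beam.io.ReadFromText("access.log")
            | "ExtractStatus" >> beam.Map(lambda line: line.split()[-1])
            | "CountPerStatus" >> beam.combiners.Count.PerElement()
            | "Format" >> beam.MapTuple(lambda status, n: f"{status}\t{n}")
            | "Write" >> beam.io.WriteToText("status_counts")
        )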

Pub-sub:-

  • Google Cloud Pub/Sub provides messaging between applications.
  • Publisher applications send messages to a "topic" and other applications subscribe to that topic to receive the messages (see the sketch below).
  • Pub/Sub is used for streaming analytics and data integration pipelines to ingest and distribute data.
  • Cloud Pub/Sub is only a queue: messages sit in it, much as rows sit in a database, so you still need something between Pub/Sub and BigQuery that executes the jobs waiting in the queue.
  • For this, people often use Dataflow, but you can implement your own worker to read from Pub/Sub and write to BigQuery.
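
A minimal sketch of publishing a message from Python; the project and topic names are hypothetical placeholders, and a subscriber (for example, a Dataflow job) would consume the message on the other side:

    # Assumes `pip install google-cloud-pubsub` and an existing topic.
    # Project and topic names are hypothetical.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "click-events")

    # Message payloads are byte strings; extra keyword args become attributes.
    future = publisher.publish(topic_path, b'{"user": 42, "page": "/home"}', source="web")
    print("published message id:", future.result())  # blocks until the server acks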







