Frequently Asked Hadoop Questions

What is Hadoop?

Hadoop is an open-source software framework for storing large amounts of data and processing/querying those data on a cluster with multiple nodes of commodity hardware (i.e. low-cost hardware). In short, Hadoop consists of the following:

HDFS (Hadoop Distributed File System): HDFS allows you to store huge amounts of data in a distributed and redundant manner. For example, a 1 GB (i.e. 1024 MB) text file can be split into 8 * 128 MB blocks and stored on up to 8 different nodes in a Hadoop cluster. Each block can be replicated 3 times for fault tolerance so that if 1 node goes down, you have backups. HDFS is good for sequential, write-once-and-read-many-times access.
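
As a quick arithmetic check of the HDFS example, here is a minimal sketch that computes how many blocks a file occupies and how many block copies the cluster holds. The function name hdfs_layout is made up for illustration, and the 128 MB block size and replication factor of 3 are the values used in the example, not fixed properties of every cluster.

```python
import math


def hdfs_layout(file_mb: int, block_mb: int = 128, replication: int = 3):
    """Return (number of blocks, total block copies stored in the cluster)."""
    blocks = math.ceil(file_mb / block_mb)  # the last block may be partly filled
    return blocks, blocks * replication


blocks, copies = hdfs_layout(1024)
print(blocks, copies)  # 8 24
```

So a 1024 MB file with 128 MB blocks occupies 8 blocks, and with 3 replicas the cluster stores 24 block copies in total.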

MapReduce: A computational framework that processes large amounts of data in a distributed and parallel manner. When you run a query on the above 1 GB file for all users with age > 18, there will be, say, 8 “map” functions running in parallel, each extracting users with age > 18 from its own 128 MB split, and then the “reduce” function will run to combine all the individual outputs into a single final result.
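
The map and reduce steps of the age > 18 query can be sketched in plain Python. This is only an illustration of the idea, not the Hadoop MapReduce API: the two small lists stand in for 128 MB splits, and the names map_adults and combine are hypothetical.

```python
from functools import reduce


def map_adults(split):
    """Map step: runs independently on one split and keeps users older than 18."""
    return [user for user in split if user["age"] > 18]


def combine(left, right):
    """Reduce step: merges two partial results into one."""
    return left + right


# Hypothetical data: two tiny splits standing in for 128 MB blocks.
splits = [
    [{"name": "ann", "age": 17}, {"name": "bob", "age": 42}],
    [{"name": "cy", "age": 19}, {"name": "di", "age": 18}],
]

# On a cluster each map call would run in parallel on a different node.
partials = [map_adults(split) for split in splits]
adults = reduce(combine, partials, [])
print([user["name"] for user in adults])  # ['bob', 'cy']
```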

YARN (Yet Another Resource Negotiator): A framework for job scheduling and cluster resource management.

Hadoop ecosystem: 15+ frameworks and tools like Sqoop, Flume, Kafka, Pig, Hive, Spark, Impala, etc. to ingest data into HDFS, to wrangle data (i.e. transform, enrich, aggregate, etc.) within HDFS, and to query data from HDFS for business intelligence and analytics. Some tools like Pig and Hive are abstraction layers on top of MapReduce, whilst other tools like Spark and Impala improve on the MapReduce architecture and design to give much lower latencies that support near real-time (i.e. NRT) and real-time processing.

Why Are Organizations Moving from Traditional Data Warehouse Tools to Smarter Data Hubs Based on Hadoop Ecosystems?

Organizations are investing to enhance their:

Existing data infrastructure:

  • predominantly using “structured data” stored on high-end, expensive hardware.
  • predominantly processing data as ETL batch jobs that ingest it into RDBMS and data warehouse systems for data mining, analysis, and reporting to make key business decisions.
  • predominantly handling data volumes in gigabytes to terabytes.

Smarter data infrastructure based on Hadoop where

  • structured (e.g. RDBMS), unstructured (e.g. images, PDFs, docs), and semi-structured (e.g. logs, XMLs) data can be stored on cheaper commodity machines in a scalable and fault-tolerant manner.
  • data can be ingested via batch jobs and near-real-time (i.e. NRT, 200ms to 2 seconds) streaming (e.g. Flume and Kafka).
  • data can be queried with low latency (i.e. under 100ms) with tools like Spark and Impala.
  • larger data volumes in terabytes to petabytes can be stored.

This empowers organizations to make better business decisions with smarter and bigger data, using more powerful tools to ingest data, to wrangle stored data (e.g. aggregate, enrich, transform, etc.), and to query the wrangled data with low-latency capabilities for reporting and business intelligence.

How Does a Smarter & Bigger Data Hub Architecture Differ from a Traditional Data Warehouse Architecture?

Traditional Enterprise Data Warehouse Architecture

Hadoop-based Data Hub Architecture

What Are the Benefits of Hadoop-Based Data Hubs?

Improved overall SLAs (i.e. Service Level Agreements) as data volume and complexity grow. Contributing factors include the “Shared Nothing” architecture, parallel processing, memory-intensive processing frameworks like Spark and Impala, and resource preemption in YARN’s capacity scheduler.

Scaling data warehouses can be expensive. Adding high-end hardware capacity and licensing data warehouse tools can cost significantly more. Hadoop-based solutions can not only be cheaper, with commodity hardware nodes and open-source tools, but can also complement the data warehouse solution by offloading data transformations to Hadoop tools like Spark and Impala for more efficient parallel processing of Big Data. This also frees up data warehouse resources.

Exploration of new avenues and leads. Hadoop can provide an exploratory sandbox for the data scientists to discover potentially valuable data from social media, log files, emails, etc., that are not normally available in data warehouses.

Better flexibility. Often business requirements change, and this requires changes to schema and reports. Hadoop-based solutions are not only flexible to handle evolving schemas, but also can handle semi-structured and unstructured data from disparate sources like social media, application log files, images, PDFs, and document files.

What Are Key Steps in Big Data Solutions?

Ingesting data, storing data (i.e. data modelling), and processing data (i.e. data wrangling, data transformations, and querying data).

Ingesting Data

Extracting data from various sources such as:

  1. RDBMSs: Relational Database Management Systems like Oracle, MySQL, etc.
  2. ERPs: Enterprise Resource Planning systems like SAP.
  3. CRMs: Customer Relationship Management systems like Siebel, Salesforce, etc.
  4. Social Media feeds and log files.
  5. Flat files, docs, and images.

The extracted data is then stored on a data hub based on the “Hadoop Distributed File System”, which is abbreviated as HDFS. Data can be ingested via batch jobs (e.g. running every 15 minutes, once every night, etc.), streaming in near real-time (i.e. 100ms to 2 minutes), and streaming in real-time (i.e. under 100ms).

One common term used in Hadoop is “Schema-On-Read”. This means unprocessed (aka raw) data can be loaded into HDFS, with a structure applied at processing time based on the requirements of the processing application. This is different from “Schema-On-Write”, which is used in RDBMSs, where the schema needs to be defined before the data can be loaded.
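
The difference can be sketched in a few lines of plain Python. This is an illustration of the idea only, not a Hadoop API: the raw lines are stored untouched, and each reader applies the structure it needs at read time. The helper read_with_schema and the sample data are hypothetical.

```python
# Raw data is stored as-is; nothing is validated or typed at load time.
raw_lines = ["1,ann,17", "2,bob,42"]


def read_with_schema(lines, schema):
    """Apply (field name, type) pairs to each raw line only when it is read."""
    for line in lines:
        values = line.split(",")
        # zip stops at the shorter sequence, so a reader may declare
        # only the leading fields it cares about.
        yield {name: cast(value) for (name, cast), value in zip(schema, values)}


full_schema = [("id", int), ("name", str), ("age", int)]
ids_only = [("id", int)]

print(list(read_with_schema(raw_lines, full_schema)))
print(list(read_with_schema(raw_lines, ids_only)))
```

Two applications can read the same stored bytes with different schemas, which is what would not be possible if the structure had been fixed at write time.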

Storing Data

Data can be stored on HDFS or in NoSQL databases like HBase. HDFS is optimized for sequential access and the usage pattern of “Write-Once & Read-Many”. HDFS has high read and write rates as it can parallelize I/O across multiple drives. HBase sits on top of HDFS and stores data as key/value pairs in a columnar fashion. Columns are grouped together as column families. HBase is suited for random read/write access. Before data can be stored in Hadoop, you need to consider the following:

  1. Data Storage Formats: There are a number of file formats (e.g. CSV, JSON, sequence, Avro, Parquet, etc.) and data compression algorithms (e.g. Snappy, LZO, gzip, bzip2, etc.) that can be applied. Each has particular strengths. Among the compression algorithms, bzip2 is splittable, and LZO becomes splittable once the compressed files are indexed.
  2. Data Modelling: Despite the schema-less nature of Hadoop, schema design is an important consideration. This includes directory structures and schema of objects stored in HBase, Hive and Impala. Hadoop often serves as a data hub for the entire organization, and the data is intended to be shared. Hence, carefully structured and organized storage of your data is important.
  3. Metadata management: Metadata related to the stored data also needs to be captured and managed.
  4. Multitenancy: Smarter data hubs host multiple users, groups, and applications. This often results in challenges relating to governance, standardization, and management.

Processing Data

Hadoop’s processing frameworks use HDFS. They follow the “Shared Nothing” architecture, in which each node of a distributed system is completely independent of the other nodes. There are no shared resources like CPU, memory, or disk storage that can become a bottleneck. With Hadoop’s processing frameworks like Spark, Pig, Hive, Impala, etc., each node processes a distinct subset of the data, so there is no need to manage access to shared data. “Shared Nothing” architectures are very scalable, as more nodes can be added without further contention. They are also fault tolerant: each node is independent, there is no single point of failure, and the system can quickly recover from the failure of an individual node.
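
A minimal sketch of the “distinct subset per node” idea, in plain Python rather than any Hadoop framework: every key is assigned to exactly one node, each node aggregates only its own partition, and the partial results are merged without conflicts. The function names and the use of a CRC32 hash for partitioning are assumptions made for this illustration.

```python
import zlib
from collections import Counter


def node_for(key: str, num_nodes: int) -> int:
    """Deterministically assign a key to exactly one node."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition(records, num_nodes):
    """Split (key, value) records so that each key lives on a single node."""
    parts = [[] for _ in range(num_nodes)]
    for key, value in records:
        parts[node_for(key, num_nodes)].append((key, value))
    return parts


def process_locally(part):
    """Each node sums values for its own keys and touches no other node's data."""
    totals = Counter()
    for key, value in part:
        totals[key] += value
    return totals


records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
parts = partition(records, num_nodes=3)
results = [process_locally(part) for part in parts]  # parallel on a real cluster

merged = {}
for result in results:
    merged.update(result)  # safe: no key appears on more than one node
print(merged)
```

Because the partitions never overlap, adding a node only changes how keys are distributed; no locking or coordination on shared data is needed.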


How Would You Go About Choosing Among the Different File Formats for Storing and Processing Data?

One of the key design decisions is regarding file formats based on:

  1. Usage patterns like accessing 5 columns out of 50 columns vs accessing most of the columns.
  2. Splittability to be processed in parallel.
  3. Block compression saving storage space vs read/write/transfer performance
  4. Schema evolution to add fields, modify fields, and rename fields.

CSV Files

CSV files are common for exchanging data between Hadoop and external systems. CSVs are human-readable and easy to parse. They are handy for bulk loading from databases into Hadoop or into an analytic database. When using CSV files in Hadoop, never include header or footer lines; every line of the file should contain a record. CSV files have limited support for schema evolution, as new fields can only be appended to the end of a record and existing fields can never be deleted. CSV files do not support block compression, hence compressing a CSV file comes at a significant read performance cost.
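
The append-only schema evolution of CSV can be sketched with Python’s standard csv module. The field lists and sample rows are hypothetical; the point is that the field names live outside the header-less file, and rows written before a field was added are padded when read.

```python
import csv
import io

FIELDS_V1 = ["id", "name"]
FIELDS_V2 = FIELDS_V1 + ["country"]  # a new field can only go at the end


def read_records(text, fields):
    """Parse header-less CSV; rows written before a field existed get None."""
    for row in csv.reader(io.StringIO(text)):
        padded = row + [None] * (len(fields) - len(row))
        yield dict(zip(fields, padded))


# The first row was written with the old layout, the second with the new one.
data = "1,ann\n2,bob,AU\n"
rows = list(read_records(data, FIELDS_V2))
print(rows)
```

Removing or reordering a field would silently shift every later value into the wrong name, which is why existing fields have to stay in place.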

JSON Files

JSON record files are different from regular JSON files: each line is its own JSON record. As JSON stores both schema and data together in each record, it enables full schema evolution and splittability. However, JSON files do not support block-level compression.
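
A short sketch of the one-record-per-line layout, using only Python’s standard json module and made-up sample data. Each line carries its own field names, so records with different fields can sit side by side, and any chunk that ends on a newline boundary can be parsed on its own.

```python
import json

# Hypothetical file contents: one self-describing JSON record per line.
json_lines = (
    '{"id": 1, "name": "ann"}\n'
    '{"id": 2, "name": "bob", "country": "AU"}\n'
)


def read_json_lines(text):
    """Parse each non-empty line as an independent JSON record."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


records = list(read_json_lines(json_lines))

# Splitting at a line boundary gives chunks that parse independently.
lines = json_lines.splitlines(keepends=True)
chunk_a = list(read_json_lines("".join(lines[:1])))
chunk_b = list(read_json_lines("".join(lines[1:])))
print(records == chunk_a + chunk_b)  # True
```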

Sequence Files

Sequence files store data in a binary format with a structure similar to CSV files. Like CSV, Sequence files do not store metadata, hence the only form of schema evolution is appending new fields to the end of the record. Unlike CSV files, Sequence files do support block compression. Sequence files are also splittable. Sequence files can be used to solve the “small files problem” by combining smaller files (e.g. XML files) into one, storing the filename as the key and the file contents as the value. Due to the complexity of reading Sequence files, they are more suited for in-flight (i.e. intermediate) data storage.
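
The filename-as-key, contents-as-value idea can be illustrated with a toy container in plain Python. This is not the real SequenceFile format or API; the length-prefixed layout and the pack/unpack helpers are assumptions chosen only to show how many small files collapse into one larger file of key/value pairs.

```python
import io
import struct


def pack(files, out):
    """Write each (filename, contents) pair as a length-prefixed key/value record."""
    for name, data in files.items():
        key = name.encode("utf-8")
        out.write(struct.pack(">II", len(key), len(data)))  # two 4-byte lengths
        out.write(key)
        out.write(data)


def unpack(inp):
    """Read the key/value records back in the order they were written."""
    while True:
        header = inp.read(8)
        if len(header) < 8:
            break  # end of container
        key_len, value_len = struct.unpack(">II", header)
        name = inp.read(key_len).decode("utf-8")
        yield name, inp.read(value_len)


# Hypothetical small XML files.
small_files = {"a.xml": b"<a/>", "b.xml": b"<b>text</b>"}

container = io.BytesIO()
pack(small_files, container)
container.seek(0)
restored = dict(unpack(container))
print(restored == small_files)  # True
```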

Avro Files

Avro files are suited for long-term storage with a schema. Avro files store metadata with the data, but also allow an independent schema to be specified for reading the file. This enables full schema evolution support, allowing you to rename, add, and delete fields and change the data types of fields by defining a new independent schema. Avro defines the schema in JSON format, and the data is stored in a compact binary format. Avro files are also splittable and support block compression. They are more suited to usage patterns where row-level access is required, meaning all the columns in the row are queried. They are not suited to cases where a row has 50+ columns and the usage pattern requires only 10 or fewer columns to be accessed. The Parquet file format is more suited to this columnar access pattern.
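
The effect of reading old data with a new, independent schema can be sketched as follows. The schema is written in Avro’s JSON schema style (record fields with "aliases" and "default"), but the resolve function is a simplified stand-in written for this illustration, not the Avro library, which performs schema resolution on binary data using both the writer’s and the reader’s schema. The record and field names are hypothetical.

```python
# Reader schema in Avro's JSON style: "name" was renamed to "full_name",
# "country" was added with a default, and "age" was dropped.
reader_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "full_name", "type": "string", "aliases": ["name"]},
        {"name": "country", "type": "string", "default": "unknown"},
    ],
}


def resolve(record, schema):
    """Toy schema resolution: map an old record onto the reader's fields."""
    out = {}
    for field in schema["fields"]:
        candidates = [field["name"]] + field.get("aliases", [])
        for candidate in candidates:
            if candidate in record:
                out[field["name"]] = record[candidate]  # renamed or unchanged
                break
        else:
            if "default" not in field:
                raise ValueError(f"no value or default for {field['name']}")
            out[field["name"]] = field["default"]  # field added after writing
    return out  # fields absent from the reader schema are dropped


old_record = {"id": 1, "name": "ann", "age": 17}
print(resolve(old_record, reader_schema))
# {'id': 1, 'full_name': 'ann', 'country': 'unknown'}
```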

Columnar Formats, e.g. RCFile, ORC

RDBMSs store records in a row-oriented fashion, as this is efficient for cases where many columns of a record need to be fetched. Row-oriented writing is also efficient if all the column values are known at the time of writing a record to disk. But this approach is not efficient for fetching just 10% of the columns in a row, or when not all the column values are known at the time of writing. This is where columnar files make more sense. A columnar format works well:

  • for queries that only access a small subset of columns.
  • for skipping I/O and decompression on columns that are not part of the query.
  • for data-warehousing-type applications where users want to aggregate certain columns over a large collection of records.
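
A small in-memory sketch of the difference, using made-up records. It is not how RCFile, ORC, or Parquet lay out bytes on disk; it only shows that an aggregate over one column touches far fewer values when the data is organized by column.

```python
# Row-oriented: each record keeps all of its column values together.
rows = [
    {"id": 1, "name": "ann", "age": 17},
    {"id": 2, "name": "bob", "age": 42},
    {"id": 3, "name": "cy", "age": 19},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Aggregating one column only needs that column's values.
average_age = sum(columns["age"]) / len(columns["age"])

values_read_row_layout = sum(len(record) for record in rows)  # every value
values_read_column_layout = len(columns["age"])               # one column only
print(average_age, values_read_row_layout, values_read_column_layout)  # 26.0 9 3
```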

The RC and ORC formats were written specifically for Hive and are not as general purpose as Parquet.

Parquet Files

Parquet is a columnar file format like RC and ORC. Parquet files support block compression and are optimized for query performance, as 10 or fewer columns can be selected from records with 50+ columns. Parquet write performance is slower than that of non-columnar file formats. Parquet also supports limited schema evolution by allowing new columns to be added at the end. Parquet files can be read and written with Avro APIs and Avro schemas.

So, in summary, you should favor Sequence, Avro, and Parquet file formats over the others; Sequence files for raw and intermediate storage, and Avro and Parquet files for processing.


