登录查看更多内容

File Formats in Apache HIVE

Prateek K.

Data Science

发布日期: 2017年7月31日

This Blog aims at discussing the different file formats available in Apache Hive. After reading this Blog you will get a clear understanding of the different file formats that are available in Hive and how and where to use them appropriately. Before we move forward let's discuss Apache Hive.

Apache Hive

Apache Hive is an open source data warehouse software that facilitates querying and managing of large datasets residing in distributed storage. Hive provides a language called HiveQL which allows users to query and is similar to SQL.

Like SQL, HiveQL handles structured data only. By default, Hive has derby database to store the data in it. We can configure Hive with MySQL database. As mentioned HiveQL can handle only structured data. Data is eventually stored in files. There are some specific file formats which Hive can handle such as:

? TEXTFILE

? SEQUENCEFILE

? RCFILE

? ORCFILE

Before going deep into the types of file formats lets first discuss what a file format is!

File Format

A file format is a way in which information is stored or encoded in a computer file. In Hive it refers to how records are stored inside the file. As we are dealing with structured data, each record has to be its own structure. How records are encoded in a file defines a file format.

These file formats mainly vary between data encoding, compression rate, usage of space and disk I/O.

Hive does not verify whether the data that you are loading matches the schema for the table or not. However, it verifies if the file format matches the table definition or not.

Let us now discuss the types of file formats in detail.

TEXTFILE

TEXTFILE format is a famous input/output format used in Hadoop. In Hive if we define a table as TEXTFILE it can load data of from CSV (Comma Separated Values), delimited by Tabs, Spaces, and JSON data. This means fields in each record should be separated by comma or space or tab or it may be JSON(JavaScript Object Notation) data.

By default, if we use TEXTFILE format then each line is considered as a record.

We can create a TEXTFILE format in Hive as follows:

create table table_name (schema of the table) row format delimited fields terminated by ',' | stored as TEXTFILE.

At the end, we need to specify the type of file format. If we do not specify anything it will consider the file format as TEXTFILE format.

The TEXTFILE input and TEXTFILE output format are present in the Hadoop package as shown below:

org.apache.hadoop.mapred.TextInputFormat
org.apache.hadoop.mapred.TextOutputFormat

Let us see one example in Hive about how to create TEXTFILE table format, how to load data into TEXTFILE format and perform one basic select operation in Hive.

Creating TEXTFILE

create table olympic(athelete STRING,age INT,country STRING,year STRING,closing STRING,sport STRING,gold INT,silver INT,bronze INT,total INT) row format delimited fields terminated by '\t' stored as textfile;

Here we are creating a table with name “olympic" and the schema of the table is as specified above. The data inside the above input file is delimited by tab space. As explained earlier the file format is specified as TEXTFILE at the end. The schema of the table created above can be checked using describe olympic;

We can load data into the created table as follows:

load data local inpath 'path of your file' into table olympic;

We have successfully loaded our input file data into our table which is of TEXTFILE format.

Now we will perform one basic SELECT operation on the data as shown below:

 select athelete from olympic;

The data retrieved is as shown in Figure below:

SEQUENCEFILE

We know that Hadoop's performance is drawn out when we work with a small number of files with big size rather than a large number of files with small size. If the size of a file is smaller than the typical block size in Hadoop, we consider it as a small file. Due to this, a number of metadata increases which will become an overhead to the NameNode. To solve this problem sequence files are introduced in Hadoop. Sequence files act as a container to store the small files.

Sequence files are flat files consisting of binary key-value pairs. When Hive converts queries to MapReduce jobs, it decides on the appropriate key-value pairs to be used for a given record. Sequence files are in the binary format which can be split and the main use of these files is to club two or more smaller files and make them as a one sequence file.

In Hive we can create a sequence file by specifying STORED AS SEQUENCEFILE in the end of a CREATE TABLE statement.

There are three types of sequence files:

? Uncompressed key/value records.

? Record compressed key/value records - only 'values' are compressed here

? Block compressed key/value records - both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.

Hive has its own SEQUENCEFILE reader and SEQUENCEFILE writer libraries for reading and writing through sequence files.

In Hive we can create a sequence file format as follows:

create table table_name (schema of the table) row format delimited fileds terminated by ',' | stored as SEQUENCEFILE

Hive uses the SEQUENCEFILE input and output formats from the following packages:

org.apache.hadoop.mapred.SequenceFileInputFormat
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Creating SEQUENCEFILE

create table olympic_sequencefile(athelete STRING,age INT,country STRING,year STRING,closing STRING,sport STRING,gold INT,silver INT,bronze INT,total INT) row format delimited fields terminated by '\t' stored as sequencefile

Here we are creating a table with name olympic_sequencefile and the schema of the table is as specified above and the data inside my input file is delimited by tab space. At the end the file format is specified as SEQUENCEFILE format. You can check the schema of your created table using:

 describe olympic_sequencefile;

Now to load data into this table is somewhat different from loading into the table created using TEXTFILE format. You need to insert the data from another table because this SEQUENCEFILE format is the binary format. It compresses the data and then stores it into the table. If you want to load directly as in TEXTFILE format that is not possible because we cannot insert the compressed files into tables.

So to load the data into SEQUENCEFILE we need to use the following approach:

INSERT OVERWRITE TABLE olympic_sequencefile
SELECT * FROM olympic;

We have already created table by name olympic which is of TEXTFILE format and then we are writing the contents of the olympic table into olympic_sequencefile table.

Thus we have successfully loaded data into the SEQUENCEFILE.

Now let us perform the same basic SELECT operation which we have performed on the TEXTFILE format, on SEQUENCEFILE format also.

 select athelete from olympic_sequencefile;

RCFILE

RCFILE stands of Record Columnar File which is another type of binary file format which offers high compression rate on the top of the rows.

RCFILE is used when we want to perform operations on multiple rows at a time.

RCFILEs are flat files consisting of binary key/value pairs, which shares many similarities with SEQUENCEFILE. RCFILE stores columns of a table in form of record in a columnar manner. It first partitions rows horizontally into row splits and then it vertically partitions each row split in a columnar way. RCFILE first stores the metadata of a row split, as the key part of a record, and all the data of a row split as the value part. This means that RCFILE encourages column oriented storage rather than row oriented storage.

This column oriented storage is very useful while performing analytics. It is easy to perform analytics when we “hive’ a column oriented storage type.

Facebook uses RCFILE as its default file format for storing of data in their data warehouse as they perform different types of analytics using Hive.

In Hive we can create a RCFILE format as follows:

 create table table_name (schema of the table) row format delimited fields terminated by ',' | stored as RCFILE

Hive has its own RCFILE Input format and RCFILE output format in its default package:

org.apache.hadoop.hive.ql.io.RCFileInputFormat
org.apache.hadoop.hive.ql.io.RCFileOutputFormat

Creating RCFILE

create table olympic_rcfile(athelete STRING,age INT,country STRING,year STRING,closing STRING,sport STRING,gold INT,silver INT,bronze INT,total INT) row format delimited fields terminated by '\t' stored as rcfile

Here we are creating a table with name olympic_rcfile and the schema of the table is as specified above. The data inside the input file is delimited by tab space.

At the end the file format is specified as RCFILE format.

You can check the schema of your created table using:

 describe olympic_rcfile;

We cannot load data into RCFILE directly. First we need to load data into another table and then we need to overwrite it into our newly created RCFILE as shown below:

INSERT OVERWRITE TABLE olympic_rcfile
SELECT * FROM olympic;

We have already created a table by name olympic which is of TEXTFILE format and then we are writing the contents of the olympic table into olympic_rcfile table.

As shown above we have successfully loaded data into the RCFILE.

Now let us perform the same basic SELECT operation which we have performed on the TEXTFILE format on RCFILE format as well as shown in Figure below:

select athelete from olympic_rcfile;

ORCFILE

ORC stands for Optimized Row Columnar which means it can store data in an optimized way than the other file formats. ORC reduces the size of the original data up to 75%(eg: 100GB file will become 25GB). As a result the speed of data processing also increases. ORC shows better performance than Text, Sequence and RC file formats.

An ORC file contains rows data in groups called as Stripes along with a file footer. ORC format improves the performance when Hive is processing the data.

In Hive we can create a RCFILE format as follows:

 create table table_name (schema of the table) row format delimited fields terminated by ',' | stored as ORC

Hive has its own ORCFILE Input format and ORCFILE output format in its default package:

 org.apache.hadoop.hive.ql.io.orc

Creating ORCFILE

create table olympic_orcfile(athelete STRING,age INT,country STRING,year STRING,closing STRING,sport STRING,gold INT,silver INT,bronze INT,total INT) row format delimited fields terminated by '\t' stored as orcfile;

Here we are creating a table with name olympic_orcfile and the schema of the table is as specified above. The data inside the input file is delimited by tab space.

At the end the file format is specified as ORCFILE format.

You can check the schema of your created table using:

 describe olympic_orcfile;

We cannot load data into ORCFILE directly. First we need to load data into another table and then we need to overwrite it into our newly created ORCFILE.

INSERT OVERWRITE TABLE olympic_orcfile
SELECT * FROM olympic;

Here is a table created by name olympic which is of TEXTFILE format and then we need to write the contents of the olympic table into olympic_orcfile table.

Thus we have successfully loaded data into the ORCFILE.

Now let us perform the same basic SELECT operation which we have performed on the TEXTFILE format, on ORCFILE format as well.

select athelete from olympic_orcfile;

Thus you can use the above four file formats depending on your data.

For example,

a) If your data is delimited by some parameters then you can use TEXTFILE format.

b) If your data is in small files whose size is less than the block size then you can use SEQUENCEFILE format.

c) If you want to perform analytics on your data and you want to store your data efficiently for that then you can use RCFILE format.

d) If you want to store your data in an optimized way which lessens your storage and increases your performance then you can use ORCFILE format.

Hope with this Blog you now have a clear picture as to which File Format to use in Hive depending on your data.

Keep visiting acadgild.com for more updates on Bigdata and other technologies.

要查看或添加评论，请登录

Prateek K.的更多文章

Why Becoming a Data Scientist is the Next Logical Move?

2018年1月17日

Why Becoming a Data Scientist is the Next Logical Move?

Amongst the variety of job profiles, Data Scientist might not be the first job to come to your mind. But big companies…
Frequently Asked Hadoop Interview Questions in 2017 Part - 2

2017年9月20日

Frequently Asked Hadoop Interview Questions in 2017 Part - 2

Before going through this Hadoop interview questions part-2, we recommend our users to go through our previous post on…

2 条评论
Hive UseCase: Breast Cancer Data Analysis

2017年9月18日

Hive UseCase: Breast Cancer Data Analysis

This blog will help readers to learn data analytics using HIVE. The very common problem present in whole world coming…
Data Serialization with Avro in Hive

2017年8月16日

Data Serialization with Avro in Hive

This blog focuses on providing in-depth information of Avro in Hive. Here we have discussed the importance and…
Writable and WritableComparable in Hadoop

2017年7月28日

Writable and WritableComparable in Hadoop

This blog helps those people who want to build their own custom types in Hadoop which is possible only with Writable…

1 条评论
Frequently Asked Hadoop Interview Questions in 2017 – Part 1

2017年7月19日

Frequently Asked Hadoop Interview Questions in 2017 – Part 1

In this first Part of Hadoop interview Questions, we would be discussing various questions related to Big Data Hadoop…

1 条评论
Static vs dynamic partition in hive

2017年4月14日

Static vs dynamic partition in hive

In our Last post, we mentioned regarding Introduction to Hive Partition therein post we clearly mentioned the most…

1 条评论
Strict Mode HIVE

2017年3月24日

Strict Mode HIVE

We have the partition in Hive to save lots of our time whereas performing analysis on the area only we want and not the…

2 条评论
Solving the Unstructured Data Dilemma

2017年3月12日

Solving the Unstructured Data Dilemma

A web browser generates tons of data every day. All of which may not be stored in a structured format.

See all articles

File Formats in Apache HIVE

Prateek K.

Data Science

Prateek K.的更多文章

社区洞察

其他会员也浏览了

Friends don't let friends use MongoDB!

From ElasticSearch back to SQL Server

NewSQL: is the code back in the DB?

Apache Hive Performance Tuning Best Practices

Comparing Apache Iceberg, Delta Lake, and Parquet: Optimizing Big Data Performance Beyond Traditional SQL Databases

Amazon Athena

Self-Reflection of MongoDB-Workshop

Apache Drill, not bad at all!

Beginners Guide to Apache HIVE.

Postgres and MySQL SQL Language/Query Differences

Prateek K.的更多文章

Why Becoming a Data Scientist is the Next Logical Move?

Frequently Asked Hadoop Interview Questions in 2017 Part - 2

Hive UseCase: Breast Cancer Data Analysis

Data Serialization with Avro in Hive

Writable and WritableComparable in Hadoop

Frequently Asked Hadoop Interview Questions in 2017 – Part 1

Static vs dynamic partition in hive

Strict Mode HIVE

Solving the Unstructured Data Dilemma

社区洞察

其他会员也浏览了

Friends don't let friends use MongoDB!

From ElasticSearch back to SQL Server

NewSQL: is the code back in the DB?

Apache Hive Performance Tuning Best Practices

Comparing Apache Iceberg, Delta Lake, and Parquet: Optimizing Big Data Performance Beyond Traditional SQL Databases

Amazon Athena

Self-Reflection of MongoDB-Workshop

Apache Drill, not bad at all!

Beginners Guide to Apache HIVE.

Postgres and MySQL SQL Language/Query Differences