Several Pivotal Concepts of Hive
What is Apache Hive?
As data grew in size, there was also a scarcity of Java developers who could write the complex MapReduce jobs Hadoop requires. Hence the advent of Hive, which is built on top of Hadoop itself. Hive provides an SQL-like language called HiveQL as an interface for users to extract data from a Hadoop system. Hive transforms simple SQL-style queries into Hadoop MapReduce jobs and runs them against a Hadoop cluster.
Apache Hive is well suited to data warehousing applications, where the data is structured, static, and formatted. Owing to certain design constraints, Hive does not provide row-level updates and inserts (often cited as its biggest disadvantage). Because most Hive queries are turned into MapReduce jobs, they have high latency due to start-up overhead.
Based on these details, Hive is NOT:
- A relational database
- Designed for OLTP (Online Transaction Processing)
- A language for real-time queries or row-level updates
Features of Hive
With a basic understanding of what Apache Hive is, let us now look at the features provided by this component of the Hadoop ecosystem:
- Hive stores schema details in a database (the metastore) and the data itself in HDFS.
- Hive is designed for OLAP (Online Analytical Processing).
- Hive provides an SQL-like language for querying data, called HiveQL or HQL (not to be confused with Hibernate's HQL, the Hibernate Query Language); a query example follows this list.
- Hive is a fast, scalable, and extensible component of the Hadoop ecosystem.
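To make the SQL-like nature of HiveQL concrete, here is a minimal query sketch; the employees table and its columns are illustrative assumptions, not part of the original text:

```sql
-- Familiar SQL syntax; Hive compiles this query into one or more MapReduce jobs
SELECT department, COUNT(*) AS headcount
FROM employees
WHERE hire_year = 2020
GROUP BY department;
```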
External Table
If data already exists in the HDFS cluster, an external Hive table can be created to describe it. Such tables are called external tables because their data resides in the path given by the LOCATION property instead of the default warehouse directory (described in the Internal or Managed Table section below).
When an external table is dropped, only the table metadata is deleted; the data is kept as is. Hive deliberately leaves the files in the LOCATION path untouched. If you also want to delete that data, remove the directory directly from HDFS:
hadoop fs -rm -r '<table location path>'
For example, we add the EXTERNAL keyword (and typically a LOCATION clause) when defining such a table, as in the sketch below.
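A minimal sketch, assuming illustrative column names and an HDFS directory that already holds the data:

```sql
-- External table: Hive records only the schema; the files stay where they are
CREATE EXTERNAL TABLE table_name (
  id   STRING,
  name STRING,
  age  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/existing/table_name';
```

Dropping this table removes only the metadata; the files under the LOCATION path remain in HDFS.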
Internal or Managed Table
Tables created in Hive are very similar to tables created in any RDBMS. Each table that is created is associated with a directory in the Hadoop HDFS cluster; the root of this directory tree is configured in ${HIVE_HOME}/conf/hive-site.xml (the hive.metastore.warehouse.dir property).
By default, this path is /user/hive/warehouse in HDFS. For a table named match, for instance, Hive creates the directory /user/hive/warehouse/match, and all of the table's data is stored in that folder; such tables are therefore called INTERNAL or MANAGED tables.
When data resides in an internal table, Hive takes full responsibility for the life-cycle of both the table and its data. It is therefore evident that the data is removed the moment the internal table is dropped.
By default, Hive creates managed tables, as in the sketch below.
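A minimal sketch with illustrative column names; without the EXTERNAL keyword the data is placed under the warehouse directory and is deleted together with the table:

```sql
-- Managed (internal) table: stored under /user/hive/warehouse/table_name by default
CREATE TABLE table_name (
  id   STRING,
  name STRING,
  age  STRING
);
```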
Partitioning of Table
Hive can store tables in partitions. Partitions divide a table into related parts based on the values of particular columns, which makes queries that filter on those columns more efficient.
For example, using the SQL below.
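A minimal sketch with illustrative columns and a partition column named dt:

```sql
-- Partitioned table: each distinct value of dt becomes its own sub-directory in HDFS
CREATE TABLE table_name (
  id   STRING,
  name STRING,
  age  STRING
)
PARTITIONED BY (dt STRING);
```

A query such as SELECT * FROM table_name WHERE dt = '2020-01-01' then reads only the matching partition directory instead of the whole table.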
Bucketing of Table
Partitioning increases efficiency when querying a table: instead of scanning the whole table, Hive scans only the partitions that match the query's filter, so results are returned far more quickly.
Now assume a huge dataset. At times, even after partitioning on a particular field or fields, the partitions remain much larger than expected, and we want to split the data further into manageable parts. To address this, Hive provides the bucketing concept, which divides a table (or each partition) into a fixed number of more manageable parts.
Bucketing is based on a hash function, whose behaviour depends on the type of the bucketing column. Records with the same value in the bucketing column are always stored in the same bucket.
In Hive, each partition is created as a directory, whereas each bucket is created as a file. Bucketing can also be used on tables that are not partitioned at all.
Bucketed tables are created with the CLUSTERED BY clause and the optional SORTED BY clause of the CREATE TABLE statement.
For example, we can create a bucketed_user table with the HiveQL below.
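A minimal sketch with illustrative user columns; the bucketing column and bucket count are assumptions made for the example:

```sql
-- Rows are assigned to one of 32 bucket files by hash(user_id) % 32
CREATE TABLE bucketed_user (
  user_id   INT,
  firstname STRING,
  lastname  STRING,
  city      STRING
)
PARTITIONED BY (country STRING)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```

On older Hive versions (before 2.x), SET hive.enforce.bucketing = true; should be issued before inserting data so that rows are actually distributed into the 32 bucket files.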
Row Format (Delimited / SerDe)
SerDe
Basically, SerDe is a short name for Serializer/Deserializer.
Deserializer
The Hive deserializer converts a record (a string or binary value) into a Java object that Hive can process.
HDFS files -> InputFileFormat -> <key, value> -> Deserializer -> Row object
Serializer
The Hive serializer then takes this Java object and converts it into a format suitable for storage in HDFS.
Row object -> Serializer -> <key, value> -> OutputFileFormat -> HDFS files
So a SerDe is responsible for converting record bytes into something Hive can use, and back again. Hive ships with several SerDes, such as a JSON SerDe for JSON files and a CSV SerDe for CSV files.
For example, using the JsonSerDe, as in the sketch below.
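A minimal sketch of a table backed by JSON files, with illustrative columns; the JsonSerDe class ships in the hive-hcatalog-core jar, which may need to be added to the session on older installations:

```sql
-- Each line of the underlying files is expected to hold one JSON object
CREATE TABLE json_table (
  id   STRING,
  name STRING,
  age  STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
```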
More specific SerDes are documented on the official Hive site.
Row Format
You can create tables with a custom SerDe or using a native SerDe. A native SerDe is used if ROW FORMAT is not specified or ROW FORMAT DELIMITED is specified.
Use the SERDE clause to create a table with a custom SerDe.
Here is an example.
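A minimal sketch with illustrative columns, using the OpenCSVSerde that also appears in the CSV import example later in this article:

```sql
-- The SERDE clause selects a custom SerDe instead of the native delimited one
CREATE TABLE table_name (
  id   STRING,
  name STRING,
  age  STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;
```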
ROW FORMAT DELIMITED: this line tells Hive to expect one record per line in the file, i.e. a newline character marks the start of a new record.
Fields Terminated
For example, see the sketch below.
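A minimal sketch with illustrative columns, using the native delimited SerDe:

```sql
-- One record per line, columns separated by commas
CREATE TABLE table_name (
  id   STRING,
  name STRING,
  age  STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```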
FIELDS TERMINATED BY ',': this is very similar to the clause above, but for columns rather than rows; it tells Hive which delimiter separates the columns in your files. If none is set, the default delimiter, Ctrl-A, is used.
Storage Format
Hive supports built-in and custom-developed file formats.
STORED AS TEXTFILE
Stored as plain text files. TEXTFILE is the default file format, unless the configuration parameter hive.default.fileformat has a different setting.
STORED AS SEQUENCEFILE
Stored as compressed Sequence File.
STORED AS AVRO
Stored as Avro format.
STORED AS JSONFILE
Stored as JSON file format.
STORED AS ORC
Stored as Optimized Row Columnar (ORC) file format.
For example, see the sketch below.
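A minimal sketch with illustrative columns; TEXTFILE is spelled out here even though it is the default:

```sql
-- STORED AS makes the file format explicit; replace TEXTFILE with ORC, AVRO, etc. as needed
CREATE TABLE table_name (
  id   STRING,
  name STRING,
  age  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```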
STORED AS TEXTFILE: this tells Hive which file format to expect for the table's data.
Import CSV File
Step 1:
If we want to import data from an external CSV file that uses a comma to separate columns, we can use the CREATE TABLE statement below.
create table table_name
(
id string,
name string,
age string
)
row format serde
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with SERDEPROPERTIES
("separatorChar" = ",", "quoteChar" = "\"")
STORED AS TEXTFILE;

Step 2:
Load the data from the CSV file into the table. With the LOCAL keyword the path refers to the local filesystem; to load a file that already resides in HDFS, omit LOCAL (see the sketch after the command).
load data local inpath '/home/hive/table_name.csv' overwrite into table table_name;
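If the CSV file is already in HDFS, the LOCAL keyword is omitted; the path below is illustrative:

```sql
-- Without LOCAL, the path is resolved in HDFS and the file is moved (not copied) into the table's directory
LOAD DATA INPATH '/user/hive/staging/table_name.csv' OVERWRITE INTO TABLE table_name;
```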