26 October 2017
Apache Hive is a data warehouse software project built on top of Apache Hadoop that provides data summarization, query, and analysis. This blog explains some of the optimization techniques that can be applied to Hive queries.
Partitions are fundamentally horizontal slices of data which allow large sets of data to be segmented into more manageable chunks. When you define the table with partitions, the underlying structure is changed such that sub-directories are created for each slicer or partitioning column.
Partitions are a physical segmentation of the data: each partition is maintained as a directory in the file system, and queries use the metadata to determine where the partition is located. So if you can make the directory structure match the query, Hive reads only the data you want.
Partition columns are virtual columns; they are not part of the data itself but are derived on load. Each table can have one or more partition keys, which determine how the data is stored. Partitions, apart from being storage units, also allow the user to efficiently identify the rows that satisfy certain criteria.
Below is an example:
CREATE TABLE IF NOT EXISTS partitioned_user(
  firstname VARCHAR(64),
  lastname VARCHAR(64),
  address STRING,
  email STRING)
COMMENT 'User Details'
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Note that the PARTITIONED BY clause must appear before ROW FORMAT, and the partition column is named dt here because date is a reserved word in Hive.
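Once the partitioned table exists, data can be loaded into a specific (static) partition. This is a sketch assuming a hypothetical local input file path and a partition column named dt:

```sql
-- Loads the file into the sub-directory for partition dt='2017-10-26'
LOAD DATA LOCAL INPATH '/tmp/user_data.csv'
INTO TABLE partitioned_user
PARTITION (dt = '2017-10-26');
```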
To set a limit on the maximum number of partitions that can be created, use the properties below.
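A sketch of those settings, using Hive's standard dynamic-partition limit properties (the values shown are Hive's documented defaults):

```sql
-- Maximum number of dynamic partitions allowed in total (default 1000)
SET hive.exec.max.dynamic.partitions=1000;
-- Maximum number of dynamic partitions allowed per mapper/reducer node (default 100)
SET hive.exec.max.dynamic.partitions.pernode=100;
-- Maximum total files created by all mappers and reducers (default 100000)
SET hive.exec.max.created.files=100000;
```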
A partition is called a dynamic partition in Hive when the partition is created in both the file system and the metastore automatically as data is inserted into the table.
There are two components to a partition: the entry in the Hive metastore and the data directory in the file system. The metastore entry is essentially just the pair (partition values, partition location).
In Hive, as data is written to disk, each partition of data is automatically split out into a different folder. For a dynamic-partition insert, the partition columns must be the last columns in the query. Dynamic partitions are most useful when there is a large amount of data and you don't know how many unique values exist for the partition column.
We need to set the following two properties to use dynamic partitioning:
hive> SET hive.exec.dynamic.partition=true;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
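As a sketch, assuming a hypothetical staging table raw_user with the same columns plus a dt column, a dynamic-partition insert into the partitioned_user table might look like this:

```sql
-- The partition column (dt) must come last in the SELECT list;
-- Hive derives the target partition from its value in each row.
INSERT OVERWRITE TABLE partitioned_user PARTITION (dt)
SELECT firstname, lastname, address, email, dt
FROM raw_user;
```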
Bucketing is a technique that allows you to cluster or segment large sets of data to optimize query performance. Unlike partitioning where each value for the slicer or partition key gets its own space, in clustering a hash is taken for the field value and then distributed across buckets.
Buckets allow the system to prune large quantities of data during query processing, resulting in faster query execution. Bucketing also enables efficient map-side joins.
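For illustration, a bucketed version of a user table might be declared like this (the table and column names are hypothetical); each row is assigned to a bucket by hashing the clustering column:

```sql
-- hash(user_id) modulo 32 decides which bucket file a row lands in
CREATE TABLE bucketed_user(
  user_id INT,
  firstname STRING,
  lastname STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- In older Hive versions, bucketing must be enforced on insert:
SET hive.enforce.bucketing = true;
```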
In Hive, indexes are used to improve query speed for certain columns of a table. The improvement in query speed that an index provides comes at the cost of additional processing to create the index and disk space to store it.
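A minimal sketch of index creation (index and column names are hypothetical; note that built-in indexing was later removed in Hive 3.0):

```sql
-- A COMPACT index stores (value, block offsets); REBUILD populates it
CREATE INDEX idx_user_email
ON TABLE partitioned_user (email)
AS 'COMPACT'
WITH DEFERRED REBUILD;

ALTER INDEX idx_user_email ON partitioned_user REBUILD;
```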
There are 3 available options for the hive.fetch.task.conversion property: none, minimal, and more.
select * from <table-name>
This is just a read operation for the Hive engine: the data is simply read from your file and dumped to the screen. No computation is involved, so the default execution engine, i.e., MapReduce, doesn't start any jobs.
It's an optimization technique. The hive.fetch.task.conversion property can minimize the latency of the MapReduce overhead. For simple queries such as SELECT with FILTER or LIMIT clauses, this property skips MapReduce and uses a FETCH task instead. As a result, Hive can execute such queries without running a MapReduce job.
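For example, setting the property to its most aggressive value lets simple SELECT/FILTER/LIMIT queries run as a direct fetch:

```sql
-- 'more' converts SELECT, FILTER, and LIMIT queries to a fetch task;
-- 'minimal' covers only SELECT *, filters on partition columns, and LIMIT
SET hive.fetch.task.conversion=more;
```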
select <col1 ,col2> from <table-name>
This operation requires a projection, i.e., some computational work needs to be performed before the result is shown. So MapReduce jobs start, and it takes a while before you get the result.
If all but one table is small, the largest table can be streamed through the mappers while the small tables are cached in memory. Hive can do all the joining map-side, since it can look up every possible match against the small tables in memory, thereby eliminating the reduce step required in the more common join scenarios. Even on smaller data sets, this optimization is noticeably faster than the normal join. Not only does it eliminate reduce steps, it sometimes reduces the number of map steps, too.
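A sketch of a map-side join, assuming a hypothetical large orders table and a small countries lookup table. Hive can convert the join automatically based on table size, or you can hint it explicitly:

```sql
-- Let Hive convert joins with small tables into map joins automatically
SET hive.auto.convert.join=true;
-- Tables below this size in bytes are considered small enough to cache in memory
SET hive.mapjoin.smalltable.filesize=25000000;

-- Explicit hint form: cache 'countries' in every mapper's memory
SELECT /*+ MAPJOIN(c) */ o.order_id, c.country_name
FROM orders o
JOIN countries c ON o.country_code = c.code;
```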
When you create a table in Hive, you specify the delimiter to let Hive know the format of the data in the input file.
CREATE TABLE emp( Id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
In the above Hive CREATE TABLE example, '\t' is the field delimiter. The default record delimiter in Hive is \n, and the default field delimiters are \001 (fields), \002 (collection items), and \003 (map keys).
ROW FORMAT should have delimiters used to terminate the fields and lines as shown above.
It is possible to populate a Hive table without an explicit INSERT or LOAD statement: we can create a table from another existing table (CREATE TABLE AS SELECT). Here is the syntax for that:
CREATE TABLE tbl1 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' AS SELECT * FROM tbl2;
You can also create an external table in Hive using HBase as below-
CREATE EXTERNAL TABLE hbase_hive_names(fields_names) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,id:id,name:fn,name:ln,age:age") TBLPROPERTIES("hbase.table.name" = "hbase_2_hive_names");
A skewed table is a special type of table where the values that appear very often (heavy skew) are split out into separate files and the rest of the values go to another file. Create Hive skewed table syntax:
CREATE TABLE T (c1 string, c2 string) SKEWED BY (c1) ON ('x1');
The skewed table is again one of the important Hive query optimization techniques.
 Hive Joins