5 October 2017
Apache Hive is a data warehouse infrastructure built on top of Apache Hadoop that provides a platform for data summarization, ad-hoc querying, aggregation, and analysis of large datasets. The Hive Query Language (HiveQL) is the query language used by Hive to process and analyze structured data stored in Apache Hadoop.
Even though Derby is the default metastore database in Hive, we can change it by editing hive-site.xml in the conf directory of the Hive installation.
<configuration>
  <property>
    <name>hive.metastore.local</name>
    <value>true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://<ip address of mysql machine>/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>username</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
  </property>
</configuration>
The schema and detailed metadata of a table can be viewed with:
DESCRIBE <table_name>;
DESCRIBE EXTENDED <table_name>;
A sample employees table is given below.
hive> SELECT * FROM employees LIMIT 10;
+--------+------------+------------+-----------+--------+------------+
| emp_no | birth_date | first_name | last_name | gender | hire_date |
+--------+------------+------------+-----------+--------+------------+
| 10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 |
| 10002 | 1964-06-02 | Bezalel | Simmel | F | 1985-11-21 |
| 10003 | 1959-12-03 | Parto | Bamford | M | 1986-08-28 |
| 10004 | 1954-05-01 | Chirstian | Koblick | M | 1986-12-01 |
| 10005 | 1955-01-21 | Kyoichi | Maliniak | M | 1989-09-12 |
| 10006 | 1953-04-20 | Anneke | Preusig | F | 1989-06-02 |
| 10007 | 1957-05-23 | Tzvetan | Zielinski | F | 1989-02-10 |
| 10008 | 1958-02-19 | Saniya | Kalloufi | M | 1994-09-15 |
| 10009 | 1952-04-19 | Sumant | Peac | F | 1985-02-18 |
| 10010 | 1963-06-01 | Duangkaew | Piveteau | F | 1989-08-24 |
+--------+------------+------------+-----------+--------+------------+
-- Duplicate rows can be removed by selecting DISTINCT values of all columns
SELECT DISTINCT birth_date, first_name, last_name, gender, hire_date
FROM employees;
ALTER TABLE <table_name> CHANGE <column_name> <column_name> <new_data_type>;
-- Example
ALTER TABLE Employee CHANGE salary salary BIGINT;
ALTER TABLE hive_table_name RENAME TO new_table_name;
An inner join returns the rows that have matching values in both tables.
-- Joining two tables on the common emp_no column
hive> SELECT * FROM employees e JOIN salaries s ON (e.emp_no = s.emp_no) LIMIT 5;
+--------+------------+------------+-----------+--------+------------+--------+--------+------------+------------+
| emp_no | birth_date | first_name | last_name | gender | hire_date | emp_no | salary | from_date | to_date |
+--------+------------+------------+-----------+--------+------------+--------+--------+------------+------------+
| 10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 | 10001 | 60117 | 1986-06-26 | 1987-06-26 |
| 10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 | 10001 | 62102 | 1987-06-26 | 1988-06-25 |
| 10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 | 10001 | 66074 | 1988-06-25 | 1989-06-25 |
| 10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 | 10001 | 66596 | 1989-06-25 | 1990-06-25 |
| 10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 | 10001 | 66961 | 1990-06-25 | 1991-06-25 |
+--------+------------+------------+-----------+--------+------------+--------+--------+------------+------------+
A left outer join returns all the rows from the left table, even if there are no matches in the right table. Columns from the right table are returned as NULL when there is no matching row.
SELECT * FROM employees e LEFT OUTER JOIN salaries s ON (e.emp_no = s.emp_no) LIMIT 5;
As of Hive version 0.14.0: INSERT…VALUES, UPDATE, and DELETE are now available with full ACID support.
INSERT … VALUES Syntax:
INSERT INTO TABLE <tablename> [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]
Where values_row is: ( value [, value …] ) where a value is either null or any valid SQL literal
UPDATE Syntax:
UPDATE <tablename> SET column = value [, column = value ...] [WHERE expression]
DELETE Syntax:
DELETE FROM <tablename> [WHERE expression]
Additionally, from the Hive Transactions doc:
If a table is to be used in ACID writes (insert, update, delete) then the table property “transactional” must be set on that table, starting with Hive 0.14.0. Without this value, inserts will be done in the old style; updates and deletes will be prohibited.
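As an illustration, here is a minimal sketch of a transactional table and the three DML statements, assuming Hive 0.14 with transactions enabled; the table and column names are hypothetical. In this release an ACID table must be bucketed, stored as ORC, and marked with the transactional property.
-- Hypothetical ACID table: bucketed, ORC, and marked transactional
CREATE TABLE employee_txn (
  emp_no INT,
  name   STRING,
  salary BIGINT
)
CLUSTERED BY (emp_no) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

INSERT INTO TABLE employee_txn VALUES (10001, 'Georgi', 60117);
UPDATE employee_txn SET salary = 62102 WHERE emp_no = 10001;
DELETE FROM employee_txn WHERE emp_no = 10001;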
Let us suppose we have a table employee consisting of name and salary columns.
-- Second lowest distinct salary (use ORDER BY salary DESC for the second highest)
SELECT *
FROM (SELECT salary, ROW_NUMBER() OVER (ORDER BY salary) AS row_no
      FROM employee
      GROUP BY salary) res
WHERE res.row_no = 2;
-- Third lowest distinct salary
SELECT *
FROM (SELECT salary, ROW_NUMBER() OVER (ORDER BY salary) AS row_num
      FROM employee
      GROUP BY salary) res
WHERE res.row_num = 3;
This can be done by setting Hive to strict mode with set hive.mapred.mode=strict;
Strict mode ensures that queries on a partitioned table cannot execute without a WHERE clause that filters on the partition column.
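For illustration, a small sketch of strict mode in action, assuming a table partitioned by a country column (the table and partition column names are hypothetical):
set hive.mapred.mode=strict;
-- Fails in strict mode: no filter on the partition column
-- SELECT * FROM partitioned_user;
-- Succeeds: the partition column is restricted in the WHERE clause
SELECT * FROM partitioned_user WHERE country = 'US';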
The ORDER BY clause is familiar from other SQL dialects. It performs a total ordering of the query result set. This means that all the data is passed through a single reducer, which may take an unacceptably long time to execute for larger data sets.
ORDER BY x: guarantees global ordering, but does this by pushing all data through just one reducer. This is basically unacceptable for large datasets. You end up with one sorted file as output.
SORT BY x: orders data at each of N reducers, but each reducer can receive overlapping ranges of data. You end up with N or more sorted files with overlapping ranges.
DISTRIBUTE BY x: ensures each of N reducers gets non-overlapping ranges of x, but doesn't sort the output of each reducer. You end up with N or more unsorted files with non-overlapping ranges.
CLUSTER BY x: ensures each of N reducers gets non-overlapping ranges, then sorts by those ranges at the reducers. This gives you global ordering, and is the same as doing DISTRIBUTE BY x followed by SORT BY x. You end up with N or more sorted files with non-overlapping ranges.
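A minimal sketch of the four clauses against the sample employees table:
-- ORDER BY: one reducer, globally sorted output
SELECT emp_no, hire_date FROM employees ORDER BY hire_date;
-- SORT BY: sorted within each reducer only
SELECT emp_no, hire_date FROM employees SORT BY hire_date;
-- DISTRIBUTE BY: rows routed to reducers by emp_no, not sorted
SELECT emp_no, hire_date FROM employees DISTRIBUTE BY emp_no;
-- CLUSTER BY: same as DISTRIBUTE BY emp_no SORT BY emp_no
SELECT emp_no, hire_date FROM employees CLUSTER BY emp_no;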
Build a JAR file containing the UDF and add it to the Hive session so the function can be used in queries.
hive> add jar customUDF.jar;
hive> create temporary function customUDF
as 'com.nitendragautam.udf.GenericUDF';
hive> select customUDF(col1, col2, col3) from partitioned_user limit 5;
Hive stores a list of partitions for each table in its metastore. If, however, new partitions are added directly to HDFS (for example, manually with the hadoop fs -put command), the metastore will not be aware of them. In that case you need to add each such partition explicitly:
ALTER TABLE <tablename> ADD PARTITION (<partition_column>='<value>');
Or, in short, you can run
MSCK REPAIR TABLE <tablename>;
which is available since Hive 0.11. It adds to the metastore any partitions that exist on HDFS but are not yet registered there.
Copy the partition folders and data to the table's folder, create a table with dynamic partitions on that folder, and run MSCK REPAIR TABLE on the new table.
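A sketch of that flow, assuming a table named events partitioned by a dt column (both names are hypothetical):
-- A new partition directory was added directly on HDFS, e.g.
--   hadoop fs -put dt=2017-10-05 /user/hive/warehouse/events/
-- Register one partition explicitly:
ALTER TABLE events ADD PARTITION (dt='2017-10-05');
-- Or let Hive discover every partition that is on HDFS but missing from the metastore:
MSCK REPAIR TABLE events;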
An index is a pointer on a particular column of a table; creating an index means creating such a pointer on that column.
CREATE INDEX indexed_salary ON TABLE employee (salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
This creates an index on the salary column.
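If the index is created with the optional WITH DEFERRED REBUILD clause, it stays empty until it is explicitly rebuilt; a short sketch:
CREATE INDEX indexed_salary ON TABLE employee (salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
ALTER INDEX indexed_salary ON employee REBUILD;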
The following are some of the aggregate functions in Hive; a short example query follows the list.
count(*), count(expr), count(DISTINCT expr): Returns the total row count, the count of rows for which the expression is non-NULL, or the count of distinct values of the expression.
sum(col), sum(DISTINCT col): Returns the sum of the column, or the sum of the distinct values of the column.
min(col): Returns the minimum of the column in the group.
max(col): Returns the maximum value of the column in the group.
variance(col), var_pop(col): Returns the variance of a numeric column in the group.
var_samp(col): Returns the unbiased sample variance of a numeric column in the group.
stddev_pop(col): Returns the standard deviation of a numeric column in the group.
stddev_samp(col): Returns the unbiased sample standard deviation of a numeric column in the group.
covar_pop(col1, col2): Returns the population covariance of a pair of numeric columns in the group.
covar_samp(col1, col2): Returns the sample covariance of a pair of numeric columns in the group.
corr(col1, col2): Returns the Pearson coefficient of correlation of a pair of numeric columns in the group.
percentile(BIGINT col, p): Returns the exact p-th percentile of a column in the group (does not work with floating point types). p must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.
percentile(BIGINT col, array(p1 [, p2]...)): Returns the exact percentiles p1, p2, ... of a column in the group (does not work with floating point types). Each pi must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.
histogram_numeric(col, b): Computes a histogram of a numeric column in the group using b non-uniformly spaced bins. The output is an array of size b of double-valued (x,y) coordinates that represent the bin centers and heights.
collect_set(col): Returns a set of objects with duplicate elements eliminated.
collect_list(col): Returns a list of objects with duplicates (as of Hive 0.13.0).
ntile(INTEGER x): Divides an ordered partition into x groups called buckets and assigns a bucket number to each row in the partition. This allows easy calculation of tertiles, quartiles, deciles, percentiles and other common summary statistics (as of Hive 0.11.0).
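For example, a few of these aggregates applied to the salaries table shown earlier (a sketch; column names follow the sample output above):
SELECT emp_no,
       count(*)                AS num_salary_records,
       min(salary)             AS min_salary,
       max(salary)             AS max_salary,
       sum(salary)             AS total_salary,
       percentile(salary, 0.5) AS median_salary
FROM salaries
GROUP BY emp_no;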
Cubes are the physical implementation of a dimensional data model. A cube captures the structure of the data model and organizes measures and dimensions in an optimal layout. Queries on cubes are highly efficient and can support online applications and dashboards.
Windowing allows you to define a window over a set of rows and compute aggregations over that window. Windowing was introduced in Hive 0.11.
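A minimal sketch of a windowed aggregation over the salaries table from the join example:
-- Running total of each employee's salary history, ordered by from_date
SELECT emp_no,
       from_date,
       salary,
       SUM(salary) OVER (PARTITION BY emp_no ORDER BY from_date) AS running_total
FROM salaries;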
Some of the analytic functions that Hive provides are RANK, DENSE_RANK, ROW_NUMBER, CUME_DIST, PERCENT_RANK, NTILE, LEAD, LAG, FIRST_VALUE, and LAST_VALUE.
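A short sketch of a couple of these analytic functions on the same salaries table:
-- Rank each salary record within an employee's history and compare it with the previous one
SELECT emp_no,
       from_date,
       salary,
       RANK() OVER (PARTITION BY emp_no ORDER BY salary DESC)       AS salary_rank,
       LAG(salary, 1) OVER (PARTITION BY emp_no ORDER BY from_date) AS previous_salary
FROM salaries;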