Apache Hive Query Language

5 October 2017

Changing the default Metastore in Hive

Even though the embedded Derby database is the default metastore in Hive, we can change it (for example, to MySQL) by editing hive-site.xml in the Hive installation's conf directory.

<configuration>
  <property>
    <name>hive.metastore.local</name>
    <value>true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://<ip address of mysqlmachine>/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>username</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
  </property>
</configuration>

Hive Queries

Table definition in Hive

describe <table name>; 

Detailed information about the table

describe extended <tablename>; 

SELECT Unique Records in Hive

Consider the sample employees table below:

 hive> SELECT * FROM employees LIMIT 10;
+--------+------------+------------+-----------+--------+------------+
| emp_no | birth_date | first_name | last_name | gender | hire_date  |
+--------+------------+------------+-----------+--------+------------+
|  10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |
|  10002 | 1964-06-02 | Bezalel    | Simmel    | F      | 1985-11-21 |
|  10003 | 1959-12-03 | Parto      | Bamford   | M      | 1986-08-28 |
|  10004 | 1954-05-01 | Chirstian  | Koblick   | M      | 1986-12-01 |
|  10005 | 1955-01-21 | Kyoichi    | Maliniak  | M      | 1989-09-12 |
|  10006 | 1953-04-20 | Anneke     | Preusig   | F      | 1989-06-02 |
|  10007 | 1957-05-23 | Tzvetan    | Zielinski | F      | 1989-02-10 |
|  10008 | 1958-02-19 | Saniya     | Kalloufi  | M      | 1994-09-15 |
|  10009 | 1952-04-19 | Sumant     | Peac      | F      | 1985-02-18 |
|  10010 | 1963-06-01 | Duangkaew  | Piveteau  | F      | 1989-08-24 |
+--------+------------+------------+-----------+--------+------------+
SELECT DISTINCT birth_date, first_name, last_name, gender, hire_date
FROM employees;

Change Column Data Type in Hive

ALTER TABLE <Table Name> CHANGE <column name> <column name> <New Data Type>;

-- Example
ALTER TABLE Employee CHANGE salary salary BIGINT;

Rename a Table in Hive

ALTER TABLE hive_table_name RENAME TO new_table_name;

Hive Join

  • Inner Join

It returns only the rows that have matching values in the join columns of both tables.

// Joining two tables on common emp_no
hive> SELECT * FROM employees e JOIN salaries s ON (e.emp_no = s.emp_no) LIMIT 5;

+--------+------------+------------+-----------+--------+------------+--------+--------+------------+------------+
| emp_no | birth_date | first_name | last_name | gender | hire_date  | emp_no | salary | from_date  | to_date    |
+--------+------------+------------+-----------+--------+------------+--------+--------+------------+------------+
|  10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |  10001 |  60117 | 1986-06-26 | 1987-06-26 |
|  10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |  10001 |  62102 | 1987-06-26 | 1988-06-25 |
|  10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |  10001 |  66074 | 1988-06-25 | 1989-06-25 |
|  10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |  10001 |  66596 | 1989-06-25 | 1990-06-25 |
|  10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |  10001 |  66961 | 1990-06-25 | 1991-06-25 |
+--------+------------+------------+-----------+--------+------------+--------+--------+------------+------------+
  • Left outer Join

It returns all the rows from the left table, even if there are no matches in the right table. For left-table rows with no match, the columns from the right table are returned as NULL.

 SELECT * FROM employees e LEFT OUTER JOIN salaries s ON (e.emp_no = s.emp_no) LIMIT 5;

CRUD Operation in Hive

As of Hive 0.14.0, INSERT…VALUES, UPDATE, and DELETE are available with full ACID support.

INSERT … VALUES Syntax:

INSERT INTO TABLE <tablename> [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]

Where values_row is: ( value [, value …] ) where a value is either null or any valid SQL literal

UPDATE Syntax:

UPDATE <tablename> SET column = value [, column = value ...] [WHERE expression]

DELETE Syntax:

DELETE FROM <tablename> [WHERE expression]

Additionally, from the Hive Transactions doc:

If a table is to be used in ACID writes (insert, update, delete) then the table property “transactional” must be set on that table, starting with Hive 0.14.0. Without this value, inserts will be done in the old style; updates and deletes will be prohibited.
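As a sketch of what this looks like in practice (the employee_txn table and its columns are hypothetical), an ACID-capable table in Hive 0.14 must be stored as ORC, bucketed, and marked transactional:

```sql
-- Hypothetical example: ACID writes require a bucketed ORC table
-- with the "transactional" table property set.
CREATE TABLE employee_txn (
  emp_no INT,
  name   STRING,
  salary BIGINT
)
CLUSTERED BY (emp_no) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

INSERT INTO TABLE employee_txn VALUES (10001, 'Georgi Facello', 60117);
UPDATE employee_txn SET salary = 62102 WHERE emp_no = 10001;
DELETE FROM employee_txn WHERE emp_no = 10001;
```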

Retrieve Maximum Salary in Hive Table

Let us suppose we have a table employee with name and salary columns.

  • Second Highest Salary

    SELECT *
    FROM (SELECT salary, ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_no
          FROM employee GROUP BY salary) res
    WHERE res.row_no = 2;

  • Third Highest Salary

    SELECT *
    FROM (SELECT salary, ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num
          FROM employee GROUP BY salary) res
    WHERE res.row_num = 3;

Preventing a large job from running for a long time

This can be done by running MapReduce jobs in strict mode with set hive.mapred.mode=strict; Strict mode ensures that queries on partitioned tables cannot execute without a WHERE clause that restricts the partitions to scan.
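A minimal sketch (the partitioned_user table is reused from the UDF example later in this post, and its country partition column is assumed for illustration):

```sql
set hive.mapred.mode=strict;

-- Rejected in strict mode: no filter on the partition column
-- SELECT * FROM partitioned_user;

-- Allowed: the WHERE clause limits the partitions scanned
SELECT * FROM partitioned_user WHERE country = 'US';
```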

ORDER BY vs SORT BY vs DISTRIBUTE BY vs CLUSTER BY in Hive

The ORDER BY clause is familiar from other SQL dialects. It performs a total ordering of the query result set. This means that all the data is passed through a single reducer, which may take an unacceptably long time to execute for larger data sets.

ORDER BY x: guarantees global ordering, but does this by pushing all data through just one reducer. This is basically unacceptable for large datasets. You end up with one sorted file as output.

SORT BY x: orders data at each of N reducers, but each reducer can receive overlapping ranges of data. You end up with N or more sorted files with overlapping ranges.

DISTRIBUTE BY x: ensures each of N reducers gets non-overlapping ranges of x, but doesn’t sort the output of each reducer. You end up with N or more unsorted files with non-overlapping ranges.

CLUSTER BY x: ensures each of N reducers gets non-overlapping ranges, then sorts by those ranges at the reducers. This gives you global ordering, and is the same as doing (DISTRIBUTE BY x and SORT BY x). You end up with N or more sorted files with non-overlapping ranges.
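The equivalence can be written directly (using the salaries table from the join example above):

```sql
-- These two queries produce the same distribution and per-reducer ordering
SELECT emp_no, salary FROM salaries CLUSTER BY emp_no;

SELECT emp_no, salary FROM salaries DISTRIBUTE BY emp_no SORT BY emp_no;
```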

Types of UDF in Hive

  • User Defined Functions (UDF)
  • User Defined Aggregate Functions (UDAF)
  • User Defined Table-Generating Functions (UDTF)

Developing and Deploying a UDF (User Defined Function)

Build a jar file containing the UDF class, add it to the Hive session, and register it as a temporary function:

hive> add jar customUDF.jar;
hive> create temporary function customUDF 
as 'com.nitendragautam.udf.GenericUDF';
hive> select customUDF(col1, col2, col3) from partitioned_user limit 5;

Recover Partition in Hive

Hive stores a list of partitions for each table in its metastore. If, however, new partitions are added directly to HDFS (for example, manually with the hadoop fs -put command), the metastore will not be aware of them unless you add each partition yourself:

ALTER TABLE <tablename> ADD PARTITION (<partition column>='<value>');

Or, in short, you can run:

MSCK REPAIR TABLE <tablename>; 

Available since Hive 0.11, it will add to the metastore any partitions that exist on HDFS but not in the metastore.

To recover partitions into a new table: copy the partition folders and data into the new table's folder, create a table with dynamic partitions over that folder, and run MSCK REPAIR TABLE on the new table.

Creating Indexes in Hive

An Index is a pointer on a particular column of a table. Creating an index means creating a pointer on a particular column of a table.

CREATE INDEX indexed_salary ON TABLE employee (salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

This creates an index on the salary column.
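If the index is created WITH DEFERRED REBUILD, its data is not materialized until it is rebuilt explicitly; either way, an index can be rebuilt after base-table changes and dropped when no longer needed, a short sketch:

```sql
-- Materialize (or refresh) the index data
ALTER INDEX indexed_salary ON employee REBUILD;

-- Drop the index when it is no longer needed
DROP INDEX indexed_salary ON employee;
```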

Aggregate Function in Hive

Following are some of the aggregate functions in Hive.

count(*), count(expr), count(DISTINCT expr[, expr...])

Returns the total number of rows, the number of rows for which expr is non-NULL, and the number of distinct values of expr, respectively

sum(col), sum(DISTINCT col)

Gives the sum of Column and sum of distinct column

min(col)

Returns the minimum of the column in the group

max(col)

Returns the maximum value of the column in the group

variance(col), var_pop(col)

Returns the variance of a numeric column in the group

var_samp(col)

Returns the unbiased sample variance of a numeric column in the group

stddev_pop(col)

Returns the standard deviation of a numeric column in the group

stddev_samp(col)

Returns the unbiased sample standard deviation of a numeric column in the group

covar_pop(col1, col2)

Returns the population covariance of a pair of numeric columns in the group

covar_samp(col1, col2)

Returns the sample covariance of a pair of numeric columns in the group

corr(col1, col2)

Returns the Pearson coefficient of correlation of a pair of numeric columns in the group

percentile(BIGINT col, p)

Returns the exact pth percentile of a column in the group (does not work with floating point types). p must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.

percentile(BIGINT col, array(p1 [, p2]…))

Returns the exact percentiles p1, p2, … of a column in the group (does not work with floating point types). pi must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.

histogram_numeric(col, b)

Computes a histogram of a numeric column in the group using b non-uniformly spaced bins. The output is an array of size b of double-valued (x,y) coordinates that represent the bin centers and heights.

collect_set(col)

Returns a set of objects with duplicate elements eliminated

collect_list(col)

Returns a list of objects with duplicates (as of Hive 0.13.0)

NTILE

This function divides an ordered partition into x groups called buckets and assigns a bucket number to each row in the partition. This allows easy calculation of tertiles, quartiles, deciles, percentiles and other common summary statistics. (As of Hive 0.11.0.).
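For instance, several of these aggregates can be combined in a single query over the sample employees table shown earlier:

```sql
SELECT gender,
       count(*)                  AS num_employees,
       count(DISTINCT last_name) AS distinct_surnames,
       min(hire_date)            AS earliest_hire,
       max(hire_date)            AS latest_hire
FROM employees
GROUP BY gender;
```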

Cube Dimensions in Hive

Cubes are the physical implementations of dimensional data model. A cube captures the structure in the data model and organizes measures and dimensions in an optimal layout. Queries on cubes are highly efficient and can support online applications and dashboards.
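In Hive itself, cube-style aggregations can be computed with GROUP BY ... WITH CUBE, which aggregates over every combination of the listed dimensions (columns from the sample employees table are used for illustration):

```sql
-- One row per (gender, hire_date) pair, plus subtotals per gender,
-- per hire_date, and a grand total (NULL marks the rolled-up dimension)
SELECT gender, hire_date, count(*) AS cnt
FROM employees
GROUP BY gender, hire_date WITH CUBE;
```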

Windowing functions in Hive

Windowing allows you to define a window over a set of rows and compute aggregations over that window. Windowing was introduced in Hive 0.11.

Some of the analytics functions that Hive provides are given below.

Analytics functions

  • RANK
  • ROW_NUMBER
  • DENSE_RANK
  • CUME_DIST
  • PERCENT_RANK
  • NTILE
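A short sketch using the employees and salaries tables from the join example above (RANK leaves gaps after ties, ROW_NUMBER does not):

```sql
SELECT e.emp_no,
       e.gender,
       s.salary,
       RANK()       OVER (PARTITION BY e.gender ORDER BY s.salary DESC) AS salary_rank,
       ROW_NUMBER() OVER (PARTITION BY e.gender ORDER BY s.salary DESC) AS row_num
FROM employees e
JOIN salaries s ON (e.emp_no = s.emp_no);
```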

References

[1] Hive Confluence

[2] Partitioning and Bucketing in Hive
