Introduction to Hive Query Language

29 December 2017

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query and analysis. Hive Query Language is SQL like query language which is used to query data from hive tables.

Create Database

First create the database testdb if does not exists and list the databases.

-- Create Database if does not exists
CREATE DATABASE IF NOT EXISTS testdb;

-- Use the newly craete Database
USE testdb;

-- List all the databases
SHOW DATABASES;
Create Database

Upload Data in HDFS

Upload the drivers.csv and ‘'’timesheet data into HDFS(/user/maria-dev/) from the below data source

Data Source

Now Upload the Files

Upload File 1st Step
Upload File 2nd Step

Create Internal Table for Drivers Data

As database is recenly created ,no tables exists in the new database. Following query will result in a empty result.

-- Show all the Tables
SHOW TABLES;
Show Tables

Now Create Table using the Hive DDL statement.

-- Create Table drivers if does not exists

CREATE TABLE IF NOT EXISTS drivers_int(driverId INT ,name STRING ,ssn BIGINT, location STRING ,certified STRING ,wageplan STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES('skip.header.line.count' = '1');

SHOW TABLES;

In this DDL statement notice the folloing things

  • FIELDS TERMINATED BY ','
    It means thad the data for this table will be seerated by comma or the data will be in csv format

  • STORED AS TEXTFILE

  • ROW FORMAT DELIMITED

  • TBLPROPERTIES

Create Tables

Describe Tables Schema

-- Describes Table
DESCRIBE EXTENDED drivers_int;

Describe Table Schema

Load the data by overwriting the existing data

-- Insert the drivers_int table with the drivers.csv data by overwriting existing data 
LOAD DATA INPATH '/user/maria_dev/drivers.csv' OVERWRITE INTO TABLE drivers_int;

--List the data from Tables
SELECT * FROM temp_drivers LIMIT 10; 
Load Table Data

After executing LOAD DATA we can see table drivers_int was populated with data from drivers.csv.

Note that Hive consumed the data file drivers.csv during this step. If you look in the File Browser you will see drivers.csv is no longer there.

As we are creating internal table ,data from HDFS will be moved to Hive ware house.

Create table driver time sheet


-- Create Table TimeSheet
CREATE TABLE IF NOT EXISTS timesheet_int( driverId INT ,week INT ,hours_logged INT ,miles_loggged INT)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES('skip.header.line.count' = '1');


SHOW TABLES;
Create Table Time Sheet

Load CSV data into Table time sheet

-- Load data into Time Sheet Table
LOAD DATA INPATH '/user/maria_dev/timesheet.csv' OVERWRITE INTO TABLE timesheet_int;

-- List the data from TimeSheet Table
SELECT * FROM timesheet_int LIMIT 10;
Load Time Sheet Data

Find the sum of hours and miles logged score for an year fro each driver

SELECT driverId ,sum(hours_logged) ,sum(miles_loggged) FROM timesheet_int GROUP BY driverId LIMIT 10;
Driver yearly hours and miles sum

Join the two tables to get the drivers Name for aggregrated data


-- Join the two tables to get the aggregrated hours and miles data along with the respective driver name

SELECT d.driverId, d.name ,t.total_hours ,t.total_miles
FROM drivers_int d
JOIN
(SELECT driverId ,sum(hours_logged) total_hours ,sum(miles_loggged) total_miles
 FROM timesheet_int
 GROUP BY driverId) t
 
 ON (d.driverId = t.driverId)
 LIMIT 10;
Join two Tables

Reference

[1] Data Processing with Hive

Share: Twitter Facebook Google+ LinkedIn
comments powered by Disqus