@kzhangkzhang
Last active April 13, 2024 03:27
Revisions

  1. kzhangkzhang revised this gist Mar 2, 2022. 1 changed file with 37 additions and 0 deletions.
    37 changes: 37 additions & 0 deletions BIGDATA_HIVE_Syntax.md
    @@ -432,6 +432,43 @@ set hive.exec.max.dynamic.partitions=1000;
    set hive.exec.max.dynamic.partitions.pernode=1000;
    ```

    ### Hive ACID Transactions: Insert/Update/Delete

    Pre-Requisites:

- Hive's transaction manager should be set to DbTxnManager: `SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;`
- We need to enable concurrency: `SET hive.support.concurrency=true;`
- Once we set the above properties, we should be able to insert data into any table.
- For updates and deletes, the table should be bucketed and the file format needs to be ORC or another ACID-compliant format.
- We also need to set the table property transactional to true: `TBLPROPERTIES ('transactional'='true');`
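These prerequisite properties can be applied per session before running the DML statements; a minimal sketch using exactly the values listed above:

```sql
-- session-level settings required before ACID insert/update/delete
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.support.concurrency=true;
```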

    ```sql
    CREATE TABLE orders_transactional (
    order_id INT,
    order_date STRING,
    order_customer_id INT,
    order_status STRING
    ) CLUSTERED BY (order_id) INTO 8 BUCKETS
    STORED AS orc
    TBLPROPERTIES('transactional' = 'true');

    INSERT INTO orders_transactional VALUES
    (1, '2013-07-25 00:00:00.0', 1000, 'COMPLETE');

    INSERT INTO orders_transactional VALUES
    (2, '2013-07-25 00:00:00.0', 2001, 'CLOSED'),
    (3, '2013-07-25 00:00:00.0', 1500, 'PENDING');

    UPDATE orders_transactional
    SET order_status = 'COMPLETE'
    WHERE order_status = 'PENDING';

    DELETE FROM orders_transactional
    WHERE order_status <> 'COMPLETE';

    SELECT * FROM orders_transactional;
    ```

    ### Select Statement

    #### Distinct
  2. kzhangkzhang revised this gist Feb 28, 2022. 1 changed file with 274 additions and 237 deletions.
    511 changes: 274 additions & 237 deletions BIGDATA_HIVE_Syntax.md
    @@ -1,4 +1,43 @@
# Big Data Cheat Sheet

    Support\Big Data\BigData_CheatSheet.md

    # HDFS

    ```bash
# list files in HDFS
hadoop fs -ls <PATH_OF_DIRECTORY>

# create a directory in HDFS
hadoop fs -mkdir <PATH_OF_DIRECTORY>

# create multi-level directories in HDFS
hadoop fs -mkdir -p <DIR1/DIR2/DIR3>

# copy files from the local file system into HDFS
hadoop fs -copyFromLocal <LOCAL_PATH> <HDFS_PATH>
hadoop fs -put <LOCAL_PATH> <HDFS_PATH>

# copy a file from HDFS to the local file system
hadoop fs -copyToLocal <HDFS_PATH> <LOCAL_PATH>
hadoop fs -get <HDFS_PATH> <LOCAL_PATH>

# copy or move a file from one HDFS location to another
hadoop fs -cp <SRC_HDFS_PATH> <TARGET_HDFS_PATH>
hadoop fs -mv <SRC_HDFS_PATH> <TARGET_HDFS_PATH>

# delete a file without moving it to trash
hadoop fs -rm -r -f -skipTrash <HDFS_PATH>

# get file system usage
hadoop fs -du -s -h <HDFS_PATH>
hadoop fs -df -h <HDFS_PATH>

# get metadata of files in HDFS
hdfs fsck <HDFS_PATH> -files -blocks -locations
    ```

    # HIVE

    ## General rule

    @@ -11,51 +50,54 @@
    ### Primitive Data Types

- Numeric
  - TINYINT, SMALLINT, INT, BIGINT
  - FLOAT
  - DOUBLE
  - DECIMAL

- Date/Time
  - TIMESTAMP
    - Strings must be in format "YYYY-MM-DD HH:MM:SS.fffffffff"
    - Integer types as UNIX timestamp in seconds from the UNIX epoch (1-JAN-1970 00:00:00)
    - Floating point types same as Integer with decimal precision
  - DATE
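As a quick illustration of the TIMESTAMP representations above (a sketch; the literal values are arbitrary):

```sql
-- string literal in "YYYY-MM-DD HH:MM:SS.fffffffff" format
SELECT CAST('2013-07-25 00:00:00.0' AS TIMESTAMP);

-- integer seconds since the UNIX epoch, rendered with from_unixtime
SELECT from_unixtime(1374710400);
```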

- String Data Types
  - STRING
  - VARCHAR
  - CHAR

- Misc.
  - BOOLEAN
  - BINARY
    - BINARY is an array of bytes, similar to VARBINARY in many RDBMSs. BINARY columns are stored within the record, not separately like BLOBs. We can include arbitrary bytes in a BINARY column, and these bytes are not parsed by Hive as numbers or strings.

    ### Complex/Collection Types

| Type | Syntax |
| ---------- | ----------------------------------------------------- |
| Arrays | ARRAY<data_type> |
| Maps | MAP<primitive_type,data_type> |
| Struct | STRUCT<col_name:data_type [COMMENT col_comment], ...> |
| Union Type | UNIONTYPE<data_type, data_type, ...> |

- Default delimiters for the fields in collection data types

| Delimiter | Code | Description |
| ------------- | ---- | -------------------------------------- |
| \n | \n | Record or row delimiter |
| ^A (Ctrl + A) | \001 | Field delimiter |
| ^B (Ctrl + B) | \002 | Element delimiter in ARRAY and STRUCTs |
| ^C (Ctrl + C) | \003 | Delimits key/value pairs in a MAP |

    - Example

```sql
    CREATE TABLE movies (
    movie_name string,
    participants ARRAY<string>,
    @@ -75,14 +117,15 @@ SELECT
    FROM
    movies;
    ```

    - sample output:

| x | x | x | x | x |
| ----------- | ------------------- | ----- | ------------ | ----------------- |
| "Inception" | 2010-07-16 00:00:00 | 91505 | "Dark Green" | {0:800} |
| "Planes" | 2013-08-09 00:00:00 | 91505 | "Green" | {2:[1.0,2.3,5.6]} |

```sql
    CREATE TABLE user (
    name STRING,
    id BIGINT,
    @@ -95,20 +138,18 @@ CREATE TABLE user (
    others UNIONTYPE<FLOAT,BOOLEAN,STRING>,
    misc BINARY
    )
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\001'
    COLLECTION ITEMS TERMINATED BY '\002'
    MAP KEYS TERMINATED BY '\003'
    LINES TERMINATED BY '\n';
    ```

    ## Basic Syntax

## Database

    - Create database

```sql
    CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
    [COMMENT some_comment]
    [LOCATION hdfs_path]
    @@ -117,26 +158,27 @@ CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name

    - Use database

```sql
    USE db_name;
    ```

    - Drop database

```sql
    DROP (DATABASE|SCHEMA) [IF EXISTS] database_name;
    ```

    - Show database

```sql
    SHOW DATABASES;
    ```

## Table

### Create table

```sql
    CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
    [(col_name data type [COMMENT col_comment], ...)]
    [PARTITIONED BY (col_name data type [COMMENT col_comment], ...)]
    @@ -145,289 +187,286 @@ CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
    [LOCATION hdfs_path]
    [TBLPROPERTIES (property_name=property_value, ...)];

-- External Example 1:
    CREATE EXTERNAL TABLE users (
    user_id INT,
    age INT,
    gender STRING,
    occupation STRING,
    zip_code STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION '/pluralsight/userinfo'
    ;

    -- For external table, you still can directly insert data into it via "INSERT INTO TABLE ..." syntax.
    -- This is same as managed table: You can use "INSERT INTO TABLE ..." to insert data to both managed table and external table
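-- For illustration (hypothetical values): inserting into the external table
-- above works exactly as it would for a managed table
INSERT INTO TABLE users VALUES (1, 25, 'M', 'engineer', '90210');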

    -- Create table **AS** Example 1
    -- New table will have data
    CREATE TABLE
    occupation_count STORED AS RCFile
    AS
    SELECT
    COUNT(*), occupation
    FROM users
    GROUP BY
    occupation;

-- Create table **LIKE** Example 1:
-- It will create a new table with no records (empty)
-- It is similar to Oracle "CREATE TABLE y AS SELECT * FROM y WHERE 1=0"
    CREATE TABLE
    occupation2
    LIKE occupation_count;

    ```

    ### Truncate table

    - there is no "IF NOT EXISTS" as part of truncate syntax
    - can only truncate managed table
    - can't truncate external table

    ```sql
    TRUNCATE TABLE table_name [PARTITION partition_spec];
    ```

    ### Drop table

    ```sql
    DROP TABLE IF EXISTS [db_name.]table_name;
    ```
### Show table

```sql
SHOW TABLES [IN database_name] ['identifier_with_wildcards'];

-- example:
SHOW TABLES "*example*";
```

### Describe table

```sql
    DESCRIBE [EXTENDED|FORMATTED] [db_name.]table_name;
    ```

    ### Show table DDL script


```sql
    SHOW CREATE TABLE [db_name.]table_name;
    ```

### Show execution plan for Hive query

```sql
EXPLAIN query;
    ```

    ---

    ### Load Data into table

- Load data from the local file system (the source files are **copied** into the table location).
- Load data from an HDFS location (the files in the source location will be **moved**, i.e. **deleted** from the source). One should have write permissions on the HDFS location.
- By default data will be appended; to overwrite, use the OVERWRITE clause.

    ```sql
    -- from local file system
    LOAD DATA LOCAL INPATH '/data/retail_db/orders' INTO TABLE orders;

    -- from HDFS location
    dfs -mkdir /user/training/retail_db;
    dfs -put /data/retail_db/orders /user/training/retail_db/.;
    LOAD DATA INPATH '/user/training/retail_db/orders' INTO TABLE orders;

    -- Overwrite existing data
    LOAD DATA LOCAL INPATH '/data/retail_db/orders' OVERWRITE INTO TABLE orders;
    ```

    ## Partitioning in Hive

    - Advantages

Partitioning is used to distribute execution load horizontally.
Because the data is stored in slices/parts, query response time is faster when only a small part of the data is processed instead of searching the entire data set.
For example, in a large user table partitioned by country, selecting users of country 'IN' will scan just the one directory 'country=IN' instead of all directories.
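The pruning just described can be illustrated with a query against a hypothetical `users_part` table partitioned by `country`:

```sql
-- only the directory country='IN' is scanned, not the whole table
SELECT user_id, city
FROM users_part
WHERE country = 'IN';
```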

    - Limitations

Having too many partitions in a table creates a large number of files and directories in HDFS, which is an overhead for the NameNode, since it must keep all metadata for the file system in memory.

Partitions may optimize some queries based on WHERE clauses, but may be less responsive for other important queries on grouping clauses.

In MapReduce processing, a huge number of partitions leads to a huge number of tasks (each running in a separate JVM) in each MapReduce job, which creates a lot of JVM start-up and tear-down overhead. For small files, a separate task is used for each file. In the worst case, the JVM start-up and tear-down overhead can exceed the actual processing time.

    ### Managed Partitioned Tables


```sql
CREATE TABLE table_name (col_name_1 data_type_1, ....)
    PARTITIONED BY (col_name_n data_type_n [COMMENT col_comment], ...);

    -- Note that we didn’t include country and state columns in table definition
    -- but included in partition definition.
    CREATE TABLE partitioned_user(
    firstname VARCHAR(64),
    lastname VARCHAR(64),
    address STRING,
    city VARCHAR(64),
    post STRING,
    phone1 VARCHAR(64),
    phone2 STRING,
    email STRING,
    web STRING
    )
    PARTITIONED BY (country VARCHAR(64), state VARCHAR(64))
    STORED AS SEQUENCEFILE;

    ```


### External Partitioned Tables

```sql
CREATE EXTERNAL TABLE table_name (col_name_1 data_type_1, ....)
    PARTITIONED BY (col_name_n data_type_n [COMMENT col_comment], ...);
    ```

- We can create external partitioned tables as well, just by using the EXTERNAL keyword in the CREATE statement.
- For an external partitioned table, we do not need a LOCATION clause at creation time, because we will specify the location of each partition separately while loading data into the table.
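A minimal sketch of that workflow, with a hypothetical `ext_user` table and path (the partition location is attached later, not at CREATE time):

```sql
CREATE EXTERNAL TABLE ext_user (name STRING)
PARTITIONED BY (country VARCHAR(64));

-- attach a location per partition when the data is loaded
ALTER TABLE ext_user ADD PARTITION (country = 'US')
LOCATION '/data/ext_user/country=US';
```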

    ### Inserting Data Into Partitioned Tables

    Data insertion into partitioned tables can be done in two modes.

    - Static Partitioning
    - Dynamic Partitioning

#### Static Partition in Hive

In this mode, input data should contain only the columns listed in the table definition, not the columns defined in the PARTITIONED BY clause.

    1. Loading Data into Managed Partitioned Table From Local FS

    ```sql
    -- Example
    LOAD DATA LOCAL INPATH '${env:HOME}/staticinput.txt'
    INTO TABLE partitioned_user
    PARTITION (country = 'US', state = 'CA');

    -- This will create separate directory under the default warehouse directory in HDFS.

    -- /user/hive/warehouse/partitioned_user/country=US/state=CA/

    -- Similarly we have to add other partitions, which will create corresponding directories in HDFS. Or else we can load the entire directory into Hive table with single command and can add partitions for each file with ALTER command.

    LOAD DATA LOCAL INPATH '${env:HOME}/inputdir'
    INTO TABLE partitioned_user;

    ```

    ##### Loading Partition From Other Table

    ```sql

    -- Example
    INSERT OVERWRITE TABLE partitioned_user
    PARTITION (country = 'US', state = 'AL')
    SELECT * FROM another_user au
    WHERE au.country = 'US' AND au.state = 'AL';
    ```

    ##### Overwriting Existing Partition

We can overwrite an existing partition with the help of the OVERWRITE INTO TABLE partitioned_user clause.

1. Loading Data into External Partitioned Table from HDFS

There is an alternative for bulk loading of partitions into a Hive table. As the data is already present in HDFS and should be made accessible by Hive, we just mention the locations of the HDFS files for each partition.

If our files are on the local FS, they can be moved to a directory in HDFS, and we can add a partition for each file in that directory with commands similar to the below.

```sql
ALTER TABLE partitioned_user ADD PARTITION (country = 'US', state = 'CA')
LOCATION '/hive/external/tables/user/country=us/state=ca';
```

Similarly, we need to repeat the above ALTER command for all partition files in the directory, so that a metadata entry is created in the metastore mapping each partition to the table.

#### Dynamic Partitioning in Hive

Instead of loading each partition with a single SQL statement as shown above, which would require writing a lot of SQL statements for a huge number of partitions, Hive supports dynamic partitioning, with which we can add any number of partitions in a single SQL execution. Hive will automatically split our data into separate partition files based on the values of the partition keys present in the input files.

For dynamic partition loading we do not provide the values for the partition keys.

    ```sql

INSERT INTO TABLE partitioned_user
PARTITION (country, state)
SELECT
  firstname,
  lastname,
  address,
  city,
  post,
  phone1,
  phone2,
  email,
  web,
  country,
  state
FROM temp_user;
    ```

- We can also **mix** dynamic and static partitions by specifying it as PARTITION(country = 'US', state). But **static partition keys must come before** the dynamic partition keys.

- By default, dynamic partitioning is disabled in Hive to prevent accidental partition creation. To use dynamic partitioning we need to set the properties below, either in the Hive shell or in the hive-site.xml file.
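Following that ordering rule, a mixed static/dynamic insert on the `partitioned_user` example might look like this (a sketch):

```sql
-- 'country' is a static partition key and must come before the dynamic key 'state'
INSERT INTO TABLE partitioned_user
PARTITION (country = 'US', state)
SELECT firstname, lastname, address, city, post,
       phone1, phone2, email, web, state
FROM temp_user
WHERE country = 'US';
```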

[Please check here for dynamic partition settings](https://gist.github.com/kzhangkzhang/7c33f779452348222561a31bd6faabcd)

We can set these through the Hive shell with the commands below:

    ```sql
    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;
    set hive.exec.max.dynamic.partitions=1000;
    set hive.exec.max.dynamic.partitions.pernode=1000;
    ```

    ### Select Statement

#### Distinct

```sql
SELECT DISTINCT
    col1,
    col2,
    col3
    FROM
    some_table;
    ```

#### Aliasing

```sql
    SELECT
    col1 + col2 AS col3
    FROM
    some_table;
    ```

#### REGEX Column Specification

```sql
SELECT
    '(ID|Name)?+.+'
    FROM
    some_table;
    ```

#### Interchangeable constructs

```sql
SELECT
    col1,
    col2,
    col3
    @@ -443,7 +482,7 @@ above is same as below

    FROM
    some_table
SELECT
    col1,
    col2,
    col3
    @@ -454,10 +493,10 @@ LIMIT

    ```

#### Sub queries & Union

```sql
SELECT
    t3.mycol
    FROM
    (
    @@ -480,18 +519,16 @@ JOIN t4 on (t4.col_x = t3.mycol);

    - execute host command within hive shell

```shell
    hive>!host_command;

for example:

hive>!clear;
hive>!pwd;
hive>!hdfs dfs -ls;
    ```

- show all Hive settings within hive shell

```shell
    hive>set;
    ```
  3. kzhangkzhang revised this gist Nov 7, 2018. 1 changed file with 11 additions and 1 deletion.
    12 changes: 11 additions & 1 deletion BIGDATA_HIVE_Syntax.md
    @@ -213,14 +213,24 @@ example:
    ```
    DESCRIBE [EXTENDED|FORMATTED] [db_name.]table_name;
    ```
    ****


    - Show table DDL script

    ```
    SHOW CREATE TABLE [db_name.]table_name;
    ```

    - Show execution plan for Hive query

    ```
    EXPLAIN query;
    ```

    ---



    ### Load Data into table

    - ```LOAD PATH INPATH [path]```
  4. kzhangkzhang revised this gist Nov 7, 2018. 1 changed file with 9 additions and 2 deletions.
    11 changes: 9 additions & 2 deletions BIGDATA_HIVE_Syntax.md
    @@ -160,6 +160,8 @@ External Example 1:
    LOCATION '/pluralsight/userinfo'
    ;
    ==> For external table, you still can directly insert data into it via "INSERT INTO TABLE ..." syntax. This is same as managed table: You can use "INSERT INTO TABLE ..." to insert data to __both__ managed table and external table
    Create table **AS** Example 1:
    CREATE TABLE
    occupation_count STORED AS RCFile
    @@ -211,7 +213,12 @@ example:
    ```
    DESCRIBE [EXTENDED|FORMATTED] [db_name.]table_name;
    ```
    ****
    - Show table DDL script

    ```
    SHOW CREATE TABLE [db_name.]table_name;
    ```
    ---

    ### Load Data into table
    @@ -307,7 +314,7 @@ Data insertion into partitioned tables can be done in two modes.
    ```
    2. Loading Partition From Other Table
    1. Loading Partition From Other Table
    ```
    -- Example
    @@ -321,7 +328,7 @@ Data insertion into partitioned tables can be done in two modes.
    We can overwrite an existing partition with help of OVERWRITE INTO TABLE partitioned_user clause.
    3. Loading Data into External Partitioned Table from HDFS
    1. Loading Data into External Partitioned Table from HDFS
    There is alternative for bulk loading of partitions into hive table. As data is already present in HDFS and should be made accessible by Hive, we will just mention the locations of the HDFS files for each partition.
  5. kzhangkzhang revised this gist Nov 6, 2018. 1 changed file with 70 additions and 9 deletions.
    79 changes: 70 additions & 9 deletions BIGDATA_HIVE_Syntax.md
    @@ -244,7 +244,7 @@ DESCRIBE [EXTENDED|FORMATTED] [db_name.]table_name;
    #### Managed Partitioned Tables

    ```
    CREATE [EXTERNAL] TABLE table_name (col_name_1 data_type_1, ....)
    CREATE TABLE table_name (col_name_1 data_type_1, ....)
    PARTITIONED BY (col_name_n data_type_n [COMMENT col_comment], ...);
    - Example
    @@ -261,21 +261,23 @@ PARTITIONED BY (col_name_n data_type_n [COMMENT col_comment], ...);
    email STRING,
    web STRING
    )
PARTITIONED BY (country VARCHAR(64), state VARCHAR(64))
STORED AS SEQUENCEFILE;
    ```

    ==> Note that we didn’t include country and state columns in table definition but included in partition definition.

    #### External Partitioned Tables

    CREATE EXTERNAL TABLE table_name (col_name_1 data_type_1, ....)
    PARTITIONED BY (col_name_n data_type_n [COMMENT col_comment], ...);

    We can create external partitioned tables as well, just by using the EXTERNAL keyword in the CREATE statement, but for creation of External Partitioned Tables, we do not need to mention LOCATION clause as we will mention locations of each partitions separately while inserting data into table.

    #### Inserting Data Into Partitioned Tables



    Data insertion into partitioned tables can be done in two modes.

    - Static Partitioning
    @@ -290,29 +292,88 @@ Data insertion into partitioned tables can be done in two modes.
    ```
    -- Example
    hive> LOAD DATA LOCAL INPATH '${env:HOME}/staticinput.txt'
INTO TABLE partitioned_user
PARTITION (country = 'US', state = 'CA');
    This will create separate directory under the default warehouse directory in HDFS.
    /user/hive/warehouse/partitioned_user/country=US/state=CA/
    Similarly we have to add other partitions, which will create corresponding directories in HDFS. Or else we can load the entire directory into Hive table with single command and can add partitions for each file with ALTER command.
    hive> LOAD DATA LOCAL INPATH '${env:HOME}/inputdir'
    INTO TABLE partitioned_user;
    ```
    2. Loading Partition From Other Table
    ```
    -- Example
hive> INSERT OVERWRITE TABLE partitioned_user
PARTITION (country = 'US', state = 'AL')
SELECT * FROM another_user au
    WHERE au.country = 'US' AND au.state = 'AL';
```
    ==> Overwriting Existing Partition
    We can overwrite an existing partition with help of OVERWRITE INTO TABLE partitioned_user clause.
    3. Loading Data into External Partitioned Table from HDFS
    There is alternative for bulk loading of partitions into hive table. As data is already present in HDFS and should be made accessible by Hive, we will just mention the locations of the HDFS files for each partition.
    If our files are on Local FS, they can be moved to a directory in HDFS and we can add partition for each file in that directory with commands similar to below.
    ```
    hive> ALTER TABLE partitioned_user ADD PARTITION (country = 'US', state = 'CA')
    LOCATION '/hive/external/tables/user/country=us/state=ca'
    ```
    Similarly we need to repeat the above alter command for all partition files in the directory so that a meta data entry will be created in metastore, mapping the partition and table.
    - Dynamic Partitioning in Hive
    Instead of loading each partition with single SQL statement as shown above, which will result in writing lot of SQL statements for huge no of partitions, Hive supports dynamic partitioning with which we can add any number of partitions with single SQL execution. Hive will automatically splits our data into separate partition files based on the values of partition keys present in the input files.
    For dynamic partition loading we will not provide the values for partition keys
    ```
    -- Example
    hive> INSERT INTO TABLE partitioned_user
          PARTITION (country, state)
          SELECT
            firstname,
            lastname,
            address,
            city,
            post,
            phone1,
            phone2,
            email,
            web,
            country,
            state
          FROM temp_user;
    ```
    ==> We can also __mix__ dynamic and static partitions by specifying the partition clause as PARTITION(country = 'US', state). But __static partition keys must come before__ the dynamic partition keys.
    ==> By default, dynamic partitioning is disabled in Hive to prevent accidental partition creation. To use dynamic partitioning we need to set the properties below, either in the Hive shell or in the hive-site.xml file.
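    As a sketch of this mixed mode (assuming the same partitioned_user and temp_user tables as above), the static key country is fixed in the PARTITION clause and therefore dropped from the SELECT list, while the dynamic key state stays at the end of it:
    ```
    hive> INSERT INTO TABLE partitioned_user
          PARTITION (country = 'US', state)
          SELECT
            firstname, lastname, address, city, post,
            phone1, phone2, email, web,
            state            -- only the dynamic partition key appears here
          FROM temp_user
          WHERE country = 'US';
    ```
    A mixed statement like this also satisfies strict mode, which requires at least one static partition key.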
    [Please check here for dynamic partition settings](https://gist.github.com/kzhangkzhang/7c33f779452348222561a31bd6faabcd)
    We can set these through the Hive shell with the commands below:
    ```
    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;
    set hive.exec.max.dynamic.partitions=1000;
    set hive.exec.max.dynamic.partitions.pernode=1000;
    ```
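    A property name passed to set without a value reads the current setting instead of writing it — a quick way to confirm the flags above took effect and to see which partitions a dynamic load actually created (assuming the earlier examples were run):
    ```
    -- Read a single property (no "=value" ==> print, not set)
    hive> set hive.exec.dynamic.partition;
    hive.exec.dynamic.partition=true
    -- List the partitions created by the load
    hive> SHOW PARTITIONS partitioned_user;
    ```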
    ---
    ### Select Statement