Hive-based Enterprise Data Warehouse (EDW)


Prerequisites

  1. Establishing a complete Hadoop ecosystem, for example with Ambari.

    For how to set up a complete Hadoop ecosystem with Ambari, see the posts “Step by Step Tutorial for Ambari Installation” or “Tutorial for Latest Ambari (2.7.1)”.

  2. Installing Hive as the enterprise data warehouse (EDW).

  3. Providing a visual Hive query tool such as Hue.

  4. Installing NiFi for scheduling and processing data flows on a recurring basis.

Technical Architecture

[Diagram: Cloud-based EDW]

Business Requirements

  1. Sync order data from the external MySQL database into Hive daily
  2. Compute the order amount by order ID daily
  3. Compute the total order amount daily
  4. Sync the total order amount back to the external MySQL database daily

Preparations

  1. Creating the databases and tables in Hive (inside the Hadoop ecosystem) and in the external MySQL database separately, as sketched below.
  2. Inserting several order records into the external MySQL database for syncing.
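
As a concrete starting point, here is a minimal, hypothetical sketch of this preparation step. The column names, types, and sample values are assumptions inferred from the Hive queries used later in this post, not the exact DDL of the original environment.

    -- Hypothetical source table in the external MySQL database
    CREATE TABLE test.sephora_order (
      order_num     VARCHAR(64),
      province      VARCHAR(64),
      city          VARCHAR(64),
      region        VARCHAR(64),
      address       VARCHAR(255),
      order_amount  VARCHAR(32)
    );

    -- A few sample order rows to sync
    INSERT INTO test.sephora_order VALUES
      ('SO20191206001', 'Shanghai', 'Shanghai', 'Pudong',   'No.1 Century Ave', '199.00'),
      ('SO20191206002', 'Beijing',  'Beijing',  'Chaoyang', 'No.8 Jianguo Rd',  '88.50');

    -- Hypothetical Hive staging table, partitioned by day and stored as ORC;
    -- the Hive streaming used later generally requires a transactional ORC table.
    CREATE TABLE default.s_shop_sephora_order_delta_orc (
      order_num     STRING,
      province      STRING,
      city          STRING,
      region        STRING,
      address       STRING,
      order_amount  STRING
    )
    PARTITIONED BY (dt STRING)
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');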

Design of the NiFi Flows

[Screenshot: 2019-12-06 at 13.27]

First, we can create a processor group to package all the processors together for easier management.

[Screenshot: 2019-12-06 at 10.00]

Based on the business requirements, I split the work into three main tasks: syncing data from the external database, computing and analyzing the data, and syncing the results back to the external database.

[Screenshot: 2019-12-06 at 10.01]

The image above shows the four main NiFi processor groups, each of which contains a series of NiFi processors. I will illustrate all four processor groups separately.

NiFi Group 1 : Group_SynOrdersFromMySQL

[Screenshot: 2019-12-06 at 10.07]

This flow, which consists of the ReplaceText, PutHive3QL, ExecuteSQL, and PutHive3Streaming processors, shows how to pull data from the external MySQL database. Before showing the configuration of each processor, I want to show a chart that explains the tasks of this flow.

[Screenshot: 2019-12-06 at 10.21]

Next, I will walk through the configuration of each processor in each task step by step.

  1. Dropping the Hive table partition that belongs to the current day.

    This task contains two processors - ReplaceText and PutHive3QL.

    1. Processor ReplaceText

      This processor replaces a matched text segment with an alternative value. But why do we need it in task 1? Because the next processor does not accept a SQL statement typed directly into its configuration, we need a FlowFile to act as a container that carries the SQL to the Hive processor.

      Add the DROP statement to the empty “Replacement Value” property.

      [Screenshot: 2019-12-06 at 10.30]

      Drop SQL:

      ALTER TABLE s_shop_sephora_order_delta_orc DROP IF EXISTS PARTITION(dt='${now():format('yyyyMMdd')}');

      The built-in expression ${now():format('yyyyMMdd')} returns the current day code.
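
      For example, if the flow runs on 2019-12-06, the statement that actually reaches the next processor is:

      ALTER TABLE s_shop_sephora_order_delta_orc DROP IF EXISTS PARTITION(dt='20191206');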

    2. Processor PutHive3QL

      PutHive3QL connects to Hive and executes the SQL carried in the FlowFile generated by the previous processor.

      [Screenshot: 2019-12-06 at 10.38]

      Click the right arrow in the screenshot above to configure the Hive3ConnectionPool.

      If you do not have that connection pool, you need to create a new one.

      [Screenshot: 2019-12-06 at 10.40]

      The configuration of this connection pool is shown below.

      [Screenshot: 2019-12-06 at 10.40]

  2. Querying order data from the external database.

    Use the ExecuteSQL processor; its configuration is shown below.

    [Screenshot: 2019-12-06 at 10.51]

    The DBCPConnectionPool configuration is shown below.

    [Screenshot: 2019-12-06 at 10.54]

    Querying SQL:

    select * from test.sephora_order where 1=1
  3. Saving data to a specific Hive table.

    Use the PutHive3Streaming processor to insert the dataset into a specific Hive table.

    [Screenshot: 2019-12-06 at 10.56]

    Configure an AvroReader as the record reader, because the previous processor outputs Avro records.

    [Screenshot: 2019-12-06 at 10.58]

At this point, we can run and test the whole flow to make sure it works correctly.
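
One quick way to verify the sync, for example from Hue, is to count the rows that landed in today's partition (the date below is just an example):

    -- assuming the flow ran on 2019-12-06
    select count(*) from default.s_shop_sephora_order_delta_orc where dt = '20191206';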

NiFi Group 2 : Group_ComputePaymentNumberByOrder

[Screenshot: 2019-12-06 at 11.03]

In this flow, we just use processors to execute Hive SQL for computing and analyzing the data.

  1. Processor ReplaceText

    Just like the ReplaceText processor in the previous group, this processor is only used to carry the SQL statement.

    [Screenshot: 2019-12-06 at 11.08]

    Computing SQL:

    insert overwrite table default.temp_order_result_orc partition(dt='${now():format('yyyyMMdd')}')
    select order_num,province,city,region,address,(cast(order_amount AS FLOAT) + 1) from default.s_shop_sephora_order_delta_orc where dt='${now():format('yyyyMMdd')}';

    (A sketch of the temp_order_result_orc target table follows this list.)
  2. Processor PutHive3QL

    [Screenshot: 2019-12-06 at 11.11]
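
The computing SQL in step 1 writes into default.temp_order_result_orc, so that table has to exist before the flow runs. A minimal, hypothetical sketch follows; the column names and types are assumptions inferred from the SELECT list above.

    CREATE TABLE default.temp_order_result_orc (
      order_num  STRING,
      province   STRING,
      city       STRING,
      region     STRING,
      address    STRING,
      total      DOUBLE   -- receives (cast(order_amount AS FLOAT) + 1)
    )
    PARTITIONED BY (dt STRING)
    STORED AS ORC;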

NiFi Group 3 : Group_ConvergePaymentNumByOrder

[Screenshot: 2019-12-06 at 11.12]

In this step, we need to compute the total order amount.

  1. Processor ReplaceText

    The configuration is shown below.

    [Screenshot: 2019-12-06 at 11.14]

    Computing SQL:

    insert overwrite table t_order_convergence_orc partition(dt='${now():format('yyyyMMdd')}')
    select order_num,sum(total) from temp_order_result_orc where dt='${now():format('yyyyMMdd')}' group by order_num;

    (A sketch of the t_order_convergence_orc target table follows this list.)
  2. Processor PutHive3QL

    The configuration is shown below.

    [Screenshot: 2019-12-06 at 11.15]
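
The converging SQL in step 1 writes into t_order_convergence_orc, which also has to exist beforehand. A minimal, hypothetical sketch follows; the column names are assumptions based on the INSERT above and the query in the next group, and Hive maps the inserted columns by position, so the first column receives order_num.

    CREATE TABLE default.t_order_convergence_orc (
      order_num_str  STRING,   -- filled positionally with order_num
      total          DOUBLE    -- filled with sum(total)
    )
    PARTITIONED BY (dt STRING)
    STORED AS ORC;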

NiFi Group 4 : Group_SynTotalPaymentNum2MySQL

This is the last step of the whole POC, and the overview flow chart is shown below.

[Screenshot: 2019-12-06 at 11.18]

In this step, I split the work into five processors that query the result data from the Hive table and save it to the specific external MySQL table. Next, I will describe the function and configuration of each processor respectively.

  1. Processor SelectHive3QL

    Query the analysis result data from the Hive table and set the output format to “Avro”.

    [Screenshot: 2019-12-06 at 11.30]

    [Screenshot: 2019-12-06 at 11.30]

    Since this processor only offers two output formats, Avro and CSV, we choose Avro and convert it to SQL in the following steps.

    Querying Data:

    select order_num_str,total,CURRENT_TIMESTAMP as update_time from default.t_order_convergence_orc where dt = '${now():format('yyyyMMdd')}'
  2. Processor SplitAvro

    In this processor, we split the Avro dataset so that each downstream FlowFile carries a single record, which is what the later JSON-to-SQL conversion expects.

    [Screenshot: 2019-12-06 at 11.45]

  3. Processor ConvertAvroToJSON

    Before converting to SQL, we first need to convert the Avro records to JSON.

    [Screenshot: 2019-12-06 at 11.47]

  4. Processor ConvertJSONToSQL

    Specify the table name of the external MySQL database in this processor so that the generated SQL targets the right table.

    [Screenshot: 2019-12-06 at 11.47]

  5. Processor PutSQL

    This processor executes the SQL statements generated by the previous processor against the external MySQL database; a sketch of the resulting statements follows below.

    [Screenshot: 2019-12-06 at 12.46]
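
To make these last steps concrete, here is a hypothetical sketch of the external MySQL side. The target table name and the sample values are assumptions; ConvertJSONToSQL actually produces a parameterized statement whose values travel as FlowFile attributes, but logically the result PutSQL executes is equivalent to the INSERT below.

    -- Hypothetical result table in the external MySQL database
    CREATE TABLE test.t_order_convergence (
      order_num_str  VARCHAR(64),
      total          DOUBLE,
      update_time    TIMESTAMP
    );

    -- For a JSON record such as {"order_num_str":"SO20191206001","total":200.0,"update_time":"2019-12-06 12:00:00"}
    INSERT INTO test.t_order_convergence (order_num_str, total, update_time)
    VALUES ('SO20191206001', 200.0, '2019-12-06 12:00:00');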

Summary

To sum up, we can use NiFi as a scheduling tool to process data periodically, since NiFi provides abundant processors for distinct processing needs.

At this point, we can run and test the whole processor group.