HiveSQL in Practice: Comparing the Efficiency of UNION ALL and GROUPING SETS

1. SQL examples

GROUPING SETS example:

SELECT
	shop_id AS shop_id,
	COALESCE(chan_cd, 999999) AS chan_cd,
	COALESCE(stat_ct, 999999) AS stat_ct,
	SUM(cust_qty) AS cust_qty
FROM
	app.app_zs_z0404_shop_all_chan_cust_feature_analysis_test
GROUP BY
	shop_id,
	chan_cd,
	stat_ct
GROUPING SETS ((shop_id, chan_cd, stat_ct), (shop_id, chan_cd), (shop_id));
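
One caveat with the COALESCE(..., 999999) pattern in this query: if chan_cd or stat_ct can be NULL in the source table, those detail rows become indistinguishable from the subtotal rows. A minimal sketch of one way around this, assuming Hive's GROUPING__ID virtual column is available in your version, is to keep the raw keys and expose the grouping ID next to them. The alias grouping_id is only illustrative, and the numeric values GROUPING__ID takes differ between Hive versions, so inspect the output before filtering on them.

```sql
-- Sketch: keep the raw keys and show which grouping set produced each row.
-- grouping_id is an illustrative alias, not a column of the source table.
SELECT
	shop_id,
	chan_cd,
	stat_ct,
	GROUPING__ID AS grouping_id,
	SUM(cust_qty) AS cust_qty
FROM
	app.app_zs_z0404_shop_all_chan_cust_feature_analysis_test
GROUP BY
	shop_id,
	chan_cd,
	stat_ct
GROUPING SETS ((shop_id, chan_cd, stat_ct), (shop_id, chan_cd), (shop_id));
```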

GROUPING SETS execution plan:

Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: app_zs_z0404_shop_all_chan_cust_feature_analysis_test
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: shop_id (type: bigint), chan_cd (type: int), stat_ct (type: bigint), cust_qty (type: bigint)
              outputColumnNames: shop_id, chan_cd, stat_ct, cust_qty
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Group By Operator
                aggregations: sum(cust_qty)
                keys: shop_id (type: bigint), chan_cd (type: int), stat_ct (type: bigint), '0' (type: string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3, _col4
                Statistics: Num rows: 3 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: bigint), _col1 (type: int), _col2 (type: bigint), _col3 (type: string)
                  sort order: ++++
                  Map-reduce partition columns: _col0 (type: bigint), _col1 (type: int), _col2 (type: bigint), _col3 (type: string)
                  Statistics: Num rows: 3 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  value expressions: _col4 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0)
          keys: KEY._col0 (type: bigint), KEY._col1 (type: int), KEY._col2 (type: bigint), KEY._col3 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col4
          Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          pruneGroupingSetId: true
          Select Operator
            expressions: _col0 (type: bigint), COALESCE(_col1,999999) (type: int), COALESCE(_col2,999999) (type: bigint), _col4 (type: bigint)
            outputColumnNames: _col0, _col1, _col2, _col3
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

UNION ALL example:

SELECT
	shop_id,
	999999 AS chan_cd,
	999999 AS stat_ct,
	SUM(cust_qty) AS cust_qty
FROM
	app.app_zs_z0404_shop_all_chan_cust_feature_analysis_test
GROUP BY
	shop_id

UNION ALL

SELECT
	shop_id,
	chan_cd,
	999999 AS stat_ct,
	SUM(cust_qty) AS cust_qty
FROM
	app.app_zs_z0404_shop_all_chan_cust_feature_analysis_test
GROUP BY
	shop_id,
	chan_cd

UNION ALL

SELECT
	shop_id,
	chan_cd,
	stat_ct,
	SUM(cust_qty) AS cust_qty
FROM
	app.app_zs_z0404_shop_all_chan_cust_feature_analysis_test
GROUP BY
	shop_id,
	chan_cd,
	stat_ct;

UNION ALL execution plan:

Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1, Stage-3, Stage-4
  Stage-3 is a root stage
  Stage-4 is a root stage
  Stage-0 depends on stages: Stage-2

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: app_zs_z0404_shop_all_chan_cust_feature_analysis_test
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: shop_id (type: bigint), cust_qty (type: bigint)
              outputColumnNames: shop_id, cust_qty
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Group By Operator
                aggregations: sum(cust_qty)
                keys: shop_id (type: bigint)
                mode: hash
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: bigint)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: bigint)
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  value expressions: _col1 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0)
          keys: KEY._col0 (type: bigint)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          Select Operator
            expressions: _col0 (type: bigint), 999999 (type: int), 999999 (type: int), _col1 (type: bigint)
            outputColumnNames: _col0, _col1, _col2, _col3
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            File Output Operator
              compressed: false
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            Union
              Statistics: Num rows: 3 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                expressions: _col0 (type: bigint), _col1 (type: int), _col2 (type: bigint), _col3 (type: bigint)
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 3 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 3 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
          TableScan
            Union
              Statistics: Num rows: 3 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                expressions: _col0 (type: bigint), _col1 (type: int), _col2 (type: bigint), _col3 (type: bigint)
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 3 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 3 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
          TableScan
            Union
              Statistics: Num rows: 3 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                expressions: _col0 (type: bigint), _col1 (type: int), _col2 (type: bigint), _col3 (type: bigint)
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 3 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 3 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: app_zs_z0404_shop_all_chan_cust_feature_analysis_test
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: shop_id (type: bigint), chan_cd (type: int), cust_qty (type: bigint)
              outputColumnNames: shop_id, chan_cd, cust_qty
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Group By Operator
                aggregations: sum(cust_qty)
                keys: shop_id (type: bigint), chan_cd (type: int)
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: bigint), _col1 (type: int)
                  sort order: ++
                  Map-reduce partition columns: _col0 (type: bigint), _col1 (type: int)
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  value expressions: _col2 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0)
          keys: KEY._col0 (type: bigint), KEY._col1 (type: int)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          Select Operator
            expressions: _col0 (type: bigint), _col1 (type: int), 999999 (type: int), _col2 (type: bigint)
            outputColumnNames: _col0, _col1, _col2, _col3
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            File Output Operator
              compressed: false
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-4
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: app_zs_z0404_shop_all_chan_cust_feature_analysis_test
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: shop_id (type: bigint), chan_cd (type: int), stat_ct (type: bigint), cust_qty (type: bigint)
              outputColumnNames: shop_id, chan_cd, stat_ct, cust_qty
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Group By Operator
                aggregations: sum(cust_qty)
                keys: shop_id (type: bigint), chan_cd (type: int), stat_ct (type: bigint)
                mode: hash
                outputColumnNames: _col0, _col1, _col2, _col3
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: bigint), _col1 (type: int), _col2 (type: bigint)
                  sort order: +++
                  Map-reduce partition columns: _col0 (type: bigint), _col1 (type: int), _col2 (type: bigint)
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  value expressions: _col3 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0)
          keys: KEY._col0 (type: bigint), KEY._col1 (type: int), KEY._col2 (type: bigint)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3
          Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

2. Conclusions

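The execution plans alone do not make it obvious which query is more efficient. They do show that the GROUPING SETS query completes in a single MapReduce job (Stage-1). It is unclear from the plan whether the different grouping IDs are processed in parallel or serially. The map-side Group By Operator carries an extra grouping-set key ('0') and its row estimate rises from 1 to 3, which suggests that each input row is expanded once per grouping set within that one job. For UNION ALL, the plan shows that the separate GROUP BY combinations can run concurrently: Stage-1, Stage-3 and Stage-4 are all root stages, and only Stage-2, which merges their outputs, depends on all three.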
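
Whether the independent root stages of the UNION ALL plan actually run at the same time depends on session settings. A minimal sketch, assuming a MapReduce-based Hive session; the thread count is illustrative and should be tuned to the cluster:

```sql
-- Allow stages that do not depend on each other
-- (here Stage-1, Stage-3 and Stage-4) to be submitted concurrently.
SET hive.exec.parallel=true;
-- Upper bound on the number of jobs run in parallel (illustrative value).
SET hive.exec.parallel.thread.number=8;
```

With hive.exec.parallel left at its default of false, the three aggregation stages are submitted one after another even though the plan lists them as root stages.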

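According to experience passed down by word of mouth from senior colleagues, UNION ALL achieves a higher degree of parallelism than GROUPING SETS and therefore runs more efficiently.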
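
Since this is hearsay rather than a measurement, one way to check it on a given cluster is to materialize both results and compare the elapsed time Hive reports for each statement. A minimal sketch follows; the scratch table names tmp_grouping_sets_result and tmp_union_all_result and the subquery alias u are hypothetical.

```sql
-- Hypothetical scratch tables; drop them after the comparison.
CREATE TABLE tmp_grouping_sets_result AS
SELECT
	shop_id AS shop_id,
	COALESCE(chan_cd, 999999) AS chan_cd,
	COALESCE(stat_ct, 999999) AS stat_ct,
	SUM(cust_qty) AS cust_qty
FROM
	app.app_zs_z0404_shop_all_chan_cust_feature_analysis_test
GROUP BY
	shop_id,
	chan_cd,
	stat_ct
GROUPING SETS ((shop_id, chan_cd, stat_ct), (shop_id, chan_cd), (shop_id));

-- The UNION ALL is wrapped in a subquery so that it also works on
-- Hive versions that do not accept a top-level UNION ALL.
CREATE TABLE tmp_union_all_result AS
SELECT u.shop_id, u.chan_cd, u.stat_ct, u.cust_qty
FROM (
	SELECT shop_id, 999999 AS chan_cd, 999999 AS stat_ct, SUM(cust_qty) AS cust_qty
	FROM app.app_zs_z0404_shop_all_chan_cust_feature_analysis_test
	GROUP BY shop_id
	UNION ALL
	SELECT shop_id, chan_cd, 999999 AS stat_ct, SUM(cust_qty) AS cust_qty
	FROM app.app_zs_z0404_shop_all_chan_cust_feature_analysis_test
	GROUP BY shop_id, chan_cd
	UNION ALL
	SELECT shop_id, chan_cd, stat_ct, SUM(cust_qty) AS cust_qty
	FROM app.app_zs_z0404_shop_all_chan_cust_feature_analysis_test
	GROUP BY shop_id, chan_cd, stat_ct
) u;

-- Sanity check: both tables should contain the same number of rows.
SELECT COUNT(*) FROM tmp_grouping_sets_result;
SELECT COUNT(*) FROM tmp_union_all_result;
```

Running the comparison several times, with and without hive.exec.parallel enabled, gives a fairer picture than a single run, because cluster load can dominate the timings.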