HiveSQL实战积累_unionall与groupingsets效率比较

HiveSQL实战积累_unionall与groupingsets效率比较

1.SQL实例

groupingsets实例:

1
2
3
4
5
6
7
8
9
10
11
SELECT
shop_id AS shop_id,
COALESCE(chan_cd, 999999) AS chan_cd,
COALESCE(stat_ct, 999999) AS stat_ct,
SUM(cust_qty) AS cust_qty
FROM
app.app_zs_z0404_shop_all_chan_cust_feature_analysis_test
GROUP BY
shop_id,
chan_cd,
stat_ct grouping sets((shop_id, chan_cd, stat_ct),(shop_id, chan_cd),(shop_id));

groupingsets执行计划:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1

STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: app_zs_z0404_shop_all_chan_cust_feature_analysis_test
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: shop_id (type: bigint), chan_cd (type: int), stat_ct (type: bigint), cust_qty (type: bigint)
outputColumnNames: shop_id, chan_cd, stat_ct, cust_qty
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Group By Operator
aggregations: sum(cust_qty)
keys: shop_id (type: bigint), chan_cd (type: int), stat_ct (type: bigint), '0' (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Statistics: Num rows: 3 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: bigint), _col1 (type: int), _col2 (type: bigint), _col3 (type: string)
sort order: ++++
Map-reduce partition columns: _col0 (type: bigint), _col1 (type: int), _col2 (type: bigint), _col3 (type: string)
Statistics: Num rows: 3 Data size: 0 Basic stats: PARTIAL Column stats: NONE
value expressions: _col4 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: sum(VALUE._col0)
keys: KEY._col0 (type: bigint), KEY._col1 (type: int), KEY._col2 (type: bigint), KEY._col3 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2, _col4
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
pruneGroupingSetId: true
Select Operator
expressions: _col0 (type: bigint), COALESCE(_col1,999999) (type: int), COALESCE(_col2,999999) (type: bigint), _col4 (type: bigint)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink

unionall实例:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
SELECT
shop_id,
999999 AS chan_cd,
999999 AS stat_ct,
SUM(cust_qty) AS cust_qty
FROM
app.app_zs_z0404_shop_all_chan_cust_feature_analysis_test
GROUP BY
shop_id

UNION ALL

SELECT
shop_id,
chan_cd,
999999 AS stat_ct,
SUM(cust_qty) AS cust_qty
FROM
app.app_zs_z0404_shop_all_chan_cust_feature_analysis_test
GROUP BY
shop_id,
chan_cd

UNION ALL

SELECT
shop_id,
chan_cd,
stat_ct,
SUM(cust_qty) AS cust_qty
FROM
app.app_zs_z0404_shop_all_chan_cust_feature_analysis_test
GROUP BY
shop_id,
chan_cd,
stat_ct

unionall执行计划:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1, Stage-3, Stage-4
Stage-3 is a root stage
Stage-4 is a root stage
Stage-0 depends on stages: Stage-2

STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: app_zs_z0404_shop_all_chan_cust_feature_analysis_test
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: shop_id (type: bigint), cust_qty (type: bigint)
outputColumnNames: shop_id, cust_qty
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Group By Operator
aggregations: sum(cust_qty)
keys: shop_id (type: bigint)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: bigint)
sort order: +
Map-reduce partition columns: _col0 (type: bigint)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
value expressions: _col1 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: sum(VALUE._col0)
keys: KEY._col0 (type: bigint)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: _col0 (type: bigint), 999999 (type: int), 999999 (type: int), _col1 (type: bigint)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Union
Statistics: Num rows: 3 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: _col0 (type: bigint), _col1 (type: int), _col2 (type: bigint), _col3 (type: bigint)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 3 Data size: 0 Basic stats: PARTIAL Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 3 Data size: 0 Basic stats: PARTIAL Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TableScan
Union
Statistics: Num rows: 3 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: _col0 (type: bigint), _col1 (type: int), _col2 (type: bigint), _col3 (type: bigint)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 3 Data size: 0 Basic stats: PARTIAL Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 3 Data size: 0 Basic stats: PARTIAL Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TableScan
Union
Statistics: Num rows: 3 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: _col0 (type: bigint), _col1 (type: int), _col2 (type: bigint), _col3 (type: bigint)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 3 Data size: 0 Basic stats: PARTIAL Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 3 Data size: 0 Basic stats: PARTIAL Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: app_zs_z0404_shop_all_chan_cust_feature_analysis_test
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: shop_id (type: bigint), chan_cd (type: int), cust_qty (type: bigint)
outputColumnNames: shop_id, chan_cd, cust_qty
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Group By Operator
aggregations: sum(cust_qty)
keys: shop_id (type: bigint), chan_cd (type: int)
mode: hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: bigint), _col1 (type: int)
sort order: ++
Map-reduce partition columns: _col0 (type: bigint), _col1 (type: int)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
value expressions: _col2 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: sum(VALUE._col0)
keys: KEY._col0 (type: bigint), KEY._col1 (type: int)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: _col0 (type: bigint), _col1 (type: int), 999999 (type: int), _col2 (type: bigint)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

Stage: Stage-4
Map Reduce
Map Operator Tree:
TableScan
alias: app_zs_z0404_shop_all_chan_cust_feature_analysis_test
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: shop_id (type: bigint), chan_cd (type: int), stat_ct (type: bigint), cust_qty (type: bigint)
outputColumnNames: shop_id, chan_cd, stat_ct, cust_qty
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Group By Operator
aggregations: sum(cust_qty)
keys: shop_id (type: bigint), chan_cd (type: int), stat_ct (type: bigint)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: bigint), _col1 (type: int), _col2 (type: bigint)
sort order: +++
Map-reduce partition columns: _col0 (type: bigint), _col1 (type: int), _col2 (type: bigint)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
value expressions: _col3 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: sum(VALUE._col0)
keys: KEY._col0 (type: bigint), KEY._col1 (type: int), KEY._col2 (type: bigint)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink

2.比较结论

从执行计划中不太看得出执行效率高低,可以看出groupingsets是在一个MR中完成的,不确定不同的groupingId是并行执行的还是串行执行的,但是可以看出unionall中不同的groupby组合是并发执行的。

根据前辈的口口相传经验,unionall比groupingsets的并行度更高,效率更高。