HiveSQL实战积累_groupby&unionall与leftjoin效率比较

HiveSQL实战积累_groupby&unionall与leftjoin效率比较

1.比较sql实例

分别依赖流量明细、关注明细、加购明细三张源表,加工得到店铺粒度的uv、pv、关注、加购四个指标数据。使用groupby&unionall和leftjoin两种写法得到的结果数据是一样的。

1.1 groupby&unionall

使用union all合并不同源表加工数据然后group by sum聚合到一起的sql实例:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
SELECT
a.shop_id,
SUM(a.uv) uv,
SUM(a.pv) pv,
SUM(a.concern) concern,
SUM(a.cart) cart
FROM
(
SELECT
shop_id,
COUNT(DISTINCT brws_uniq_id) uv,
COUNT(brws_uniq_id) pv,
0 concern,
0 cart
FROM
adm.adm_sch_d14_zs_all_chan_shop_traffics_di
WHERE
dt = '2023-03-18'
AND itm_sku_id > 0
GROUP BY
shop_id

UNION ALL

SELECT
shop_id,
0 uv,
0 pv,
COUNT(DISTINCT itm_sku_id) concern,
0 cart
FROM
gdm.gdm_m13_zs_shop_product_follow_det
WHERE
dt = '2023-03-18'
AND itm_sku_id > 0
GROUP BY
shop_id

UNION ALL

SELECT
shop_id,
0 uv,
0 pv,
0 concern,
SUM(add_pro_qty) cart
FROM
gdm.gdm_m14_zs_shop_product_add_cart_det
WHERE
dt = '2023-03-18'
AND itm_sku_id > 0
AND add_cart_typ = 1
GROUP BY
shop_id
)
a
GROUP BY
a.shop_id

1.2 leftjoin

使用full join合并不同源表加工数据的sql实例:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
SELECT
COALESCE(a.shop_id, b.shop_id, c.shop_id) shop_id,
a.uv uv,
a.pv pv,
b.concern concern,
c.cart cart
FROM
(
SELECT
shop_id,
COUNT(DISTINCT brws_uniq_id) uv,
COUNT(brws_uniq_id) pv
FROM
adm.adm_sch_d14_zs_all_chan_shop_traffics_di
WHERE
dt = '2023-03-18'
AND itm_sku_id > 0
GROUP BY
shop_id
)
a
FULL JOIN
(
SELECT
shop_id,
COUNT(DISTINCT itm_sku_id) concern
FROM
gdm.gdm_m13_zs_shop_product_follow_det
WHERE
dt = '2023-03-18'
AND itm_sku_id > 0
GROUP BY
shop_id
)
b
ON
a.shop_id = b.shop_id
FULL JOIN
(
SELECT
shop_id,
SUM(add_pro_qty) cart
FROM
gdm.gdm_m14_zs_shop_product_add_cart_det
WHERE
dt = '2023-03-18'
AND itm_sku_id > 0
AND add_cart_typ = 1
GROUP BY
shop_id
)
c
ON
a.shop_id = c.shop_id

2.explain执行计划比较

2.1 groupby&unionall

groupby&unionall执行计划:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1, Stage-3, Stage-4
Stage-3 is a root stage
Stage-4 is a root stage
Stage-0 depends on stages: Stage-2

STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: adm_sch_d14_zs_all_chan_shop_traffics_di
Statistics: Num rows: 3215094512 Data size: 372950963416 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((dt = '2023-03-18') and (itm_sku_id > 0)) (type: boolean)
Statistics: Num rows: 1071698170 Data size: 124316987728 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: shop_id (type: bigint), brws_uniq_id (type: string)
outputColumnNames: shop_id, brws_uniq_id
Statistics: Num rows: 1071698170 Data size: 124316987728 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(DISTINCT brws_uniq_id), count(brws_uniq_id)
keys: shop_id (type: bigint), brws_uniq_id (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1071698170 Data size: 124316987728 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: bigint), _col1 (type: string)
sort order: ++
Map-reduce partition columns: _col0 (type: bigint)
Statistics: Num rows: 1071698170 Data size: 124316987728 Basic stats: COMPLETE Column stats: NONE
value expressions: _col3 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: count(DISTINCT KEY._col1:0._col0), count(VALUE._col1)
keys: KEY._col0 (type: bigint)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 535849085 Data size: 62158493864 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: bigint), _col1 (type: bigint), _col2 (type: bigint), UDFToLong(0) (type: bigint), 0 (type: int)
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Statistics: Num rows: 535849085 Data size: 62158493864 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Union
Statistics: Num rows: 555371524 Data size: 70055386943 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: bigint), _col1 (type: bigint), _col2 (type: bigint), _col3 (type: bigint), _col4 (type: bigint)
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Statistics: Num rows: 555371524 Data size: 70055386943 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: sum(_col1), sum(_col2), sum(_col3), sum(_col4)
keys: _col0 (type: bigint)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Statistics: Num rows: 555371524 Data size: 70055386943 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: bigint)
sort order: +
Map-reduce partition columns: _col0 (type: bigint)
Statistics: Num rows: 555371524 Data size: 70055386943 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint), _col2 (type: bigint), _col3 (type: bigint), _col4 (type: bigint)
TableScan
Union
Statistics: Num rows: 555371524 Data size: 70055386943 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: bigint), _col1 (type: bigint), _col2 (type: bigint), _col3 (type: bigint), _col4 (type: bigint)
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Statistics: Num rows: 555371524 Data size: 70055386943 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: sum(_col1), sum(_col2), sum(_col3), sum(_col4)
keys: _col0 (type: bigint)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Statistics: Num rows: 555371524 Data size: 70055386943 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: bigint)
sort order: +
Map-reduce partition columns: _col0 (type: bigint)
Statistics: Num rows: 555371524 Data size: 70055386943 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint), _col2 (type: bigint), _col3 (type: bigint), _col4 (type: bigint)
TableScan
Union
Statistics: Num rows: 555371524 Data size: 70055386943 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: bigint), _col1 (type: bigint), _col2 (type: bigint), _col3 (type: bigint), _col4 (type: bigint)
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Statistics: Num rows: 555371524 Data size: 70055386943 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: sum(_col1), sum(_col2), sum(_col3), sum(_col4)
keys: _col0 (type: bigint)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Statistics: Num rows: 555371524 Data size: 70055386943 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: bigint)
sort order: +
Map-reduce partition columns: _col0 (type: bigint)
Statistics: Num rows: 555371524 Data size: 70055386943 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint), _col2 (type: bigint), _col3 (type: bigint), _col4 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: sum(VALUE._col0), sum(VALUE._col1), sum(VALUE._col2), sum(VALUE._col3)
keys: KEY._col0 (type: bigint)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Statistics: Num rows: 277685762 Data size: 35027693471 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 277685762 Data size: 35027693471 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: gdm_m13_zs_shop_product_follow_det
Statistics: Num rows: 6867732 Data size: 2916252794 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((dt = '2023-03-18') and (itm_sku_id > 0)) (type: boolean)
Statistics: Num rows: 2289244 Data size: 972084264 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: shop_id (type: bigint), itm_sku_id (type: bigint)
outputColumnNames: shop_id, itm_sku_id
Statistics: Num rows: 2289244 Data size: 972084264 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(DISTINCT itm_sku_id)
keys: shop_id (type: bigint), itm_sku_id (type: bigint)
mode: hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 2289244 Data size: 972084264 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: bigint), _col1 (type: bigint)
sort order: ++
Map-reduce partition columns: _col0 (type: bigint)
Statistics: Num rows: 2289244 Data size: 972084264 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Group By Operator
aggregations: count(DISTINCT KEY._col1:0._col0)
keys: KEY._col0 (type: bigint)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1144622 Data size: 486042132 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: bigint), UDFToLong(0) (type: bigint), UDFToLong(0) (type: bigint), _col1 (type: bigint), 0 (type: int)
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Statistics: Num rows: 1144622 Data size: 486042132 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

Stage: Stage-4
Map Reduce
Map Operator Tree:
TableScan
alias: gdm_m14_zs_shop_product_add_cart_det
Statistics: Num rows: 220533812 Data size: 88930214596 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (((dt = '2023-03-18') and (itm_sku_id > 0)) and (add_cart_typ = 1)) (type: boolean)
Statistics: Num rows: 36755635 Data size: 14821702298 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: shop_id (type: bigint), add_pro_qty (type: bigint)
outputColumnNames: shop_id, add_pro_qty
Statistics: Num rows: 36755635 Data size: 14821702298 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: sum(add_pro_qty)
keys: shop_id (type: bigint)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 36755635 Data size: 14821702298 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: bigint)
sort order: +
Map-reduce partition columns: _col0 (type: bigint)
Statistics: Num rows: 36755635 Data size: 14821702298 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: sum(VALUE._col0)
keys: KEY._col0 (type: bigint)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 18377817 Data size: 7410850947 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: bigint), UDFToLong(0) (type: bigint), UDFToLong(0) (type: bigint), UDFToLong(0) (type: bigint), _col1 (type: bigint)
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Statistics: Num rows: 18377817 Data size: 7410850947 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink

2.2 leftjoin

fulljoin执行计划:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
Explain
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1, Stage-3, Stage-4
Stage-3 is a root stage
Stage-4 is a root stage
Stage-0 depends on stages: Stage-2

STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: adm_sch_d14_zs_all_chan_shop_traffics_di
Statistics: Num rows: 3215094512 Data size: 372950963416 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((dt = '2023-03-18') and (itm_sku_id > 0)) (type: boolean)
Statistics: Num rows: 1071698170 Data size: 124316987728 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: shop_id (type: bigint), brws_uniq_id (type: string)
outputColumnNames: shop_id, brws_uniq_id
Statistics: Num rows: 1071698170 Data size: 124316987728 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(DISTINCT brws_uniq_id), count(brws_uniq_id)
keys: shop_id (type: bigint), brws_uniq_id (type: string)
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1071698170 Data size: 124316987728 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: bigint), _col1 (type: string)
sort order: ++
Map-reduce partition columns: _col0 (type: bigint)
Statistics: Num rows: 1071698170 Data size: 124316987728 Basic stats: COMPLETE Column stats: NONE
value expressions: _col3 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: count(DISTINCT KEY._col1:0._col0), count(VALUE._col1)
keys: KEY._col0 (type: bigint)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 535849085 Data size: 62158493864 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
key expressions: _col0 (type: bigint)
sort order: +
Map-reduce partition columns: _col0 (type: bigint)
Statistics: Num rows: 535849085 Data size: 62158493864 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint), _col2 (type: bigint)
TableScan
Reduce Output Operator
key expressions: _col0 (type: bigint)
sort order: +
Map-reduce partition columns: _col0 (type: bigint)
Statistics: Num rows: 1144622 Data size: 486042132 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint)
TableScan
Reduce Output Operator
key expressions: _col0 (type: bigint)
sort order: +
Map-reduce partition columns: _col0 (type: bigint)
Statistics: Num rows: 18377817 Data size: 7410850947 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint)
Reduce Operator Tree:
Join Operator
condition map:
Outer Join 0 to 1
Outer Join 0 to 2
keys:
0 _col0 (type: bigint)
1 _col0 (type: bigint)
2 _col0 (type: bigint)
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
Statistics: Num rows: 1178868012 Data size: 136748689464 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: COALESCE(_col0,_col3,_col5) (type: bigint), _col1 (type: bigint), _col2 (type: bigint), _col4 (type: bigint), _col6 (type: bigint)
outputColumnNames: _col0, _col1, _col2, _col3, _col4
Statistics: Num rows: 1178868012 Data size: 136748689464 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1178868012 Data size: 136748689464 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: gdm_m13_zs_shop_product_follow_det
Statistics: Num rows: 6867732 Data size: 2916252794 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: ((dt = '2023-03-18') and (itm_sku_id > 0)) (type: boolean)
Statistics: Num rows: 2289244 Data size: 972084264 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: shop_id (type: bigint), itm_sku_id (type: bigint)
outputColumnNames: shop_id, itm_sku_id
Statistics: Num rows: 2289244 Data size: 972084264 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(DISTINCT itm_sku_id)
keys: shop_id (type: bigint), itm_sku_id (type: bigint)
mode: hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 2289244 Data size: 972084264 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: bigint), _col1 (type: bigint)
sort order: ++
Map-reduce partition columns: _col0 (type: bigint)
Statistics: Num rows: 2289244 Data size: 972084264 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Group By Operator
aggregations: count(DISTINCT KEY._col1:0._col0)
keys: KEY._col0 (type: bigint)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1144622 Data size: 486042132 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

Stage: Stage-4
Map Reduce
Map Operator Tree:
TableScan
alias: gdm_m14_zs_shop_product_add_cart_det
Statistics: Num rows: 220533812 Data size: 88930214596 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (((dt = '2023-03-18') and (itm_sku_id > 0)) and (add_cart_typ = 1)) (type: boolean)
Statistics: Num rows: 36755635 Data size: 14821702298 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: shop_id (type: bigint), add_pro_qty (type: bigint)
outputColumnNames: shop_id, add_pro_qty
Statistics: Num rows: 36755635 Data size: 14821702298 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: sum(add_pro_qty)
keys: shop_id (type: bigint)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 36755635 Data size: 14821702298 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: bigint)
sort order: +
Map-reduce partition columns: _col0 (type: bigint)
Statistics: Num rows: 36755635 Data size: 14821702298 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: sum(VALUE._col0)
keys: KEY._col0 (type: bigint)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 18377817 Data size: 7410850947 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink

3.比较结论

比较上述两个sql的执行计划,还有将fulljoin换成leftjoin或innerjoin的执行计划,可以发现groupby&unionall与各种join的执行计划都是一样的,多个unionall的sql是并行,多个left的sql也是并行的,所以在效率上其实是差不多的。

那么在实际项目中具体采用哪种写法,主要考虑的就是key的包含关系,以上述例子中的流量数据和关注数据为例:如果存在无流量但是有关注店铺,也存在有流量但是无关注店铺,而最终我们希望只要有流量或者关注其一的店铺就进入结果表,那么这种场景下使用groupby&unionall会方便快捷很多。如果我们能确认有流量店铺的范围一定完全包含了有关注店铺,那么使用groupby&unionall或者流量结果leftjoin关注结果都可以,只不过leftjoin写法会更简洁一些。