HiveSQL实战积累_分时累计趋势图的多种实现方案与比较

HiveSQL实战积累_分时累计趋势图的多种实现方案与比较

分时累计趋势图定义

分时累计趋势图一般是指一天内的累积趋势图,与之对应的就是分时窗口趋势图。比如小时累积的24个点中第一个点就是0-1点的数据,第二个点就是0-2点的数据,每个点都是0点到该点的累积数据;小时窗口的第一个点就是0-1点的数据,第二个点就是1-2点的数据,每个点就是一个被等分的时间窗口中的聚合数据。

分时窗口趋势图比较好出数,直接按照时间前缀等方式group聚合即可,但是分时累计趋势图就比较麻烦了,总结常见的出数方式有下述三种。

流量模块预计算

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
SELECT
dt,
bs,
interval_index,
SUM(item_pv_10min_aggr) AS item_pv_10min_aggr,
SUM(item_uv_10min_aggr) AS item_uv_10min_aggr,
0 AS item_rt_10min_aggr,
0 AS item_login_users_10min_aggr
FROM
(
SELECT
dt,
bs,
interval_index,
SUM(item_pv_10min_aggr) over(partitionBY dt, bs orderby interval_index) AS item_pv_10min_aggr,
0 AS item_uv_10min_aggr
FROM
(
SELECT
dt,
arrayJoin(IF(bs = '7710', ['7710', '77'], [bs])) AS bs,
concat(REGEXP_REPLACE(dt, '-', ''), leftPad(toString(toInt32(toInt32(mins) / 6)), 2, '0'), leftPad(toString(10 *((mins) %6)), 2, '0')) AS interval_index,
SUM(item_pv) item_pv_10min_aggr
FROM
app.app_d14_traffic_plat_item_di_new_opt_aggr_d
WHERE
dt >= '2021-10-31'
AND dt <= '2021-10-31'
AND bs = '1302'
AND cate_bu_id = '6210'
AND mins <= 76
GROUPBY
dt,
bs,
interval_index
)
t

UNIONALL

SELECT
dt,
bs,
interval_index,
0 AS item_pv_10min_aggr,
SUM(item_uv_10min_aggr) over(partitionBY dt, bs orderby interval_index) AS item_uv_10min_aggr
FROM
(
SELECT
dt,
bs,
interval_index,
COUNT( *) AS item_uv_10min_aggr
FROM
(
SELECT
dt,
arrayJoin(IF(bs = '7710', ['7710', '77'], [bs])) AS bs,
bs_browser_uniq_id,
MIN(concat(REGEXP_REPLACE(dt, '-', ''), leftPad(toString(toInt32(toInt32(mins) / 6)), 2, '0'), leftPad(toString(10 *((mins) %6)), 2, '0'))) AS interval_index
FROM
app.app_d14_traffic_plat_item_di_new_opt_aggr_d
WHERE
dt >= '2021-10-31'
AND dt <= '2021-10-31'
AND bs = '1302'
AND cate_bu_id = '6210'
AND mins <= 76
GROUPBY
dt,
bs,
bs_browser_uniq_id
)
t
GROUPBY
dt,
bs,
interval_index
)
tt
)
ttt
GROUPBY
dt,
bs,
interval_index

上述预计算流程中,最关键的就是通过group by+min获取到每个用户的最早登陆时间,然后再通过窗口函数直接累加即可得到去重uv。

主要缺点:

  1. 计算流程复杂,对于去重和非去重指标需要分开按不同逻辑计算,存在多步聚合和指标关联操作。
  2. 存在较为明显的时间断点和uv断点,时间断点是指,某些时间区间内没有pv或uv,导致采用窗口函数累加得到的数据表中,没有该时间维度行;uv断点是指某些时间区间内有pv但是没有uv,导致累加完之后的两指标合并之后,该时间维度行的uv为0或空。这些断点在无论在数据还是服务层面都需要再关联时间维表来补全断点。

流量olap服务出数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
-- 常规写法
SELECT
replaceAll(concat(toString(dt), toString(mins)), '-', '') AS point,
uniqCombinedMerge(item_uv_aggr) over(order by dt, toInt32(mins)) AS item_uv_aggr
FROM
(
SELECT
uniqCombinedState(browser_uniq_id) AS item_uv_aggr,
mins AS mins,
dt AS dt
FROM
app.app_d14_traffic_plat_item_di_new_opt_aggr_d
WHERE
dt > '2022-04-16'
AND dt < '2022-04-17'
GROUP BY
mins,
dt
)

-- 所有指标开一样窗的简略开窗写法
SELECT
*
FROM
(
SELECT
shop,
ckdt,
cate_2,
cate_3,
SUM(m16077) OVER w AS m16077,
uniqMerge(m16078) OVER w AS m16078,
SUM(m16119) OVER w AS m16119,
uniqMerge(m16177) OVER w AS m16177,
uniqMerge(m16399) OVER w AS m16399,
uniqMerge(m19903) OVER w AS m19903,
uniqMerge(m19904) OVER w AS m19904
FROM
(
SELECT
formatDateTime(parseDateTimeBestEffortOrNull(t.request_tm, 'UTC'), '%Y-%m-%d %H:00') AS ckdt,
t.item_second_cate_cd AS cate_2,
t.item_third_cate_cd AS cate_3,
t.shop_id AS shop,
SUM(1) AS m16077,
uniqState(t.browser_uniq_id) AS m16078,
SUM(t.stm_rt) AS m16119,
uniqState(t.sku_id) AS m16177,
uniqState(t.item_id) AS m16399,
uniqState(t.shop_id) AS m19903,
uniqState(t.main_brand_cd) AS m19904
FROM
sz.app_jdr_traffic_sz_item_i_d_d_inoutday_inpop_d AS t
WHERE
t.request_tm >= '2025-01-13 20:00:00'
AND dt >= formatDateTime(toDateTime(subStr('2025-01-13 20:00:00', 1, 10)), '%Y-%m-%d')
AND t.request_tm <= '2025-01-14 23:59:59'
AND dt <= formatDateTime(toDateTime(subStr('2025-01-14 23:59:59', 1, 10)), '%Y-%m-%d')
GROUP BY
formatDateTime(parseDateTimeBestEffortOrNull(t.request_tm, 'UTC'), '%Y-%m-%d %H:00'),
t.item_second_cate_cd,
t.item_third_cate_cd,
t.shop_id
) AS t WINDOW w AS (partition by cate_2, cate_3, shop order by ckdt ASC rows between unbounded preceding and CURRENT ROW
) SETTINGS allow_experimental_window_functions = 1
) t

依赖clickhouse引擎的uniqState或者uniqCombinedState函数生成bitmap先计算分时聚合数据,然后再用开窗函数做分时累计。

主要缺点:

  1. 需要一层嵌套子查询。
  2. hive中不支持去重计数开窗函数,clickhouse利用了bitmap存储去重数据才支持的。
  3. 同样存在时间断点问题,需要在服务层中再关联时间维表补全断点。

交易模块预计算

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
SELECT
t.dt dt,
time_granularity,
t.shop_id shop_id,
SUM(t.deal_ord_amt) deal_amt,
COUNT(DISTINCT t.sku_id) sku_num
FROM
app.app_zh_d0705t04_pre_sale_ord_di t --测试表
lateral view explode(split('00:00,01:00,02:00,03:00,04:00,05:00,06:00,07:00,08:00,09:00,10:00,11:00,12:00,13:00,14:00,15:00,16:00,17:00,18:00,19:00,20:00,21:00,22:00,23:00',',')) time_granularity time_granularity
WHERE
t.dt = '2023-07-11'
AND t.shop_id = '77339'
AND t.ord_deal_flag = 1
AND CAST(concat(SUBSTR(TRIM(CAST(t.ord_deal_tm AS string)), 12, 3), '00') AS string) <= time_granularity
GROUP BY
t.dt,
time_granularity,
t.shop_id

将明细数据关联时间维表或侧视图,然后通过where条件过滤出各个累计时间段内的明细数据。

主要缺点:

  1. 明细数据爆炸造成计算压力较大,加重数据倾斜,在十分和小时情况下数据分别膨胀144倍和24倍还能接受,但如果是分钟粒度或者秒级粒度数据量难以接受。

通用数据服务生产链路中暂定以lateral view关联时间维表的方式支持十分累计、小时累计两种预计算粒度。