ClickHouse_ClickHouse服务中间态&本地去重优化实战

ClickHouse_ClickHouse服务中间态&本地去重优化实战

一、ck函数原理

1.中间态函数原理

uniqState()

uniqState(element):该函数用于将传入的元素转化为去重中间聚合状态AggregateFunction state,就是将去重指标的字段元素以bitmap的形式存储起来,以便在此基础上再次进行聚合计算。一般输入就是string类型,输出就是AggregateFunction中间态类型。

uniqMergeState()

uniqMergeState(states):该函数用于将通过uniqState函数获得的多个中间状态合并为一个中间状态,就是再次去重再以bitmap的形式存储起来。输入是AggregateFunction中间态类型,输出也是AggregateFunction中间态类型。

uniqMerge()

uniqMerge(state):该函数用于合并由 uniqState() 产生的多个聚合状态,并计算出最终的去重计数结果,就是输出最终的去重结果数。输入是AggregateFunction中间态类型,输出通常是UInt64类型。

中间态字段类型示例

1
2
3
4
5
6
7
CREATE TABLE bc_online.ck_zh_industry_product_union_all_agg
(
`dateTime` String,
`skuId` String,
`browseAcct` AggregateFunction(uniq, Nullable(String))
)
ENGINE = ...

2.distributed_group_by_no_merge原理

ck的settings distributed_group_by_no_merge=1用于加在查询分布式表聚合语句后面,作用是使在每个节点上直接对本地表执行聚合语句,然后直接把聚合结果返回给命令下发节点,然后命令下发节点直接把每个节点的聚合结果union all在一起,返回给用户,不会再进行聚合计算。

用法示例:

1
2
3
4
5
SELECT
A,COUNT() as cnt
FROM
table
GROUP BY A settings distributed_group_by_no_merge = 1

二、借助中间态+本地去重优化实战

1.优化思路

常见的一个ck查询分布式表去重指标sql如下:

1
2
3
4
5
SELECT
shop_id,uniq(browseAcct) as uv
FROM
table_d
GROUP BY shop_id

命令下发节点在收到这个sql之后会将扫描本地表table的命令下发到各个节点,然后每个节点会把browseAcct列所有行的数据都返回给命令下发节点,由下发节点统一进行去重计数。

借助中间态,先在本地节点将browseAcct的所有数据进行一次去重,再传输给命令下发节点进行整体去重,这样可以减少数据传输量,而且将去重计算进行了分散,在大部分场景下可以提升查询效率。使用中间态+本地去重的优化sql如下:

1
2
3
4
5
6
7
8
9
10
11
12
select 
shop_id,uniqMerge(uv) as uv
from
(
SELECT
shop_id,uniqState(browseAcct) as uv
FROM
table_d
GROUP BY shop_id settings distributed_group_by_no_merge = 1
)
group by
shop_id

2.优化实战

实战案例还是行业模块的一个sql的优化,优化前sql如下,执行时长1.6s:

(注意:此处bc_online.ck_zh_industry_product_union_all_agg_d表中存储的browseAcct字段本身就是一个上卷中间态字段)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
select
BrandName,
UV,
PV,
DealAmt,
DealProNum,
rowNumberInAllBlocks()+ 1 AS RankRound,
BrandId
from
(
SELECT
dictGet('bc_online.mainbrand_dic','mainGroupName',sonBrandId) AS BrandName,
dictGet('bc_online.mainbrand_dic','mainBrandId',sonBrandId) AS BrandId,
uniqMerge(browseAcct) AS UV,
sum(pv) AS PV,
sum(dealAmt) AS DealAmt,
sum(dealProNum) AS DealProNum,
FROM
bc_online.ck_zh_industry_product_union_all_agg_d
WHERE
dateTime >= '2022-10-12' AND dateTime <= '2022-11-15' AND secondIndId IN ('1343') AND shopType IN ('POP', 'B2C')
GROUP BY
BrandId,
BrandName
ORDER BY
DealAmt DESC
LIMIT 10
);

此处bc_online.ck_zh_industry_product_union_all_agg_d表中存储的browseAcct字段就是一个上卷中间态字段,所以此处使用uniqMergeState函数来做中间态聚合为中间态,优化后sql执行时长1.9s。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
SELECT
Brandname,
uniqMerge(UV) AS UV,
sum(PV) AS PV,
sum(DealAmt) AS DealAmt,
sum(DealProNum) AS DealProNum,
row_number() over(order by DealAmt desc) as RankRound,
BrandId
FROM
(
SELECT
dictGet('bc_online.mainbrand_dic', 'mainBrandId', sonBrandId) AS BrandId,
dictGet('bc_online.mainbrand_dic', 'mainGroupName', sonBrandId) AS Brandname,
uniqMergeState(browseAcct) AS UV,
sum(pv) AS PV,
sum(dealProNum) AS DealProNum,
sum(dealAmt) AS DealAmt
FROM bc_online.ck_zh_industry_product_union_all_agg_d
WHERE (dateTime >= '2022-10-12') AND (dateTime <= '2022-11-15') AND (secondIndId IN ('1343')) AND (shopType IN ('POP', 'B2C'))
GROUP BY
BrandId,
Brandname
SETTINGS distributed_group_by_no_merge = 1
)
GROUP BY
BrandId,
Brandname
ORDER BY DealAmt DESC
LIMIT 10;

比较优化前后发现执行时长反而更长了,说明中间态+本地去重的优化方式在很多场景下还是不太有效的,都是得搬运大量数据到命令下发节点做最终去重。