HBase Table Data Skew Treatment: HBase Utility-Class Decoding vs. Hexadecimal Decoding When Mapping an HBase Snapshot to HDFS

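1. Background

During HBase skew treatment, reading the HBase snapshot data and bulk loading it are split into two steps. The first step reads the HBase data and stores it in text files on HDFS. The second step reads those text files from HDFS and bulk loads the data back into HBase.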

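2. HBase's Built-in Encoding and Decoding

HBase's built-in Bytes utility class defines encoding and decoding methods for seven data types: String, boolean, double, float, int, long, and short.

  • String is encoded with UTF-8
  • boolean is encoded as a byte[] of length 1
  • double is encoded as a byte[] of length 8
  • float is encoded as a byte[] of length 4
  • int is encoded as a byte[] of length 4
  • long is encoded as a byte[] of length 8
  • short is encoded as a byte[] of length 2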
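
As a rough illustration of the fixed widths listed above, the sketch below reproduces the same layouts with the JDK's ByteBuffer (big-endian by default), so that it runs without HBase on the classpath. The class FixedWidthEncodingDemo and its helper methods are hypothetical stand-ins written for this example; real code would call Bytes.toBytes(...) and Bytes.toInt(...) instead.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical JDK-only stand-in for the fixed-width layouts of HBase's Bytes class.
final class FixedWidthEncodingDemo {
    static byte[] encodeInt(int v) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(v).array();
    }

    static byte[] encodeLong(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    static byte[] encodeShort(short v) {
        return ByteBuffer.allocate(Short.BYTES).putShort(v).array();
    }

    static byte[] encodeDouble(double v) {
        return ByteBuffer.allocate(Double.BYTES).putDouble(v).array();
    }

    static byte[] encodeString(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static int decodeInt(byte[] raw) {
        return ByteBuffer.wrap(raw).getInt();
    }

    public static void main(String[] args) {
        // The same number takes 4 bytes as an int but 10 bytes as a UTF-8 string.
        System.out.println("int    -> " + encodeInt(1234567890).length + " bytes");
        System.out.println("string -> " + encodeString("1234567890").length + " bytes");
        System.out.println("long   -> " + encodeLong(1234567890L).length + " bytes");
        System.out.println("short  -> " + encodeShort((short) 7).length + " bytes");
        System.out.println("double -> " + encodeDouble(0.5).length + " bytes");
        System.out.println("round trip: " + decodeInt(encodeInt(1234567890)));
    }
}
```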

3. Decoding with HBase's Built-in Methods

3.1 Reading Approach

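import java.io.IOException;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class GetHbaseMapper extends TableMapper<Text, Text> {
    private final Text writeKey = new Text();
    private final Text writeValue = new Text();

    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context)
            throws IOException, InterruptedException {
        // listCells() returns null for an empty Result, so guard against both cases.
        if (value == null || value.isEmpty()) {
            return;
        }

        StringBuilder res = new StringBuilder();
        for (Cell cell : value.listCells()) {
            // Both the qualifier and the value are decoded directly with Bytes.toString() (UTF-8).
            String colName = Bytes.toString(cell.getQualifierArray(), cell.getQualifierOffset(), cell.getQualifierLength());
            // Replace line feed, carriage return, tab, the 0x01 control character and the literal "\N" with a space.
            String colValue = Bytes.toString(cell.getValueArray(), cell.getValueOffset(), cell.getValueLength())
                    .replaceAll("\\n|\\r|\\t|\1|\\\\N", " ");
            res.append(colName).append(':').append(colValue).append('\t');
        }

        // Decode only the rowkey's own bytes; key.get() may expose a larger backing array.
        writeKey.set(Bytes.toString(key.get(), key.getOffset(), key.getLength()));
        writeValue.set(res.toString());

        // The default separator between key and value in the mapper's text output is "\t".
        context.write(writeKey, writeValue);
    }
}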
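  • When a mapper writes its output to a text file, the default separator between key and value is "\t".
  • "\n" is the line feed, "\t" is the tab character, "\r" is the carriage return, and "\f" is the form feed.

The text file is therefore laid out as follows: the key starts the line and is separated from the values by a tab, and the values are separated from one another by tabs as well.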

3.2 Result

Resulting text file:

UV4�%	ChangeRate: 	ImpactIndex: 	OrdCustNum: �	OrdCustNumDaliyRate: ?�Y��C��	Value: 	
UV4�% ChangeRate: ImpactIndex: OrdCustNum: OrdCustNumDaliyRate: ��7{S�.p Value:
UV4�% ChangeRate: ImpactIndex: OrdCustNum: � OrdCustNumDaliyRate: ��~��E Value:
UV4�& ChangeRate: ImpactIndex: OrdCustNum: � OrdCustNumDaliyRate: ?���iao2 Value:
UV4�& ChangeRate: ImpactIndex: OrdCustNum: OrdCustNumDaliyRate: ���,#Or� Value:
UV4�& ChangeRate: ImpactIndex: OrdCustNum: � OrdCustNumDaliyRate: ��$��� Value:
UV4�' ChangeRate: ImpactIndex: OrdCustNum: � OrdCustNumDaliyRate: ?�s� O� Value:
UV4�' ChangeRate: ImpactIndex: OrdCustNum:
OrdCustNumDaliyRate: ��N\�q Value:
UV4�' ChangeRate: ImpactIndex: OrdCustNum: � OrdCustNumDaliyRate: ��K�ڌ�= Value:
UV4�( ChangeRate: ImpactIndex: OrdCustNum: K OrdCustNumDaliyRate: ?���RG� Value:
UV4�( ChangeRate: ImpactIndex: OrdCustNum:
OrdCustNumDaliyRate: ���q�r Value:
UV4�( ChangeRate: ImpactIndex: OrdCustNum: 8 OrdCustNumDaliyRate: �����S Value:
OrdAmt4�% ChangeRate: ?��x�c{� ImpactIndex: @P�^���@ OrdCustNum: � OrdCustNumDaliyRate: ?�Y��C�� Value: AT�%<(��
OrdAmt4�% ChangeRate: ����ΐ ImpactIndex: ��E{
D�� OrdCustNum: OrdCustNumDaliyRate: ��7{S�.p Value: @yy�����
OrdAmt4�% ChangeRate: ?��7 EHE ImpactIndex: @?�?�0� OrdCustNum: � OrdCustNumDaliyRate: ��~��E Value: A'�]�=p�
OrdAmt4�& ChangeRate: ?�[]�.� ImpactIndex: @U}��a� OrdCustNum: � OrdCustNumDaliyRate: ?���iao2 Value: AY�4��G�
OrdAmt4�& ChangeRate: ��u.
��� ImpactIndex: ����s�% OrdCustNum: OrdCustNumDaliyRate: ���,#Or� Value: @��fffff
OrdAmt4�& ChangeRate: ?�sD=�/ ImpactIndex: @+����|? OrdCustNum: � OrdCustNumDaliyRate: ��$��� Value: A&��p��

OrdAmt4�' ChangeRate: ?�HF0|f� ImpactIndex: @UָH�eH OrdCustNum: � OrdCustNumDaliyRate: ?�s� O� Value: A]�f-p��
OrdAmt4�' ChangeRate: �߿��2 ImpactIndex: ����+F� OrdCustNum:
OrdCustNumDaliyRate: ��N\�q Value: @�
33332

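Every line contains badly garbled characters, and in some cases the garbled output even contains a line break, so that one row is split across two lines. This text file still has to be read back and written into HBase, yet there is no consistent rule for splitting each line into its rowkey and its values. On top of that, the logic that replaced \n, \r, \t, \1, and \N with spaces while reading the snapshot has directly altered the data.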

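3.3 Root Cause

The root cause is that many rowkey fields and values in our precomputed data were encoded to byte[] directly from non-String types such as int and long before being stored in HBase. Decoding those bytes as UTF-8 strings inevitably yields characters that carry special meaning, so the decoded data can't be stored in the format we expected.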
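
The effect can be reproduced in a few lines. The sketch below is a JDK-only illustration (the class LossyUtf8Demo and its helpers are hypothetical, and ByteBuffer stands in for Bytes.toBytes(int)): the int 10 decodes to a string containing a line feed, the int 9 to one containing a tab, and an int whose bytes aren't valid UTF-8 can't be recovered from the decoded string at all.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo: what happens when fixed-width int bytes are decoded as UTF-8 text.
final class LossyUtf8Demo {
    static byte[] encodeInt(int v) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(v).array();
    }

    static String decodeAsUtf8(byte[] raw) {
        // Malformed sequences are silently replaced with U+FFFD.
        return new String(raw, StandardCharsets.UTF_8);
    }

    static boolean survivesRoundTrip(byte[] raw) {
        byte[] back = decodeAsUtf8(raw).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(raw, back);
    }

    public static void main(String[] args) {
        // 10 is stored as 00 00 00 0A, and 0x0A is the line feed character.
        System.out.println("10 contains \\n: " + decodeAsUtf8(encodeInt(10)).contains("\n"));
        // 9 is stored as 00 00 00 09, and 0x09 is the tab character.
        System.out.println("9 contains \\t: " + decodeAsUtf8(encodeInt(9)).contains("\t"));
        // -2 is stored as FF FF FF FE, which is not valid UTF-8, so the bytes are lost.
        System.out.println("-2 round trip ok: " + survivesRoundTrip(encodeInt(-2)));
    }
}
```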

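If the rowkeys and values stored in HBase were all of type String, decoding them this way would return the original, human-readable data, with no meaningless garbage or special characters. Any \n, \r, \t, \1, or \N in the decoded text would be separators that we had stored on purpose, so replacing them with spaces would not change the meaning of the data.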

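Why, then, do we encode int and long values directly when storing precomputed data instead of converting them to String first? First, when the values are large (many digits), encoding them directly as a 4-byte or 8-byte byte[] saves storage. Second, a rowkey built from byte[] fields encoded directly from int, long, and other non-String types has a fixed length, which keeps range scans precise and avoids dirty data caused by fields of varying length being concatenated ambiguously.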
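
The second point can be made concrete with a small comparison. HBase orders rowkeys lexicographically by unsigned byte value; the sketch below imitates that ordering with a hand-written comparator (RowkeyOrderingDemo and its methods are hypothetical names, and ByteBuffer again stands in for Bytes.toBytes(int)). It only considers non-negative values, since the sign bit changes the picture for negative ones.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: byte-wise ordering of string-encoded vs. fixed-width int keys.
final class RowkeyOrderingDemo {
    static byte[] encodeInt(int v) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(v).array();
    }

    static byte[] encodeString(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Lexicographic comparison on unsigned bytes, shorter array first on a tie.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // As strings, "9" sorts after "10" because '9' > '1'.
        System.out.println(compareUnsigned(encodeString("9"), encodeString("10")) > 0);
        // As fixed-width ints, 9 sorts before 10, matching numeric order.
        System.out.println(compareUnsigned(encodeInt(9), encodeInt(10)) < 0);
    }
}
```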

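3.4 Solutions

Approach 1:

Store a separate copy of the HBase table's metadata that records the data type of each rowkey field and each value. When parsing the HBase snapshot, fetch this metadata and decode each field with the method that matches its type. This restores the original, meaningful rowkey and value data.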
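
A minimal sketch of what such type-driven decoding could look like is shown below. Everything in it is an assumption made for illustration: the FieldType enum, the schema map, and the types assigned to the columns are hypothetical and don't reflect the real table's schema, and ByteBuffer stands in for the Bytes.toInt/toLong/toDouble calls that real code would use.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of Approach 1: pick the decoder from per-column type metadata.
final class TypedDecodeDemo {
    enum FieldType { STRING, INT, LONG, DOUBLE }

    static byte[] encodeInt(int v) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(v).array();
    }

    static byte[] encodeDouble(double v) {
        return ByteBuffer.allocate(Double.BYTES).putDouble(v).array();
    }

    static String decode(byte[] raw, FieldType type) {
        switch (type) {
            case INT:
                return String.valueOf(ByteBuffer.wrap(raw).getInt());
            case LONG:
                return String.valueOf(ByteBuffer.wrap(raw).getLong());
            case DOUBLE:
                return String.valueOf(ByteBuffer.wrap(raw).getDouble());
            default:
                return new String(raw, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        // Assumed schema, for illustration only.
        Map<String, FieldType> schema = new LinkedHashMap<>();
        schema.put("OrdCustNum", FieldType.INT);
        schema.put("ChangeRate", FieldType.DOUBLE);

        Map<String, byte[]> row = new LinkedHashMap<>();
        row.put("OrdCustNum", encodeInt(42));
        row.put("ChangeRate", encodeDouble(1.5));

        for (Map.Entry<String, byte[]> e : row.entrySet()) {
            System.out.println(e.getKey() + ":" + decode(e.getValue(), schema.get(e.getKey())));
        }
    }
}
```

The cost of this approach is that the metadata has to be kept in sync with the table, which Approach 2 avoids.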

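Approach 2:

Our goal was never to read the data in the text file directly, but to write it back into HBase. It therefore doesn't matter whether the text is human-readable, as long as it can be split correctly when it is written back. We can encode the snapshot data as hexadecimal strings, so that every rowkey and value consists only of the characters 0-9 and a-f and can't contain any separator character.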
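
The two properties this approach relies on, a separator-free alphabet and a lossless round trip, can be checked with a short JDK-only sketch. HexSafetyDemo and its helpers are hypothetical names written for this example and aren't the utility class used by the job.

```java
import java.util.Arrays;

// Hypothetical demo: hex text never contains separators and always decodes back exactly.
final class HexSafetyDemo {
    static String toHex(byte[] raw) {
        StringBuilder sb = new StringBuilder(raw.length * 2);
        for (byte b : raw) {
            sb.append(String.format("%02x", b & 0xFF)); // two lowercase hex digits per byte
        }
        return sb.toString();
    }

    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(i * 2, i * 2 + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        // Bytes that would be NUL, tab, line feed and invalid UTF-8 if decoded as text.
        byte[] raw = {0x00, 0x09, 0x0A, (byte) 0xFF};
        String hex = toHex(raw);
        System.out.println(hex);                                   // 00090aff
        System.out.println(hex.matches("[0-9a-f]+"));              // true
        System.out.println(Arrays.equals(raw, fromHex(hex)));      // true
    }
}
```

The price is size: every byte becomes two characters in the intermediate file.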

4. Decoding as Hexadecimal

4.1 Reading Approach

import java.io.IOException;
import java.util.Locale;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class ReverseRowkeyMapper extends TableMapper<Text, Text> {
    private String firstRowkeyType;
    private final Text writeKey = new Text();
    private final Text writeValue = new Text();

    @Override
    protected void setup(Context context) {
        Configuration configuration = context.getConfiguration();
        // Default to an empty string so that a missing setting leaves the rowkey unchanged.
        firstRowkeyType = configuration.get("first.rowkey.type", "");
    }

    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context)
            throws IOException, InterruptedException {
        // listCells() returns null for an empty Result, so guard against both cases.
        if (value == null || value.isEmpty()) {
            return;
        }

        // copyBytes() returns a private copy, so the in-place reversal can't touch the input buffer.
        byte[] rowkey = reverseFirstField(key.copyBytes(), firstRowkeyType);

        StringBuilder res = new StringBuilder();
        for (Cell cell : value.listCells()) {
            String colName = Bytes.toString(cell.getQualifierArray(), cell.getQualifierOffset(), cell.getQualifierLength());
            byte[] colValue = CellUtil.cloneValue(cell);
            // Write the value as a hexadecimal string.
            res.append(colName).append(':').append(StringHexUtils.bytesToHexString(colValue)).append('\t');
        }

        // Write the rowkey as a hexadecimal string.
        writeKey.set(StringHexUtils.bytesToHexString(rowkey));
        writeValue.set(res.toString());

        // The default separator between key and value in the mapper's text output is "\t".
        context.write(writeKey, writeValue);
    }

    // Bit-reverses the first field of the rowkey in place and returns the same array.
    public byte[] reverseFirstField(byte[] rowkey, String firstRowkeyType) {
        String type = firstRowkeyType == null ? "" : firstRowkeyType.toUpperCase(Locale.ROOT);
        int firstFieldLength;
        if ("INT".equals(type)) { // only int and long are handled for now
            firstFieldLength = Bytes.SIZEOF_INT;
        } else if ("LONG".equals(type)) {
            firstFieldLength = Bytes.SIZEOF_LONG;
        } else {
            // STRING needs separate parsing and is not implemented yet; other types are left unchanged.
            return rowkey;
        }

        // A rowkey shorter than the first field can't be reversed; leave it unchanged.
        if (rowkey.length < firstFieldLength) {
            return rowkey;
        }

        if (firstFieldLength == Bytes.SIZEOF_INT) {
            Bytes.putInt(rowkey, 0, Integer.reverse(Bytes.toInt(rowkey, 0)));
        } else {
            Bytes.putLong(rowkey, 0, Long.reverse(Bytes.toLong(rowkey, 0)));
        }
        return rowkey;
    }
}
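
Note that Integer.reverse and Long.reverse reverse the order of the bits, not of the bytes or decimal digits. The JDK-only sketch below (BitReversalDemo is a hypothetical name) shows the effect on consecutive values: they land in widely separated key ranges, which is presumably why the mapper applies it to the first rowkey field during skew treatment; that motive is an inference, not something the job's code states. Applying the reversal twice restores the original value.

```java
// Hypothetical demo of the bit reversal applied to the first rowkey field.
final class BitReversalDemo {
    static String reversedHex(int v) {
        return Integer.toHexString(Integer.reverse(v));
    }

    public static void main(String[] args) {
        // Consecutive ids differ in their low bits, so after reversal they differ in their high bits.
        System.out.println(reversedHex(1)); // 80000000
        System.out.println(reversedHex(2)); // 40000000
        System.out.println(reversedHex(3)); // c0000000
        System.out.println(reversedHex(4)); // 20000000

        // The operation is its own inverse.
        System.out.println(Integer.reverse(Integer.reverse(12345)) == 12345); // true
        System.out.println(Long.reverse(Long.reverse(12345L)) == 12345L);     // true
    }
}
```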

The custom hexadecimal encoding and decoding utility:

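public class StringHexUtils {
    private StringHexUtils() {
        // Static utility class, not meant to be instantiated.
    }

    /**
     * Converts a byte array to a lowercase hexadecimal string, two characters per byte.
     *
     * @param src the bytes to encode; may be null or empty
     * @return the hexadecimal string; an empty string for null or empty input, so that
     *         callers concatenating the result never emit the literal text "null"
     */
    public static String bytesToHexString(byte[] src) {
        if (src == null || src.length == 0) {
            return "";
        }
        StringBuilder stringBuilder = new StringBuilder(src.length * 2);
        for (byte b : src) {
            String hv = Integer.toHexString(b & 0xFF);
            if (hv.length() < 2) {
                // Left-pad with "0" so that every byte takes exactly two characters.
                stringBuilder.append('0');
            }
            stringBuilder.append(hv);
        }
        return stringBuilder.toString();
    }

    /**
     * Converts a hexadecimal string back to a byte array.
     *
     * @param src the hexadecimal string; an odd-length string is left-padded with "0"
     * @return the decoded bytes; an empty array for null or empty input
     */
    public static byte[] HexStringToBytes(String src) {
        if (src == null || src.isEmpty()) {
            return new byte[0];
        }
        src = src.length() % 2 != 0 ? "0" + src : src;

        byte[] b = new byte[src.length() / 2];
        for (int i = 0; i < b.length; i++) {
            int index = i * 2;
            // Each pair of hexadecimal digits is one byte.
            b[i] = (byte) Integer.parseInt(src.substring(index, index + 2), 16);
        }
        return b;
    }
}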

4.2 Result

Resulting text file:

19691900000000000000000255560000000001348b350000000100000001	ChangeRate:000000010000000000000000	ImpactIndex:00000001000000000000000OrdCustNum:000000010000000000000000	OrdCustNumDaliyRate:000000010000000000000000	Value:000000010000000000000000	
19691900000000000000000255560000000001348b350000000100000002 ChangeRate:000000010000000000000000 ImpactIndex:00000001000000000000000OrdCustNum:000000010000000000000000 OrdCustNumDaliyRate:000000010000000000000000 Value:000000010000000000000000
19691900000000000000000255560000000001348b350000000100000003 ChangeRate:000000010000000000000000 ImpactIndex:00000001000000000000000OrdCustNum:000000010000000000000000 OrdCustNumDaliyRate:000000010000000000000000 Value:000000010000000000000000
19691900000000000000000255560000000001348b350000000100000004 ChangeRate:000000010000000000000000 ImpactIndex:00000001000000000000000OrdCustNum:000000010000000000000000 OrdCustNumDaliyRate:000000010000000000000000 Value:000000010000000000000000
19691900000000000000000255560000000001348b350000000100000005 ChangeRate:000000010000000000000000 ImpactIndex:00000001000000000000000OrdCustNum:000000010000000000000000 OrdCustNumDaliyRate:00000001bff0000000000000 Value:000000010000000000000000
19691900000000000000000255560000000001348b350000000200000001 ChangeRate:000000010000000000000000 ImpactIndex:00000001000000000000000OrdCustNum:000000010000000000000000 OrdCustNumDaliyRate:000000010000000000000000 Value:000000010000000000000000
19691900000000000000000255560000000001348b350000000200000002 ChangeRate:000000010000000000000000 ImpactIndex:00000001000000000000000OrdCustNum:000000010000000000000000 OrdCustNumDaliyRate:000000010000000000000000 Value:000000010000000000000000
19691900000000000000000255560000000001348b350000000200000003 ChangeRate:000000010000000000000000 ImpactIndex:00000001000000000000000OrdCustNum:000000010000000000000000 OrdCustNumDaliyRate:000000010000000000000000 Value:000000010000000000000000
19691900000000000000000255560000000001348b350000000200000004 ChangeRate:000000010000000000000000 ImpactIndex:00000001000000000000000OrdCustNum:000000010000000000000000 OrdCustNumDaliyRate:000000010000000000000000 Value:000000010000000000000000
19691900000000000000000255560000000001348b350000000200000005 ChangeRate:000000010000000000000000 ImpactIndex:00000001000000000000000OrdCustNum:000000010000000000000000 OrdCustNumDaliyRate:00000001bff0000000000000 Value:000000010000000000000000
19691900000000000000000255560000000001348b360000000100000001 ChangeRate:000000010000000000000000 ImpactIndex:00000001000000000000000OrdCustNum:000000010000000000000000 OrdCustNumDaliyRate:000000010000000000000000 Value:000000010000000000000000
19691900000000000000000255560000000001348b360000000100000002 ChangeRate:000000010000000000000000 ImpactIndex:00000001000000000000000OrdCustNum:000000010000000000000000 OrdCustNumDaliyRate:000000010000000000000000 Value:000000010000000000000000
19691900000000000000000255560000000001348b360000000100000003 ChangeRate:000000010000000000000000 ImpactIndex:00000001000000000000000OrdCustNum:000000010000000000000000 OrdCustNumDaliyRate:000000010000000000000000 Value:000000010000000000000000
19691900000000000000000255560000000001348b360000000100000004 ChangeRate:000000010000000000000000 ImpactIndex:00000001000000000000000OrdCustNum:000000010000000000000000 OrdCustNumDaliyRate:000000010000000000000000 Value:000000010000000000000000

The output is clean and regular. When writing it back to HBase, the following mapper reads and writes it correctly:

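import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LoadHbaseMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    // Column family of the target table.
    private static final byte[] FAMILY = Bytes.toBytes("d");
    private final ImmutableBytesWritable writeKey = new ImmutableBytesWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value == null) {
            return;
        }

        // String.split() drops trailing empty strings, so the trailing "\t" adds no extra element.
        String[] items = value.toString().split("\t");
        if (items.length == 0 || items[0].trim().isEmpty()) {
            return;
        }

        byte[] rowkey = StringHexUtils.HexStringToBytes(items[0]);
        Put put = new Put(rowkey);

        // items[0] is the rowkey; iterate up to and including the last element so the final column is kept.
        for (int i = 1; i < items.length; i++) {
            // The hexadecimal value never contains ':', so the last colon separates qualifier and value.
            int sep = items[i].lastIndexOf(':');
            if (sep <= 0) {
                continue; // skip malformed entries that have no qualifier
            }
            put.addColumn(FAMILY, Bytes.toBytes(items[i].substring(0, sep)),
                    StringHexUtils.HexStringToBytes(items[i].substring(sep + 1)));
        }

        writeKey.set(rowkey);
        context.write(writeKey, put);
    }
}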
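
One detail of the line format is easy to get wrong: each line ends with a trailing tab, but Java's String.split discards trailing empty strings, so the last array element is the last column rather than an empty string. The JDK-only sketch below (HexLineParserDemo and parseColumns are hypothetical names) parses a line in the format produced in section 4.1 without any HBase dependency.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo: parsing "rowkeyHex \t col:hex \t col:hex \t" lines.
final class HexLineParserDemo {
    static Map<String, String> parseColumns(String line) {
        String[] items = line.split("\t"); // trailing empty strings are dropped
        Map<String, String> columns = new LinkedHashMap<>();
        for (int i = 1; i < items.length; i++) { // items[0] is the rowkey
            int sep = items[i].lastIndexOf(':');
            if (sep <= 0) {
                continue; // no qualifier, skip
            }
            columns.put(items[i].substring(0, sep), items[i].substring(sep + 1));
        }
        return columns;
    }

    public static void main(String[] args) {
        String line = "0a\tChangeRate:01\tValue:02\t";
        System.out.println(line.split("\t").length); // 3
        System.out.println(parseColumns(line));      // {ChangeRate=01, Value=02}
    }
}
```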