大模型_SuperSonic项目本地实战调试笔记

大模型_SuperSonic项目本地实战调试笔记

一、相关工程

https://github.com/tencentmusic/supersonic.git

二、环境配置

本地运行OLLAMA并配置大模型

Model Name:qwen:0.5b

Base URL:http://127.0.0.1:11434/

配置OLLAMA运行的Embedding向量模型

Model Name:mofanke/dmeta-embedding-zh:latest

三、调试解读

1.周期调度加载HanLP词典

com.tencent.supersonic.headless.server.service.impl.DictTaskServiceImpl#dailyDictTask

com.tencent.supersonic.headless.server.task.DictionaryReloadTask#reloadKnowledge

2.周期调度加载Embedding向量量库

com.tencent.supersonic.headless.server.task.MetaEmbeddingTask#reloadMetaEmbedding

3.对话框指标维度中文名自动检索补全

com.tencent.supersonic.chat.server.rest.ChatQueryController#search

  • 自然语言分词
  • 基于HanLP词典前后缀识别联想关键词

img

4.自然语言解析NL2DSL2SQL

com.tencent.supersonic.chat.server.rest.ChatQueryController#parse

img

关键词匹配Mapper

  • 基于Embedding向量知识库EmbeddingMapper语义识别
  • 基于HanLP词典HanlpDictMapper前后缀识别
  • 基于汉明距离FuzzyNameMapper模糊识别

DSL生成Parser

  • 基于LLM和‘范例+规则+元数据关键词’提示的LLMSqlParser
  • 基于查询模式规则匹配和元数据关键词的RuleSqlParser

SQL语法修正Corrector

  • 元数据结构完整性修正的SchemaCorrector
  • 时间过滤条件修正的TimeCorrector
  • select、where、groupby、having子句预发修正的GrammarCorrector

img

规则化生成的请求LLM的prompt范例:

1
2
3
4
5
6
7
8
9
10
Prompt { text = "
#Role: You are a data analyst experienced in SQL languages.#Task: You will be provided with a natural language question asked by users,please convert it to a SQL query so that relevant data could be returned by executing the SQL query against underlying database.
#Rules:1.ALWAYS generate column specified in the `Schema`, DO NOT hallucinate.2.ALWAYS specify date filter using `>`,`<`,`>=`,`<=` operator.3.ALWAYS calculate the absolute date range by yourself.4.DO NOT include date filter in the where clause if not explicitly expressed in the `Question`.5.DO NOT miss the AGGREGATE operator of metrics, always add it if needed.6.ONLY respond with the converted SQL statement.
#Exemplars:
Question:比较jackjchen和robinlee今年以来的访问次数,Schema:Table=[超音数产品], Metrics=[<访问次数 ALIAS 'pv' COMMENT '一段时间内用户的访问次数' AGGREGATE 'SUM'>,<访问用户数 ALIAS 'UV,访问人数,' COMMENT '访问的用户个数' AGGREGATE 'COUNT'>,<人均访问次数 ALIAS '平均访问次数,' COMMENT '每个用户平均访问的次数'>], Dimensions=[<数据日期>], Values[<用户='jackjchen'>,<用户='robinlee'>],SideInfo:CurrentDate=[2020-12-01],DomainTerms=[<核心用户 COMMENT '核心用户指tom和lucy'>],PriorKnowledge=[数据日期 是分区时间且格式是yyyy-MM-dd],SQL:SELECT 用户, 访问次数 FROM 超音数产品 WHERE 用户 IN ('jackjchen', 'robinlee') AND 数据日期 >= '2020-01-01' AND 数据日期 <= '2020-12-01'
Question:超音数过去90天美术部、技术研发部的访问时长,Schema:Table=[超音数产品], Metrics=[<访问时长 COMMENT '一段时间内用户的访问时长' AGGREGATE 'SUM'>], Dimensions=[<数据日期>], Values=[<部门='美术部'>,<部门='技术研发部'>],SideInfo:CurrentDate=[2023-04-21],PriorKnowledge=[数据日期 是分区时间且格式是yyyy-MM-dd],SQL:SELECT 部门, 访问时长 FROM 超音数产品 WHERE 部门 IN ('美术部', '技术研发部') AND 数据日期 >= '2023-01-20' AND 数据日期 <= '2023-04-21'
Question:过去半个月核心用户的访问次数,Schema:Table=[超音数产品], Metrics=[<访问次数 ALIAS 'pv' COMMENT '一段时间内用户的访问次数' AGGREGATE 'SUM'>], Dimensions=[<部门>,<数据日期>], Values=[],SideInfo:CurrentDate=[2023-09-15],DomainTerms=[<核心用户 COMMENT '用户为alice'>],PriorKnowledge=[数据日期 是分区时间且格式是yyyy-MM-dd],SQL:SELECT 用户,SUM(访问次数) FROM 超音数产品 WHERE 用户='alice' AND 数据日期 >= '2023-09-01' AND 数据日期 <= '2023-09-15'
Question:过去半个月忠实用户有哪一些,Schema:Table=[超音数产品], Metrics=[<访问次数 ALIAS 'pv' COMMENT '一段时间内用户的访问次数' AGGREGATE 'SUM'>], Dimensions=[<用户>,<数据日期>], Values=[],SideInfo:CurrentDate=[2023-09-15],DomainTerms=[<忠实用户 COMMENT '一段时间内总访问次数大于100的用户'>],PriorKnowledge=[数据日期 是分区时间且格式是yyyy-MM-dd],SQL:SELECT 用户 FROM 超音数产品 WHERE 数据日期 >= '2023-09-01' AND 数据日期 <= '2023-09-15' GROUP BY 用户 HAVING SUM(访问次数) > 100
Question:超音数访问次数大于1k的部门是哪些,Schema:Table=[超音数产品], Metrics=[<访问次数 ALIAS 'pv' COMMENT '一段时间内用户的访问次数' AGGREGATE 'SUM'>], Dimensions=[<部门>,<数据日期>], Values=[],SideInfo:CurrentDate=[2023-09-14],PriorKnowledge=[数据日期 是分区时间且格式是yyyy-MM-dd],SQL:SELECT 部门 FROM 超音数产品 WHERE 访问次数 > 1000
Question:sales 访问次数,Schema:Table=[超音数数据集], Metrics=[<访问次数 COMMENT '一段时间内用户的访问次数' AGGREGATE 'SUM'>,<访问用户数 ALIAS 'UV,访问人数,' COMMENT '访问的用户个数' AGGREGATE 'COUNT'>,<人均访问次数 ALIAS '平均访问次数,' COMMENT '每个用户平均访问的次数'>], Dimensions=[<数据日期>], Values=[<部门='sales'>],SideInfo:CurrentDate=[2024-10-12],PriorKnowledge=[数据日期 是分区时间且格式是yyyy-MM-dd],SQL:" }

5.执行SQL获取数据查询结果

com.tencent.supersonic.chat.server.rest.ChatQueryController#execute

四、改造适配

使用本地mysql

替换h2内存数据库为mysql,避免每次重启配置刷新

1
2
3
mysql -u root -p supersonic_v1 < /Users/jiazhengyang3/Desktop/schema-mysql.sql

mysql -u root -p supersonic_v1 < /Users/jiazhengyang3/Desktop/data-mysql.sql

换向量模型

可以在前端页面编辑保存,也可以直接编辑mysql数据库中s2_system_config表的parameters字段。通过debug接口/api/semantic/knowledge/meta/embedding/reload验证在加载向量库数据时是不是使用了变更配置的向量模型mofanke/dmeta-embedding-zh:latest。

1
[{"candidateValues":["OPEN_AI","AZURE","OLLAMA","QIANFAN","ZHIPU","LOCAL_AI","DASHSCOPE"],"comment":"接口协议","dataType":"list","defaultValue":"OPEN_AI","description":"","module":"对话模型配置","name":"s2.chat.model.provider","value":"OLLAMA"},{"comment":"BaseUrl","dataType":"string","defaultValue":"https://api.openai.com/v1","dependencies":[{"name":"s2.chat.model.provider","setDefaultValue":{"OPEN_AI":"https://api.openai.com/v1","AZURE":"https://your-resource-name.openai.azure.com/","OLLAMA":"http://localhost:11434","QIANFAN":"https://aip.baidubce.com","ZHIPU":"https://open.bigmodel.cn/","LOCAL_AI":"http://localhost:8080","DASHSCOPE":"https://dashscope.aliyuncs.com/api/v1"},"show":{"includesValue":["OPEN_AI","AZURE","OLLAMA","QIANFAN","ZHIPU","LOCAL_AI","DASHSCOPE"]}}],"description":"","module":"对话模型配置","name":"s2.chat.model.base.url","value":"http://localhost:11434"},{"comment":"Endpoint","dataType":"string","defaultValue":"llama_2_70b","dependencies":[{"name":"s2.chat.model.provider","setDefaultValue":{"QIANFAN":"llama_2_70b"},"show":{"includesValue":["QIANFAN"]}}],"description":"","module":"对话模型配置","name":"s2.chat.model.endpoint","value":"llama_2_70b"},{"comment":"ApiKey","dataType":"password","defaultValue":"demo","dependencies":[{"name":"s2.chat.model.provider","setDefaultValue":{"OPEN_AI":"demo","QIANFAN":"demo","ZHIPU":"demo","LOCAL_AI":"demo","AZURE":"demo","DASHSCOPE":"demo"},"show":{"includesValue":["OPEN_AI","QIANFAN","ZHIPU","LOCAL_AI","AZURE","DASHSCOPE"]}}],"description":"","module":"对话模型配置","name":"s2.chat.model.api.key","value":"demo"},{"comment":"SecretKey","dataType":"password","defaultValue":"demo","dependencies":[{"name":"s2.chat.model.provider","setDefaultValue":{"QIANFAN":"demo"},"show":{"includesValue":["QIANFAN"]}}],"description":"","module":"对话模型配置","name":"s2.chat.model.secretKey","value":"demo"},{"comment":"ModelName","dataType":"string","defaultValue":"gpt-3.5-turbo","dependencies":[{"name":"s2.chat.model.provider","setDefaultValue":{"OPEN_AI":"gpt-3.5-turbo","OLLAMA":"qwen:0.5b","QIANFAN":"Llama-2-70b-chat","ZHIPU":"glm-4","LOCAL_AI":"ggml-gpt4all-j","AZURE":"gpt-35-turbo","DASHSCOPE":"qwen-plus"},"show":{"includesValue":["OPEN_AI","AZURE","OLLAMA","QIANFAN","ZHIPU","LOCAL_AI","DASHSCOPE"]}}],"description":"","module":"对话模型配置","name":"s2.chat.model.name","value":"qwen:0.5b"},{"comment":"是否启用搜索增强功能,设为false表示不启用","dataType":"bool","defaultValue":"false","dependencies":[{"name":"s2.chat.model.provider","setDefaultValue":{"DASHSCOPE":"false"},"show":{"includesValue":["DASHSCOPE"]}}],"description":"","module":"对话模型配置","name":"s2.chat.model.enableSearch","value":"false"},{"comment":"Temperature","dataType":"slider","defaultValue":"0.0","description":"","module":"对话模型配置","name":"s2.chat.model.temperature","value":"0.0"},{"comment":"超时时间(秒)","dataType":"number","defaultValue":"60","description":"","module":"对话模型配置","name":"s2.chat.model.timeout","value":"60"},{"candidateValues":["IN_MEMORY","OPEN_AI","OLLAMA","AZURE","DASHSCOPE","QIANFAN","ZHIPU"],"comment":"接口协议","dataType":"list","defaultValue":"IN_MEMORY","description":"","module":"向量模型配置","name":"s2.embedding.model.provider","value":"OLLAMA"},{"comment":"BaseUrl","dataType":"string","defaultValue":"","dependencies":[{"name":"s2.embedding.model.provider","setDefaultValue":{"OPEN_AI":"https://api.openai.com/v1","OLLAMA":"http://localhost:11434","AZURE":"https://your-resource-name.openai.azure.com/","DASHSCOPE":"https://dashscope.aliyuncs.com/api/v1","QIANFAN":"https://aip.baidubce.com","ZHIPU":"https://open.bigmodel.cn/"},"show":{"includesValue":["OPEN_AI","OLLAMA","AZURE","DASHSCOPE","QIANFAN","ZHIPU"]}}],"description":"","module":"向量模型配置","name":"s2.embedding.model.base.url","value":"http://127.0.0.1:11434"},{"comment":"ApiKey","dataType":"password","defaultValue":"","dependencies":[{"name":"s2.embedding.model.provider","setDefaultValue":{"OPEN_AI":"demo","AZURE":"demo","DASHSCOPE":"demo","QIANFAN":"demo","ZHIPU":"demo"},"show":{"includesValue":["OPEN_AI","AZURE","DASHSCOPE","QIANFAN","ZHIPU"]}}],"description":"","module":"向量模型配置","name":"s2.embedding.model.api.key","value":""},{"comment":"SecretKey","dataType":"password","defaultValue":"demo","dependencies":[{"name":"s2.embedding.model.provider","setDefaultValue":{"QIANFAN":"demo"},"show":{"includesValue":["QIANFAN"]}}],"description":"","module":"向量模型配置","name":"s2.embedding.model.secretKey","value":"demo"},{"comment":"ModelName","dataType":"string","defaultValue":"bge-small-zh","dependencies":[{"name":"s2.embedding.model.provider","setDefaultValue":{"IN_MEMORY":"bge-small-zh","OPEN_AI":"text-embedding-ada-002","OLLAMA":"all-minilm","AZURE":"text-embedding-ada-002","DASHSCOPE":"text-embedding-v2","QIANFAN":"Embedding-V1","ZHIPU":"embedding-2"},"show":{"includesValue":["IN_MEMORY","OPEN_AI","OLLAMA","AZURE","DASHSCOPE","QIANFAN","ZHIPU"]}}],"description":"","module":"向量模型配置","name":"s2.embedding.model.name","value":"mofanke/dmeta-embedding-zh:latest"},{"comment":"模型路径","dataType":"string","defaultValue":"","dependencies":[{"name":"s2.embedding.model.provider","setDefaultValue":{"IN_MEMORY":""},"show":{"includesValue":["IN_MEMORY"]}}],"description":"","module":"向量模型配置","name":"s2.embedding.model.path","value":""},{"comment":"词汇表路径","dataType":"string","defaultValue":"","dependencies":[{"name":"s2.embedding.model.provider","setDefaultValue":{"IN_MEMORY":""},"show":{"includesValue":["IN_MEMORY"]}}],"description":"","module":"向量模型配置","name":"s2.embedding.model.vocabulary.path","value":""},{"candidateValues":["IN_MEMORY","MILVUS","CHROMA"],"comment":"向量库类型","dataType":"list","defaultValue":"IN_MEMORY","description":"目前支持三种类型:IN_MEMORY、MILVUS、CHROMA","module":"向量库配置","name":"s2.embedding.store.provider","value":"IN_MEMORY"},{"comment":"BaseUrl","dataType":"string","defaultValue":"","dependencies":[{"name":"s2.embedding.store.provider","setDefaultValue":{"MILVUS":"http://localhost:19530","CHROMA":"http://localhost:8000"},"show":{"includesValue":["MILVUS","CHROMA"]}}],"description":"","module":"向量库配置","name":"s2.embedding.store.base.url","value":""},{"comment":"ApiKey","dataType":"password","defaultValue":"","dependencies":[{"name":"s2.embedding.store.provider","setDefaultValue":{"MILVUS":"demo"},"show":{"includesValue":["MILVUS"]}}],"description":"","module":"向量库配置","name":"s2.embedding.store.api.key","value":""},{"comment":"DatabaseName","dataType":"string","defaultValue":"","dependencies":[{"name":"s2.embedding.store.provider","setDefaultValue":{"MILVUS":""},"show":{"includesValue":["MILVUS"]}}],"description":"","module":"向量库配置","name":"s2.embedding.store.databaseName","value":""},{"comment":"持久化路径","dataType":"string","defaultValue":"","dependencies":[{"name":"s2.embedding.store.provider","setDefaultValue":{"IN_MEMORY":""},"show":{"includesValue":["IN_MEMORY"]}}],"description":"默认不持久化,如需持久化请填写持久化路径。注意:如果变更了向量模型需删除该路径下已保存的文件或修改持久化路径","module":"向量库配置","name":"s2.embedding.store.persist.path","value":""},{"comment":"超时时间(秒)","dataType":"number","defaultValue":"60","description":"","module":"向量库配置","name":"s2.embedding.store.timeout","value":"60"},{"comment":"纬度","dataType":"number","defaultValue":"","dependencies":[{"name":"s2.embedding.store.provider","setDefaultValue":{"MILVUS":"384"},"show":{"includesValue":["MILVUS"]}}],"description":"","module":"向量库配置","name":"s2.embedding.store.dimension","value":""},{"comment":"是否将Mapper探测识别到的维度值提供给大模型","dataType":"bool","defaultValue":"true","description":"为了数据安全考虑, 这里可进行开关选择","module":"Parser相关配置","name":"s2.parser.linking.value.enable","value":"true"},{"comment":"few-shot样例个数","dataType":"number","defaultValue":"3","description":"样例越多效果可能越好,但token消耗越大","module":"Parser相关配置","name":"s2.parser.few-shot.number","value":"5"},{"comment":"self-consistency执行个数","dataType":"number","defaultValue":"1","description":"执行越多效果可能越好,但token消耗越大","module":"Parser相关配置","name":"s2.parser.self-consistency.number","value":"1"},{"comment":"解析结果展示个数","dataType":"number","defaultValue":"3","description":"前端展示的解析个数","module":"Parser相关配置","name":"s2.parser.show.count","value":"3"}]

换向量库

提交代码增加pgvector向量库选择https://github.com/tencentmusic/supersonic/pull/1800。可以在前端页面编辑选择pgvector向量库,也可以如换向量模型直接编辑mysql数据库中s2_system_config表的parameters字段。

img

换大数据库

服务器借助docker单点部署doris集群:

1
2
3
4
5
6
docker run -d -it --name=doris -p 9030:9030 -p 8030:8030 apache/doris:build-env-ldb-toolchain-latest /bin/bash

docker cp mysql-5.7.22-linux-glibc2.12-x86_64.tar.gz doris:/opt
docker cp apache-doris-1.2.4.1-bin-x86_64.tar.xz doris:/opt

hostname -i 172.17.0.2