
[PingCAP] Building a Large Language Model Assistant for Enterprise Users (PDF)

Uploaded by: 張** | ID: 153216 | 2024-01-15 | PDF | 52 pages | 7.88 MB

Part of the collection: 2023 Global AI Frontier Technology Conference (Beijing), guest speaker slide decks


Building a Large Language Model Assistant for Enterprise Users
Li Li (李粒), Head of PingCAP AI Lab

Contents
  Part 1: Introduction
  Part 2: First Attempt
  Part 3: Optimization

Part 1: Introduction

Large language models (LLMs) and private or enterprise data

Knowledge insertion paradigm 1: pre-training
Build a transformer model with 1 billion to 100 billion parameters.
Sample corpus text:
  TiDB is an open-source NewSQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. It is MySQL compatible and can provide horizontal scalability, strong consistency, and high availability. It is developed and supported primarily by PingCAP and licensed under Apache 2.0, though it is also available as a paid product. TiDB drew its initial design inspiration from Google's Spanner and F1 papers.
Ingredients: GPU, Dataset, Parallel, Optimizer, RL

Knowledge insertion paradigm 2: fine-tuning
Integrate the knowledge into the weights of a deep neural network. The same TiDB sample text serves as training input.
Techniques: FFT, PEFT, LoRA

Knowledge insertion paradigm 3: in-context learning or retrieval-augmented generation (RAG)
Put the context into the prompt. The same TiDB sample text becomes a document fragment inside the prompt.
Prompt:
  Some facts:
  - You are a professional assistant named TiDB Bot which can answer customer questions related to TiDB and TiDB Cloud.
  The document fragments:
  TiDB is an open-source ...
  Given the context, answer the following questions:
  question_from_user

Comparison of the knowledge insertion paradigms

  Paradigm                                         Data required            Implementation cycle
  Pre-training                                     45 TB                    at least 3 months
  Fine-tuning, Full Fine-Tuning                    more than 100k samples   days
  Fine-tuning, Parameter-Efficient Fine-Tuning     10k-100k samples         hours
  In-context learning or RAG                       more than 1 sample       minutes

Retrieval-augmented generation (RAG)
1. Loading: bringing your data from where it lives (text files, PDFs, other websites, a database, or an API) into your processing pipeline.
2. Indexing: creating a data structure that allows the data to be queried. For LLMs this almost always means creating vector embeddings, which are numerical representations of the meaning of the data, along with various other metadata strategies that make it easy to find contextually relevant data accurately.
3. Storing: once your data is indexed, you will want to store the index together with any other metadata so that you do not need to re-index.
4. Querying: for any given indexing strategy, you can use LLMs and RAG data structures to query in many ways, including sub-queries, multi-step queries, and hybrid strategies.
5. Evaluation: a critical step in any pipeline is checking how effective it is relative to other strategies, or after you make changes. Evaluation provides objective measures of how accurate, faithful, and fast your responses to queries are.

Part 2: First Attempt

A generic RAG application (diagram)

Requirements for TiDB Bot
- Engage in multi-turn conversations, understand user queries, and provide accurate answers.
- Ensure that responses accurately draw on knowledge of TiDB and TiDB Cloud.

TiDB Bot's RAG framework (diagram)

First attempt: some results
The first version was checked against the two requirements above.
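The loading, indexing, storing, and querying stages, together with the prompt template from Part 1, can be sketched in a few lines. This is a minimal illustration and not the TiDB Bot implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the in-memory list stands in for a vector database, the three sample chunks are illustrative data, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Prompt template following the slide in Part 1.
PROMPT_TEMPLATE = (
    "Some facts:\n"
    "- You are a professional assistant named TiDB Bot which can answer "
    "customer questions related to TiDB and TiDB Cloud.\n"
    "The document fragments:\n"
    "{fragments}\n"
    "Given the context, answer the following questions:\n"
    "{question}"
)


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(chunks):
    """Indexing and storing: keep every chunk next to its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, question, top_k=2):
    """Querying: return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(index, question, top_k=2):
    """Put the retrieved fragments and the user question into the template."""
    fragments = "\n".join(retrieve(index, question, top_k))
    return PROMPT_TEMPLATE.format(fragments=fragments, question=question)


# Loading: illustrative sample chunks, assumed for this sketch.
chunks = [
    "TiDB is an open-source NewSQL database that supports HTAP workloads.",
    "TiKV is a distributed transactional key-value store.",
    "TiDB Cloud is a managed service for TiDB.",
]
index = build_index(chunks)
prompt = build_prompt(index, "how to use TiDB Cloud?", top_k=1)
```

The resulting `prompt` string is what would be sent to the LLM. The evaluation stage is left out here because it needs labeled questions.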

First attempt: some results, but not many
- The bot answers questions unrelated to TiDB.
- OpenAI's embedding model does not support multilingual corpora well.
- Retrieval results are not accurate enough.
- Other issues.

Part 3: Optimization

Problems to solve
1. The bot answers questions unrelated to TiDB.
2. OpenAI's embedding model does not support multilingual corpora well.
3. Retrieval results are not accurate enough.

Problem 1: answering questions unrelated to TiDB

Alignment
Make artificial general intelligence (AGI) aligned with human values and have it follow human intent.
Source: OpenAI, 2022, Our approach to alignment research

Toxicity (harmful content)
Contemporary text generation models can generate harmful language, including hate speech, insults, profanities, and threats. These harms are usually grouped under the umbrella term "toxicity".
Source: DeepMind, 2021, Challenges in Detoxifying Language Models

Toxicity reduction techniques for language models
- Plug and Play Language Models (PPLM)

Using an LLM to detect harmful content
- Model: GPT 3.5 or GPT 4
- Typical scenarios (diagram)
- Jailbreaks: DAN, Authorized User, Sudo Mode. With these, the restrictions become ineffective.
- Why not simply state the output restrictions in the System Prompt? Because the System Prompt carries no special weight.
- Continuous operation: keep feeding GPT 3.5 or GPT 4 with few-shot examples such as the following.

  instruction: who is Lady Gaga?
  question: is the instruction out of scope (not related with TiDB)?
  answer: YES

  instruction: how to deploy a TiDB cluster?
  question: is the instruction out of scope (not related with TiDB)?
  answer: NO

  instruction: how to use TiDB Cloud?
  question: is the instruction out of scope (not related with TiDB)?
  answer: NO
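The few-shot scope check above can be assembled and parsed along the following lines. This is a sketch under assumptions: the helper names are hypothetical, the model call itself is deliberately left out (the prompt would be sent to GPT 3.5 or GPT 4 by whatever client is in use), and the parsing rule for the completion is an illustrative choice rather than anything stated in the slides.

```python
# Few-shot examples taken from the slide.
FEW_SHOT_EXAMPLES = [
    ("who is Lady Gaga?", "YES"),
    ("how to deploy a TiDB cluster?", "NO"),
    ("how to use TiDB Cloud?", "NO"),
]
SCOPE_QUESTION = "is the instruction out of scope (not related with TiDB)?"


def build_scope_prompt(instruction, examples=FEW_SHOT_EXAMPLES):
    """Build the few-shot prompt, ending with an open 'answer:' for the model."""
    blocks = [
        f"instruction: {example}\nquestion: {SCOPE_QUESTION}\nanswer: {label}"
        for example, label in examples
    ]
    blocks.append(f"instruction: {instruction}\nquestion: {SCOPE_QUESTION}\nanswer:")
    return "\n\n".join(blocks)


def parse_scope_answer(completion):
    """Return True if out of scope, False if in scope, None if unparseable."""
    words = completion.strip().upper().split()
    first = words[0].strip(".,:;!") if words else ""
    if first == "YES":
        return True
    if first == "NO":
        return False
    return None


scope_prompt = build_scope_prompt("what is the weather today?")
```

Returning `None` for anything other than YES or NO lets the caller decide how to treat an unclear verdict, for example by refusing to answer. New examples collected during operation can be appended to `FEW_SHOT_EXAMPLES`, which matches the "continuous operation" idea.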

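Problem 2: the embedding model's multilingual support is weak

Example with text-embedding-ada-002
  Content embedded in the vector database:
    - 調整自動統計任務 (Chinese: "Adjust automatic statistics tasks")
    - Introduction to Statistics
    - Statistics system tables
    - 拡張統計學紹介 (Japanese: "Introduction to extended statistics")
  Query: Adjust automatic statistic tasks.
  Top 1: Introduction to Statistics
  Top 2: Statistics system tables
The Chinese document, whose title means the same as the query, does not appear in the top results.
Source: https:/ (link truncated)

Self-hosted embedding model: choosing an open-source embedding model
  sentence-transformers/paraphrase-multilingual-mpnet-base-v2
Source: https:/ [...] Lab, 2021, BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

Self-hosted embedding model: training

MultipleNegativesRankingLoss
Source: Google, 2017, Efficient Natural Language Response Suggestion for Smart Reply
MultipleNegativesRankingLoss is a good loss function if you only have positive pairs, for example pairs of similar texts such as paraphrase pairs, duplicate-question pairs, (query, response) pairs, or (source_language, target_language) pairs.

Positive (chunk, question) pairs:
  Chunk   Question
  C1      Q1
  C2      Q2
  C3      Q3

Multiple negatives, shown for chunk C1 (Q1 is the positive, Q2 and Q3 act as negatives):
  Chunk   Question
  C1      Q1
  C1      Q2
  C1      Q3

Pairing every chunk with every question turns 4k positive pairs into 16m combinations.

  loss = cos_sim(chunk_vec, neg_vec) - cos_sim(chunk_vec, pos_vec)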
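To make the in-batch negatives idea concrete, here is a small pure-Python sketch. `slide_loss` averages the slide's expression, cos_sim(chunk, neg) minus cos_sim(chunk, pos), over every other question in the batch. `softmax_ranking_loss` is the cross-entropy form over scaled cosine similarities, which, as far as the sentence-transformers documentation describes it, is how the library's MultipleNegativesRankingLoss is defined (the scale of 20 is taken from that documentation). The tiny 3-dimensional vectors are made up for illustration, and neither function is claimed to be the training code used in the talk.

```python
import math


def cos_sim(u, v):
    """Cosine similarity between two dense vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def slide_loss(chunk_vecs, question_vecs):
    """Mean of cos_sim(chunk, neg) - cos_sim(chunk, pos) over in-batch negatives.

    question_vecs[i] is the positive for chunk_vecs[i]; every other
    question in the batch is treated as a negative for that chunk.
    """
    diffs = []
    for i, chunk in enumerate(chunk_vecs):
        pos = cos_sim(chunk, question_vecs[i])
        for j, question in enumerate(question_vecs):
            if j != i:
                diffs.append(cos_sim(chunk, question) - pos)
    return sum(diffs) / len(diffs)


def softmax_ranking_loss(chunk_vecs, question_vecs, scale=20.0):
    """Cross-entropy over scaled cosine similarities, positive as the target."""
    losses = []
    for i, chunk in enumerate(chunk_vecs):
        logits = [scale * cos_sim(chunk, q) for q in question_vecs]
        top = max(logits)  # subtract the max for numerical stability
        log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)


# Made-up vectors: each question points roughly in its chunk's direction.
chunk_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_questions = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
# Rotating the questions breaks the pairing, so the loss should go up.
shuffled_questions = aligned_questions[1:] + aligned_questions[:1]
```

Both functions are lower when each chunk is closest to its own question, which is the behaviour training pushes the embedding model towards.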

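Self-hosted embedding model: test results

  Model                                               Accuracy on top 5
  text-embedding-ada-002                              0.984
  Base (xlm-roberta-base)                             0.232
  Trained (xlm-roberta-base)                          0.937
  Base (paraphrase-multilingual-mpnet-base-v2)        0.832
  Trained (paraphrase-multilingual-mpnet-base-v2)     0.983

After training, paraphrase-multilingual-mpnet-base-v2 (0.983) is on par with text-embedding-ada-002 (0.984).

Self-hosted embedding model: the trained paraphrase-multilingual-mpnet-base-v2
  Content embedded in the vector database:
    - 調整自動統計任務
    - Introduction to Statistics
    - Statistics system tables
    - 拡張統計學紹介
  Query: Adjust automatic statistic tasks.
  Top 1: 調整自動統計任務
  Top 2: Introduction to Statistics
  Top 3: 拡張統計學紹介
  Top 4: Statistics system tables

Self-hosted embedding model: other benefits
- GenQ cold start: no human effort is required.
- Fast: processing 1.3 MB (5,000 documents) takes only about 2 hours.
- Low cost: processing 1.3 MB (5,000 documents) costs 4.08 USD.
- No "OpenAI tax": in the text vectorization stage, a self-hosted embedding model costs 100 times less than OpenAI.
- Higher stability and better response performance.
- The model can be trained on a custom dataset and updated continuously in the future.

Problem 3: retrieval results are not accurate enough

Feedback on the first 300 questions (chart with the categories Null, Like, Dislike)

  Category                    Count   Percentage
  Out of answering scope      9       8.74%
  Wrong retrieval results     60      58.25%
  No relevant documents       23      22.33%
  LLM hallucination           11      10.68%
  Total                       103     -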

Wrong retrieval results

The trained paraphrase-multilingual-mpnet-base-v2 reaches a top-5 accuracy of 0.983 in the test results, yet wrong retrieval is the largest feedback category. Why? AI is a game of uncertainty.

Cause 1: multi-turn conversations
  Human: Q1
  Bot:   A1
  Human: how to fix/enable/use it?
  Bot:   xxxxxxxx (almost random)
A follow-up like this names no subject, so retrieving on it alone returns almost random chunks.

Fix: revise the follow-up into a standalone question before retrieval.
  Human: Q1
  Bot:   A1
  Human: how to use it? (Revised Chat: how to use TiDB Cloud?)
  Bot:   A2

Cause 2: the right chunk is ranked low
  Human: What's TiDBCloud?

  Retrieve:
  Rank   Chunk              Score
  1      TiKV is ...        0.91
  2      TiDB is ...        0.87
  3      TiFlash is ...     0.83
  4      PD is ...          0.81
  5      TiUP is ...        0.79
  6      TiDB Cloud is ...  0.77

The right corpus is at Rank 6.
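The rank table shows why a query can count as a miss under a top-5 metric even though the right chunk was retrieved. A minimal sketch of such a hit check follows; the function names are hypothetical, the second query is invented for illustration, and the talk does not say how its "Accuracy on top 5" figure was computed, so this is only one plausible reading of that metric.

```python
def hit_at_k(ranked_chunks, relevant_chunk, k=5):
    """True if the relevant chunk appears among the first k retrieved chunks."""
    return relevant_chunk in ranked_chunks[:k]


def accuracy_at_k(results, k=5):
    """Fraction of (ranked_chunks, relevant_chunk) pairs that hit within top k."""
    hits = sum(1 for ranked, relevant in results if hit_at_k(ranked, relevant, k))
    return hits / len(results)


# Ranking from the slide for the question "What's TiDBCloud?".
tidb_cloud_ranking = [
    "TiKV is ...",
    "TiDB is ...",
    "TiFlash is ...",
    "PD is ...",
    "TiUP is ...",
    "TiDB Cloud is ...",
]

# Invented second query whose relevant chunk is ranked first.
results = [
    (tidb_cloud_ranking, "TiDB Cloud is ..."),
    (["TiKV is ...", "PD is ..."], "TiKV is ..."),
]
top5 = accuracy_at_k(results, k=5)
top6 = accuracy_at_k(results, k=6)
```

With k=5 the TiDB Cloud query is a miss because its chunk sits at rank 6, while with k=6 it would be a hit.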

Cause 3: the rich diversity of real-world questions
- Why can't TiKV do table join, but TiFlash can do table join?
- What is the reason for the OOM problem when Spark SQL queries a partitioned table of TiDB, but not when querying another table with the same data but without partitions?
- What is the compression rate of DB2 data to TiDB?
- Explain the error message: Region error (will back off and retry) message: peer is not leader for region 39390269, leader may Some(id: 39390393 store_id: 34348588) not_leader region_id: 39390269 leader id: 39390393 store_id: 34348588
- What does the error ERROR 8202 (HY000): Cannot decode index value, because types: 1292 Incorrect time value: 0 0 0 0 0 0 1554 mean?
- What does item refer to in the previous message? Is it a row or a column?
- What do I have to tune to optimize the index management?
- Please add the concurrency option to the tidb.toml file.
- How can I see the current value of tidb_distsql_scan_concurrency?
- What could be the issues regarding statistics on tables if I restart 1 TiDB during working hours with heavy load?

Fix: ReRank
  Human: What's TiDBCloud?

  After Retrieve:
  Rank   Chunk              Score
  1      TiKV is ...        0.91
  2      TiDB is ...        0.87
  3      TiFlash is ...     0.83
  4      PD is ...          0.81
  5      TiUP is ...        0.79
  6      TiDB Cloud is ...  0.77

  After ReRank:
  Rank   Chunk              Score
  1      TiDB Cloud is ...  0.94
  2      TiKV is ...        0.91
  3      TiDB is ...        0.87
  4      TiFlash is ...     0.83
  5      PD is ...          0.81
  6      TiUP is ...        0.79

How ReRank combines two sources
  Documentation Corpora: the retrieved chunks and scores shown under "After Retrieve".
  Adjusted Question-Chunk Pairs:
  Rank   Question              Chunk              Score
  1      What is TiDB Cloud?   TiDB Cloud is ...  1
  2      ...                   ...                0.87
  3      ...                   ...                0.83
  4      ...                   ...                0.81
  5      ...                   ...                0.79
  6      ...                   ...                0.77
  Merge and sort the two result sets, then take the top 5 chunks by total score.

Other optimizations
  Out of answering scope:   add retrieval-augmented generation (RAG) to the harmful-content detection.
  No relevant documents:    automatically track the latest official documentation and update the vector database promptly.
  LLM hallucination:        hopeful about the future.

Overall architecture (diagram)

The only way to achieve Product-Market Fit is agile iteration.

Launched on all channels: TiDB Cloud, TiDB Community, Slack

Thanks
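As an illustration of the "merge and sort, then take the top 5 by total score" step in the ReRank section, the sketch below merges hits from the documentation corpora with hits from the adjusted question-chunk pairs. The slides do not state how the two scores are combined, so summing per chunk is purely an assumption here, as are the function name and the single question-chunk hit used as input. In particular, this rule does not reproduce the 0.94 shown on the slide for the TiDB Cloud chunk.

```python
def merge_and_rerank(doc_hits, qa_hits, top_k=5):
    """Merge two lists of (chunk, score) hits and return the top_k by total score.

    Assumption: a chunk's total score is the sum of its scores across
    both sources. The talk does not specify the combination rule.
    """
    totals = {}
    for chunk, score in list(doc_hits) + list(qa_hits):
        totals[chunk] = totals.get(chunk, 0.0) + score
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hits from the documentation corpora, as listed on the slide.
doc_hits = [
    ("TiKV is ...", 0.91),
    ("TiDB is ...", 0.87),
    ("TiFlash is ...", 0.83),
    ("PD is ...", 0.81),
    ("TiUP is ...", 0.79),
    ("TiDB Cloud is ...", 0.77),
]
# The one adjusted question-chunk pair whose chunk is legible on the slide:
# "What is TiDB Cloud?" -> "TiDB Cloud is ..." with score 1.
qa_hits = [("TiDB Cloud is ...", 1.0)]

reranked = merge_and_rerank(doc_hits, qa_hits, top_k=5)
```

Under this assumed rule the TiDB Cloud chunk moves from rank 6 to rank 1 and the lowest-scoring chunk drops out of the top 5, which matches the direction of the change shown in the slides.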


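This document discusses TiDB Bot, the large language model assistant built by the team led by Li Li, head of PingCAP AI Lab, and how it was applied and optimized with respect to multi-turn conversation, knowledge insertion paradigms, and retrieval-augmented generation. TiDB Bot is meant to take part in multi-turn conversations, understand user queries, and give accurate answers grounded in knowledge of TiDB and TiDB Cloud. The first version had several problems: OpenAI's embedding model did not support multilingual corpora well, retrieval results were not accurate enough, and the bot answered questions unrelated to TiDB. The improvements described include a self-hosted embedding model and LLM-based detection of harmful or out-of-scope requests, together with a ReRank step that merges results from the documentation corpora with adjusted question-chunk pairs to improve retrieval accuracy. Overall, TiDB Bot has had some success in helping users understand and use TiDB and TiDB Cloud, but it still needs further improvement.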
