The Cognite Atlas AI
LLM & SLM Benchmark Report for Industrial Agents

Copyright, Cognite, 2024
www.cognite.ai

Contents
The Industrial AI Problem
The Essential Role of Natural Language Search
Results & Analysis: Benchmarking Natural Language Search Models for Industry-Specific Tasks
Industrial Value is Accelerated by Industrial Agents
Methodology

The Industrial AI Problem

Language models often generate plausible but incorrect responses, highlighting a key challenge in developing trustworthy AI solutions for industry. This makes rigorous evaluation essential to ensure reliability, accuracy, and effectiveness. Without proper evaluation, it's impossible to know whether your language-model-driven solution, whether based on prompt engineering, Retrieval Augmented Generation (RAG), GraphRAG (Context Augmented Generation within Cognite Atlas AI), or fine-tuning, truly works, or how to improve it.

General benchmark datasets, while useful, often fall short for specialized tasks. Standard benchmarks like Measuring Massive Multitask Language Understanding (MMLU) assess broad capabilities that may not directly apply to your specific use case. Tailored evaluations, on the other
hand, focus on the exact challenges the model is tasked to address. They offer more relevant insights, ensuring that you're measuring practical performance, not just abstract capabilities. This also reduces the risk of “gaming” the system, a common issue with standardized tests, and provides clearer criteria for deciding whether a new model is worth adopting.

The Cognite Atlas AI LLM & SLM Benchmark Report for Industrial Agents addresses the shortcomings of general benchmark datasets by tailoring large language model (LLM) and small language model (SLM) evaluations to focus on specialized industrial tasks, ensuring the reliability, accuracy, and effectiveness of industrial AI solutions. The datasets used (see Methodology) are not made available to any model provider, thus preventing their inclusion in model training (another common means of “gaming” general benchmarks). Long-term, this report will regularly evaluate an end-to-end range of AI tools covering summarizing, analyzing, and reasoning with industrial data, to assess the full performance of industrial AI agents.

This first edition of the report will focus on natural language search as a key data retrieval
tool for industrial AI agents with knowledge graph architecture: widely regarded as foundational for delivering trusted and verified data to LLMs and SLMs, and for providing rules to contain the model.

The Essential Role of Natural Language Search

Natural language search empowers users to access and retrieve data from the Industrial Knowledge Graph in Cognite Data Fusion using everyday language, removing the need for specialized coding skills. This dramatically improves accessibility, enabling domain experts to focus on solving business challenges rather than navigating complex data structures.

Natural language search works by processing a user's query in plain language, analyzing it for intent, and translating it into a structured GraphQL query. In contrast, code-based alternatives demand a deep understanding of the data's underlying schema and query language. Natural language search's seamless conversion from question to query enables the user to traverse vast and intricate industrial datasets, accelerating data retrieval and unlocking valuable insights with greater speed and precision.

To achieve accurate results in industrial contexts, it's crucial to benchmark different LLMs and SLMs against relevant datasets to identify the best-performing model. Not all models are equally effective at handling industry-specific terminology, workflows, and asset relationships, so it's essential to evaluate how well each model interprets and responds to these specialized queries. Benchmarking enables organizations to assess which models deliver the most relevant, actionable results without needing extensive proprietary training. This ensures the chosen model aligns with the complexity and specificity of the industrial data it searches. This process helps optimize natural language search performance by selecting the model that best fits the task.

Natural language search is at the core of all Cognite Atlas AI agents and enables them to autonomously perform data exploration in order to solve the tasks they are given. It allows the agents to tackle complex tasks with minimal human input by leveraging the full power of the Industrial Knowledge Graph. Therefore, benchmarking natural language search is crucial, as it forms the foundation for agents to access and engage with the data required for all subsequent AI processes.
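The question-to-query conversion described above can be sketched roughly as follows. This is a minimal illustration, not Cognite's implementation: the schema, the extracted fields (mirroring the "type", "operation", "filter", and "properties" fields defined later in the Methodology section), and all example values are hypothetical.

```python
# Hypothetical sketch: assembling a GraphQL query from the structured
# fields a language model might extract from a plain-language question.
# The data model, field names, and values are invented for illustration.

def build_graphql_query(fields: dict) -> str:
    """Turn extracted query fields into a GraphQL query string."""
    filter_part = ""
    if fields.get("filter"):
        clauses = ", ".join(
            f'{k}: {{eq: "{v}"}}' for k, v in fields["filter"].items()
        )
        filter_part = f"(filter: {{{clauses}}})"
    props = "\n      ".join(fields["properties"])
    return (
        f"{{\n  {fields['operation']}{fields['type']}{filter_part} {{\n"
        f"    items {{\n      {props}\n    }}\n  }}\n}}"
    )

# Example question: "Which assets on platform A need maintenance?"
query = build_graphql_query({
    "type": "Asset",            # data model type the operation targets
    "operation": "list",        # GraphQL operation chosen by the model
    "filter": {"parentExternalId": "platform_A"},
    "properties": ["name", "description", "lastMaintenanceDate"],
})
print(query)
```

The point of the sketch is that once the model has produced the structured fields, query assembly itself is mechanical; the hard, benchmarkable part is extracting the right fields from free-form language.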
Results & Analysis: Benchmarking Natural Language Search Models for Industry-Specific Tasks

The graph shows the benchmark results for natural language search services for a collection of language models. The metric used here is an aggregated acceptable score for all the GraphQL query fields. In this context, “acceptable” refers to a performance threshold that indicates whether a language model has generated a result that meets a predefined standard of accuracy or relevance. The “aggregated acceptable score” represents the percentage of GraphQL query fields where the model's response was considered acceptable, meaning that the output was sufficiently accurate or relevant for the task at hand, giving an overall performance metric for the language model's ability to handle natural language search within specific industrial tasks. Note that the graph shows the scores relative to the highest-scoring model.
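The scoring described above can be sketched as follows; the model names and per-field judgments below are invented purely for illustration and do not reflect the report's actual results.

```python
# Illustrative sketch of the aggregated acceptable score: each GraphQL
# query field receives a binary "acceptable" judgment, the aggregate is
# the percentage of acceptable fields, and scores are then reported
# relative to the highest-scoring model. All data here is made up.

def aggregated_acceptable_score(field_results: list) -> float:
    """Percentage of query fields judged acceptable."""
    return 100.0 * sum(field_results) / len(field_results)

# Hypothetical per-field judgments for two models on one test set.
raw = {
    "model_a": aggregated_acceptable_score([True, True, False, True]),  # 75.0
    "model_b": aggregated_acceptable_score([True, False, False, True]),  # 50.0
}

# Normalize against the best model, as in the report's graph.
best = max(raw.values())
relative = {name: 100.0 * score / best for name, score in raw.items()}
print(relative)
```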
The Top Performer

Claude-3.5-Sonnet
Claude-3.5-sonnet is part of the Claude series, a family of large language models developed by Anthropic. Claude models are trained to perform a wide range of natural language processing tasks, including text generation, question answering, summarization, and advanced reasoning. The Claude-3.5-sonnet variant is an advanced iteration designed to handle more complex tasks with a higher degree of accuracy and fluency. Its performance in this benchmark indicates its suitability for tasks requiring deep understanding of industry-specific language and complex data structures. In addition, it is worth highlighting that this model achieved the highest score despite the prompts being initially optimized for GPT-3.5-turbo (16k) (see Methodology). This outcome hints at the model's potential robustness and flexibility in handling prompts not specifically tailored for it.

Rounding Out The Top Five

GPT-3.5-Turbo (16k)
GPT-3.5-turbo (16k) is part of one of the first widely adopted model families from OpenAI, valued for its balance of speed, versatility, and cost-effectiveness across a wide variety of tasks. It gained popularity for delivering reliable performance in diverse applications.

GPT-4o-Mini
GPT-4o-mini is a compact version of OpenAI's GPT-4o series. Though smaller, it demonstrates impressively competitive performance in industrial natural language search. Designed for applications that prioritize efficiency and cost-effectiveness, it is a more affordable alternative that still stands out in this benchmark for its ability to manage complex tasks with reasonable accuracy.

Gemini-1.5-Flash
Gemini-1.5-flash is part of the Gemini series from Google DeepMind, designed for fast, low-latency natural language processing tasks, making it highly efficient for real-time applications. While the model's primary strength lies in its speed, it still offers solid performance in handling complex queries and tasks across various industries.

GPT-4o
GPT-4o is a larger model in OpenAI's GPT-4o family, optimized for tasks requiring a deep understanding of context, reasoning, and conversation flow over long interactions. We observe, surprisingly, that GPT-4o is outperformed by smaller models in the GPT model family, like GPT-3.5-turbo and GPT-4o-mini, though the underlying reasons for this remain unclear. It is possible that unoptimized prompts contribute to this outcome, and further investigation into prompt optimization may reveal impacts on the model's performance.

Lower-Tier Performance Gap

Claude-3.5-Haiku
Claude-3.5-haiku is a model developed by Anthropic, designed to prioritize speed and efficiency. It is the fastest model in the Claude series, optimized for high-throughput tasks such as analyzing large datasets and generating quick outputs.

Llama3-70B
Llama3-70B is one of the models in Meta's Llama 3 series, designed for tasks that require complex reasoning and detailed understanding of specialized domains.

Mistral Large
Mistral Large is Mistral AI's leading model, known for its ability to handle both coding and math tasks effectively, as well as large, complex documents.

The significant performance gap compared to the top models may suggest that these models struggle with specialized industrial terminology or GraphQL query generation, or that advanced versions of the GPT and Claude series are generally superior. However, it is important to consider that the prompts used for evaluation were initially optimized for GPT-3.5-turbo (see Methodology). As such, it's possible that the benchmark underrepresents the capabilities of Claude-3.5-haiku, Llama3-70B, and Mistral Large, and prompt engineering could play a significant role in maximizing each model's performance. At the same time, as discussed earlier, we observe that Claude-3.5-Sonnet achieves the highest performance even without prompt optimizations specifically tailored for it. Despite the opportunity for prompt optimization, the benchmark results remain valuable, as they highlight the models' relative strengths under standardized conditions, offering meaningful insight into their ability to handle industrial data search tasks. This makes the benchmark a reliable foundation for assessing model potential, even as we recognize the opportunity for further refinement through prompt engineering.

The Main Takeaway

The results of this report highlight a
key insight: the largest models are not always the most effective across all tasks. Smaller models, when properly optimized, can offer impressive results, particularly in tasks that demand speed, efficiency, and practical accuracy rather than pure complexity. By tailoring smaller models through careful prompt engineering and domain-specific fine-tuning, they can perform exceptionally well in more focused tasks. This makes them not only more efficient but also more scalable and cost-effective for specific industrial applications.

Industrial Value is Accelerated by Industrial Agents

As mentioned above, natural language search is at the core of all Cognite Atlas AI agents. But why do AI agents matter at all? These tailored, AI-powered applications are designed with an in-depth understanding of industry- and customer-specific terminology, processes, and standards. They utilize algorithms and data models specifically optimized for the patterns and anomalies typical of a particular domain, and they can be customized to fit the unique workflows and requirements of different organizations. As such, industrial agents can offer more accurate and relevant guidance and can be scaled to accommodate the growing data and complexity of operations as an organization expands.

However, their success heavily depends on how well they are optimized for the specific industrial challenges they face. This is where benchmarking becomes critical: by rigorously evaluating different models, The Cognite Atlas AI LLM & SLM Benchmark Report ensures that Cognite Atlas AI agents are powered by the most capable models for the task, fine-tuned to deliver the precision, reliability, and scalability needed for real-world industrial environments.

Cognite Atlas AI is an industrial agent workbench that extends Cognite Data Fusion. It delivers everything necessary for you to build and orchestrate specialized industrial agents on top of your own data, tailored to provide domain-specific insights and automate complex tasks. With Cognite Atlas AI, you can:

- Leverage a library of pre-built industrial agents, or use the low-code agent builder to create your own custom industrial agents
- Pick the best language model, be that LLM, SLM, or custom, for any given use case or industrial agent
- Leverage your Industrial Knowledge Graph with Context Augmented Generation for safe and dependable results
- Easily deploy industrial agents into Cognite Data Fusion, proprietary applications, or other third-party applications

Cognite Atlas AI allows you to use generative AI to carry out more complex operations with greater accuracy, including workflow automation and decision-making support, accelerating efficiencies that can generate tens of millions of dollars in business impact.
Methodology

Data Set

To evaluate natural language search, several specialized datasets were curated, each containing industry-relevant question-answer pairs. The benchmark dataset features six distinct Cognite Data Fusion data models from sectors such as oil & gas and manufacturing. Each test case contains the following fields:

- input: A typical question posed by an industrial user.
- ground_truth: The expected GraphQL query fields.
- relevant_types: Data model types pertinent to the question.
- type: The specific data model type on which the GraphQL operation should be performed.
- operation: The appropriate GraphQL operation.
- filter: The list of possible valid filters that can be applied in the context of the question.
- properties: The properties most relevant to the question, to be returned by the query.

As demonstrated in the example, the evaluation dataset does not contain full GraphQL queries; instead, it focuses on the individual fields required to construct a valid query. This method allows for the separate testing of various components, enabling a more systematic and organized approach to evaluating the generated answers. By isolating each field, it becomes easier to assess specific query elements in detail. Additionally, this modular approach enhances flexibility, making it easier to adapt the evaluation to other query languages, such as SQL.

Each dataset includes tens to hundreds of real-world question-answer pairs, providing a comprehensive foundation for assessing performance across a wide range of practical scenarios. This allows for a robust evaluation of the system's ability to handle diverse and complex industrial queries.

Below is an example test case from an Asset Performance Management (APM) data model:
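A hypothetical test case of the shape described above might look as follows. This is assembled purely from the field definitions in the Data Set section; every type name, identifier, and value is invented for illustration and is not drawn from the actual benchmark dataset.

```python
# Hypothetical APM test case matching the field list above.
# All names and values are invented for illustration only.

example_test_case = {
    # A typical question posed by an industrial user.
    "input": "List all open work orders for pump 23-PA-101",
    "ground_truth": {
        # Data model types pertinent to the question.
        "relevant_types": ["WorkOrder", "Asset"],
        # The specific type the GraphQL operation should target.
        "type": "WorkOrder",
        # The appropriate GraphQL operation.
        "operation": "list",
        # Valid filters in the context of the question.
        "filter": [
            {"status": {"eq": "Open"}},
            {"asset": {"externalId": {"eq": "23-PA-101"}}},
        ],
        # Properties to be returned by the query.
        "properties": ["description", "status", "endTime"],
    },
}
print(example_test_case["ground_truth"]["type"])
```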
Evaluation Prompts

Two distinct sets of prompts were used in these benchmarks: one optimized for models utilizing JSON mode (e.g., the GPT-4o model family and Gemini models), and another tailored for models where JSON mode was not used. It's important to emphasize that the original prompts were specifically optimized for the GPT model family, particularly GPT-3.5-turbo, and only slightly adjusted for other non-JSON models. As such, the prompts were not fully optimized for each individual model, meaning that further tuning could significantly improve the performance of some models. It is therefore important to note that the benchmark results in this report may exhibit a bias toward GPT models.

Evaluation Metrics

To assess the system's performance across the various language models, multiple metrics were calculated. For fields containing a single string, such as “type” or “operation”, exact string matching was applied. For fields with multiple items, such as “properties”, metrics like recall, precision, and F1 score were used to evaluate the accuracy and completeness of the results. Additionally, each field was assigned a binary acceptable score, reflecting whether it meets the minimum standards for correctness and relevance. For simple string fields, this score directly mirrors the result of the exact match, whereas for multi-value fields it incorporates practical utility, marking a field as acceptable if it offers a contextually appropriate and useful response, even if it isn't a perfect match.
54、minimum standards for correctness and relevance.For simple string fields,this score directly mirrors the result of the exact match,whereas for multi-value fields,it incorporates practical utilitymarking a field as acceptable if it offers a contextually appropriate and useful response,even if it isnt
55、 a perfect match.1415Learn More About Cognite Atlas AICognite Atlas AI An essential guide to rolling out Industrial DataOps;the critical first step to implementing AI for industry and embarking on your digital transformation journey.The Cognite Atlas AI Definitive Guideto Industrial Agents A comprehensive manual to accelerate AI innovation and reduce time to value of digital transformation programs.Copyright,Cognite,2024 www.cognite.ai Learn more at cognite.ai