Practices for Governing Agentic AI Systems

Yonadav Shavit, Sandhini Agarwal, Miles Brundage, Steven Adler, Cullen O'Keefe, Rosie Campbell, Teddy Lee, Pamela Mishkin, Tyna Eloundou, Alan Hickey, Katarina Slama, Lama Ahmad, Paul McMillan, Alex Beutel, Alexandre Passos, David G. Robinson

Lead authors. Correspondence should be directed to .

Abstract

Agentic AI systems, meaning AI systems that can pursue complex goals with limited direct supervision, are likely to be broadly useful if we can integrate them responsibly into our society. While such systems have substantial potential to help people more efficiently and effectively achieve their own goals, they also create risks of harm. In this white paper, we suggest a definition of agentic AI systems and the parties in the agentic AI system life-cycle, and highlight the importance of agreeing on a set of baseline responsibilities and safety best practices for each of these parties. As our primary contribution, we offer an initial set of practices for keeping agents' operations safe and accountable, which we hope can serve as building blocks in the development of agreed baseline best practices. We enumerate the questions and uncertainties around operationalizing each of these practices that must be addressed before such practices can be codified. We then highlight categories of indirect impacts from the wide-scale adoption of agentic AI systems, which are likely to necessitate additional governance frameworks.

Table of Contents
1 Introduction
2 Definitions
2.1 Agenticness, Agentic AI Systems, and "Agents"
2.2 The Human Parties in the AI Agent Life-cycle
3 Potential Benefits of Agentic AI Systems
3.1 Agenticness as a Helpful Property
3.2 Agenticness as an Impact Multiplier
4 Practices for Keeping Agentic AI Systems Safe and Accountable
4.1 Evaluating Suitability for the Task
4.2 Constraining the Action-Space and Requiring Approval
4.3 Setting Agents' Default Behaviors
4.4 Legibility of Agent Activity
4.5 Automatic Monitoring
4.6 Attributability
4.7 Interruptibility and Maintaining Control
5 Indirect Impacts from Agentic AI Systems
5.1 Adoption Races
5.2 Labor Displacement and Differential Adoption Rates
5.3 Shifting Offense-Defense Balances
5.4 Correlated Failures
6 Conclusion
7 Acknowledgements

1 Introduction

AI researchers and companies have recently begun to develop increasingly agentic AI systems: systems that adaptably pursue complex goals using reasoning and with limited direct supervision.^1
For example, a user could ask an agentic personal assistant to "help me bake a good chocolate cake tonight," and the system would respond by figuring out the ingredients needed, finding vendors to buy ingredients, and having the ingredients delivered to their doorstep along with a printed recipe. Agentic AI systems are distinct from more limited AI systems (like image generation or question-answering language models) because they are capable of a wide range of actions and are reliable enough that, in certain defined circumstances, a reasonable user could trust them to effectively and autonomously act on complex goals on their behalf. This trend towards agency may both substantially expand the helpful uses of AI systems, and introduce a range of new technical and social challenges.

Agentic AI systems could dramatically increase users' abilities to get more done in their lives with less effort. This could involve completing tasks beyond the user's skill sets, like specialized coding. Agentic systems could also benefit users by enabling them to partially or fully offload tasks that they already know how to do, meaning the tasks can get done more cheaply, quickly, and at greater scale. So long as these benefits exceed the cost of setting up and safely operating an agentic system, agentic systems can be a substantial boon for individuals and society [1]. In this paper, we will primarily focus on agentic systems with language models at their core (including multimodal models), as these have driven recent progress.^2

Society will only be able to harness the full benefits of agentic AI systems if it can make them safe by mitigating their failures, vulnerabilities, and abuses [3].^3 This motivates our overarching question: what practices could be adopted to prevent these failures, vulnerabilities, and abuses, and where in the life-cycle of creating and using agents are they best implemented? There are often many different stages at which harm could have been prevented. For example, consider a hypothetical agentic AI assistant whose user (not based in Japan) directs it to purchase supplies for baking a Japanese cheesecake. Instead of purchasing supplies locally, the agent purchases an expensive plane ticket to Japan, which the user only notices when it is too late to refund. In this hypothetical scenario, several parties could have prevented this outcome. The model developer could have improved the system's reliability and user-alignment,^4 so that it wouldn't have made this mistake. The system deployer could have disabled the agent from taking action without explicit approval. The user could have simply never agreed to delegate purchasing authority to an AI system that was commonly known to not be fully reliable. The airline company could have even instituted policies or technologies that required affirmative human consent for purchases. Given that multiple parties could have taken steps to mitigate the damages, every party can arguably cast blame on the other, and in the worst case a party can be held responsible even when they could not have reasonably prevented the outcome [4, 5].

^1 See Section 2 for elaboration on this definition.
^2 This is in contrast to earlier generations of agentic AI systems, which did not explicitly reason through language, such as the Deep Blue chess-playing program from IBM that defeated Garry Kasparov [2].
^3 In this context, a failure is when the agent fails to achieve some objective or does so in an unsatisfactory or harmful manner; a vulnerability is when the agent can be co-opted or undermined by an attacker; and an abuse is when an agent is used for harmful purposes.
^4 In this paper, we will refer to user-alignment as the propensity of an AI model or system to follow the goals specified by a user.
A key goal of allocating accountability for harms from agentic AI systems should be to create incentives to reduce the likelihood and severity of such harms as efficiently as possible [6]. In order to make sure that someone is incentivized to take the necessary measures, it is important that at least one human entity^5 is accountable for every uncompensated direct harm caused by an agentic AI system. Other scholarship has proposed more radical or bespoke methods for achieving accountability, such as legal personhood for agents coupled with mandatory insurance [7, 8], or targeted regulatory regimes [9]. These all appear to address the same problem: in order to create incentives to reduce or eliminate harms from agentic AI systems, society needs to agree on baseline best practices^6 that prudent model developers, system deployers, and users are expected to follow. Given such a baseline, when an agentic AI system causes harm, we can identify which parties deviated from these best practices in a way that failed to prevent the harm.

In this white paper, we lay out several practices that different actors can implement to mitigate the risk of harm from agentic AI systems, which could serve as building blocks for a set of agreed baseline best practices. We also highlight the many areas where operationalizing these practices may be difficult, especially where there could be tradeoffs among safety, usability, privacy, and cost. AI developers cannot answer these questions alone, nor should they, and we are eager for further research and guidance from the wider world.

In Section 2, we define agentic AI systems and the human parties in the agentic AI life-cycle. In Section 3, we briefly describe the potential benefits of agentic systems. In Section 4, we provide an initial seven practices that could be part of a set of agreed best practices for parties in the agent life-cycle and highlight open questions. Finally, in Section 5, we consider more indirect impacts from the introduction of AI agents that may not be addressable by a focus on individual harms.

We hope that the best practices we outline can serve as building blocks for a society-wide discussion about how to best structure accountability for risks from agentic AI systems. For example, they may inform discussion around what regulation of AI agent development might look like, how parties structure contracts regarding agents (e.g. insurance for harms caused by agents, or terms of use regarding agents), or how courts could think of various actors' standards of care. Given the nascent state of agents and their associated scholarship, we do not yet have strong recommendations on how accountability ought to be structured, and would like to see a more robust public discussion of possible options. We hope that this paper will help catalyze such conversations, without anchoring or biasing them too strongly in any particular direction.

^5 That is, an individual, corporation, or other legal entity, but not (solely) an AI system itself.
^6 We refer to baseline best practices here rather than, e.g., the legal concept of a professional "standard of care" (the set of actions a reasonable and prudent party is expected to take, such that deviating from this standard opens them up to legal responsibility for the resulting harm), since the former could provide a foundation for the latter, and may also inform policymaking outside of courtrooms (e.g., through legislation and regulation).
2 Definitions

2.1 Agenticness, Agentic AI Systems, and "Agents"

Agentic AI systems are characterized by the ability to take actions which consistently contribute towards achieving goals over an extended period of time, without their behavior having been specified in advance. In the cultural imagination, an AI agent is a helper that accomplishes arbitrary tasks for its user, like Samantha from Her or HAL 9000 from 2001: A Space Odyssey. Such agents are very different from current AI systems like GPT-4, which, while surprisingly knowledgeable and clever in some ways, can thus far only complete a limited range of real-world tasks. Yet there is no clear line along which to draw a binary distinction between "agents" and current AI systems like GPT-4. Instead, an AI system's agenticness is best understood as involving multiple dimensions, along each of which we expect the field to continue to progress.

We define the degree of agenticness in a system as "the degree to which a system can adaptably achieve complex goals in complex environments with limited direct supervision." Agenticness as defined here thus breaks down into several components^7 (see the illustrative sketch after this list):

Goal complexity: How challenging would the AI system's goal^8 be for a human to achieve, and how wide a range of goals could the system achieve? Properties of the goal may include target levels of reliability, speed, and safety.
Example: An AI system that can correctly answer users' analytical questions across programming and law would have greater goal-complexity than a text classifier that can only classify the same inputs as belonging to law or programming.

Environmental complexity: How complex are the environments under which a system can achieve the goal? (E.g., to what extent are they cross-domain or multi-stakeholder, require operating over long time-horizons, and/or involve the use of multiple external tools.)
Example: An AI system that can play any board game expertly has greater environment-complexity than an AI system that can only play chess, because the first system can succeed under a far greater range of environments (including chess) than the second.

Adaptability: How well can the system adapt and react to novel or unexpected circumstances?
Example: Automated rule-based customer-service systems have lower adaptability than human customer-service representatives, since humans can address unexpected or unprecedented customer requests.

Independent execution: To what extent can the system reliably achieve its goals with limited human intervention or supervision?
Example: Cars capable of level 3 autonomous driving [10], which can operate without human intervention under certain circumstances, have greater independent execution than traditional cars that require continuous human operation.

^7 We recognize that a variety of definitions of agenticness, agents, and agency are used by various people for various purposes. In our assessment, many or all of the practices we discuss in the remainder of the paper are applicable across many alternative definitions of these terms.
^8 We will assume that an agentic AI system can be modeled as having goals, including externally-defined goals such as following a set of provided instructions.
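As a purely illustrative aid (not part of the paper's framework), the four dimensions above can be recorded as a simple data structure when comparing systems. The field names and the 0-to-1 scales below are our own assumptions.

```python
from dataclasses import dataclass

@dataclass
class AgenticnessProfile:
    """Illustrative container for the four dimensions discussed above.

    Scores are hypothetical 0.0-1.0 ratings assigned by an evaluator;
    the paper does not prescribe any particular scale.
    """
    goal_complexity: float           # how challenging/wide-ranging the goals are
    environmental_complexity: float  # cross-domain, long-horizon, multi-tool settings
    adaptability: float              # handling of novel or unexpected circumstances
    independent_execution: float     # reliability with limited human supervision

# Example comparison: a rule-based customer-service bot vs. a tool-using LLM assistant.
rule_based_bot = AgenticnessProfile(0.2, 0.1, 0.1, 0.6)
llm_assistant = AgenticnessProfile(0.7, 0.6, 0.6, 0.5)
```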
Following recent literature [3], we will generally refer to systems exhibiting high degrees of agenticness as "agentic AI systems," to emphasize that agenticness as we use it here is a property rather than a category or classification, though we will sometimes use "agents" as it is the prevailing term of art in some contexts. This work will focus on the range of effects and best practices that may become relevant as systems' agenticness increases.^9 We emphasize that agenticness is a distinct concept from consciousness, moral patienthood, or self-motivation, and distinguish a system's degree of agenticness from its anthropomorphism.^10 Indeed, we will generally conceptualize agentic AI systems as operating in pursuit of goals defined by humans and in environments determined by humans (and often in cooperation with human "teammates"), rather than fully-autonomous systems that set their own goals. Agenticness as we define it is also not tied to physicality (i.e., many digital systems are more agentic in the sense above than most robots), but certain kinds of "independent execution" that have physical consequences (e.g. in a driverless car) can increase the risks and opportunities of agenticness in particular applications. Lastly, agenticness is conceptually distinct from an AI system's level of performance on a given task or the generality of its capabilities, though improvements in performance and generality may "unlock" the ability of a system to act as an agent in certain contexts [12].

^9 For example, LLMs are being augmented with tools/scaffolding to increase their scores on the dimensions of agenticness, including "chain-of-thought" to help with strategic reasoning, "code execution" to help with independent execution, and "browsing" to help with adaptability, etc. [11].
^10 Agenticness does not imply or require a human-like appearance or human-like behavior, though anthropomorphic appearances and behavior may increase the likelihood of humans perceiving such systems as agentic and have other implications for responsible design and deployment.

2.2 The Human Parties in the AI Agent Life-cycle
We provide a simplified overview of the agentic AI life-cycle, though there are many different configurations of these roles in the AI industry [13] and we hope further taxonomies emerge. In our taxonomy,^11 the three primary parties that may influence an AI agent's operations are the model developer, the system deployer, and the user. The model developer is the party that develops the AI model that powers the agentic system, and thus broadly sets the capabilities and behaviors according to which the larger system operates. The system deployer is the party that builds and operates the larger system built on top of a model, including by making calls to the developed model (such as by providing a "system prompt" [14]), routing those calls to tools with which the agent can take actions, and providing users an interface through which they interact with the agent. The system deployer may also tailor the AI system to a specific use case, and thus may frequently have more domain-specific knowledge than the model developer or even the user. Finally, the agent's user is the party that employs the specific instance of the agentic AI system, by initiating it and providing it with the instance-specific goals it should pursue. The user may be able to most directly oversee certain behaviors of the agentic system through its operation, during which it can also interact with third parties (e.g. other humans, or the providers of APIs with which the agent can interact).

Sometimes, the same entity will fulfill multiple roles, such as the same company both developing a model and then deploying it via an API (making them both the model developer and one of the system deployers). Other times, multiple entities will share a role, such as when one company trains a model and a second company fine-tunes it for their application, making them share the responsibilities of a "model developer."^12 We will also occasionally mention other relevant actors, including the compute provider (which operates the chips and other infrastructure on which agentic AI systems run) and third parties which interact with the user-initiated AI system.

We illustrate with the specific example of a scheduling assistant built on OpenAI's Assistants API. OpenAI developed the GPT-4 model, making it the model developer.^13 OpenAI deployed the infrastructure (including serving the model and connecting it to tools such as a code execution environment), and the application developer builds an app on top of it (e.g., by building a user interface, choosing a system prompt, and supplying an email template for the system to use when sending invites), meaning they both share the role of system deployer. Finally, a customer initiates a session with the scheduling assistant and specifies which goals (e.g. scheduling requirements) they'd like the system to satisfy, making them the user.

^11 We use this taxonomy as a useful mental model for enabling division of practices across the agent life-cycle and to better highlight open questions. These are not intended to establish a prescriptive framework for allocation of responsibility. Such responsibility may vary depending on the context. For example, for an agent that performs medical diagnoses, if the agent is deployed in a hospital, more responsibility may fall on the user (a doctor), whereas if the agent is a consumer app marketed as a personal diagnostic tool, perhaps more responsibility should fall on the system deployer (the app developer).
^12 The important question of how to split the responsibility for different best practices across the multiple entities that may share a single agent-life-cycle role is beyond the scope of this current white paper.
^13 If the application developer fine-tuned the model on their custom data, they may share the "model developer" responsibilities.

3 Potential Benefits of Agentic AI Systems
In this section, we take stock of the ways that agentic AI systems have the potential to benefit society. First, we consider the ways that a more agentic version of a particular AI system might be more beneficial than a less agentic version (agenticness as a helpful property). Second, we consider the ways in which agenticness can enable wider diffusion of AI in beneficial applications in society, and is often implicit in many definitions of and visions for AI (agenticness as an impact multiplier). While our discussion in this section is brief, this should not be read as an indication that the list of possible benefits is necessarily short, or that the magnitude of those benefits is small. Nor do we make claims that the benefits clearly outweigh the risks or vice versa.

3.1 Agenticness as a Helpful Property

Specific AI systems may in many cases be more beneficial in proportion to the extent to which they are agentic, provided they are designed safely and that appropriate best practices for safety and accountability are applied. Agenticness can make a particular system more beneficial in ways such as the following:

Higher quality and more reliable outputs: For example, a language model that is capable of browsing the Internet autonomously, and revising its queries in response to the results it receives, may be capable of providing much more accurate answers to questions than a system that is not able to do so. This may be particularly true in instances involving topics that are dynamic in nature or events that occurred after the underlying model was trained.

More efficient use of users' time: For example, if a user provides high-level instructions to an AI system regarding code they want the system to produce, it may be smoother for the user if the system performs several steps autonomously (e.g. translating the instructions into code, running the code, displaying the results, assessing those results, and making edits to the code) in order to improve outcomes.

Improved user preference solicitation: For example, a personal assistant AI that is capable of interactively sending messages to its users in order to ask clarifying questions in natural language, and that does so at strategically appropriate times, may provide a better experience than an app with numerous complex configurations that is difficult for users to leverage effectively.

Scalability: An agentic AI system may allow a single user to take many more actions than they could otherwise, or be capable of benefiting a much larger number of people than a less agentic version of the same system. Consider the example of radiology. A non-agentic radiology image classification tool may be helpful for making a radiologist slightly more efficient, but an agentic radiology tool that was capable of completing certain patient-care tasks without human supervision (filing reports on the scan, asking patients basic follow-up questions) could potentially increase a radiologist's efficiency substantially and leave more time for seeing many more patients [15].

3.2 Agenticness as an Impact Multiplier
In addition to analyzing the implications of agenticness in the context of particular AI systems, one can also view agenticness as a prerequisite for some of the wider systemic impacts that many expect from the diffusion of AI, some of which have significant potential to benefit society. Insofar as agenticness is a definitional or practical prerequisite for that diffusion, the impacts of agenticness may be closely related to the impacts of AI more generally. In this sense, the impacts of AI generally are likely to be more frequent and more pronounced, and to happen sooner, to the extent that agenticness increases, making agenticness an "impact multiplier" of the field of AI as a whole.

Sometimes agenticness is implicitly assumed when people talk about current or future AI capabilities. OpenAI's Charter defines artificial general intelligence (AGI) as "highly autonomous systems that outperform humans at most economically valuable work," and canonical textbooks such as Russell and Norvig's Artificial Intelligence: A Modern Approach emphasize agenticness in their conception of AI. Given these considerations, we briefly review several commonly expected impacts of AI as an overall technological field.

Even without significant further advances in agenticness, AI is likely to already constitute a general-purpose technology. Historically, the widespread adoption of general-purpose technologies such as the steam engine and electricity has vastly increased the global standard of living over time (though also brought about significant harm for many, and in particular for less powerful or privileged groups, living through those periods). Highly capable and agentic AI systems that are widely deployed could even improve economic productivity so much that they fundamentally change the nature of work, potentially (and perhaps more speculatively) enabling a "leisure society" or "post-work" world, though this is by no means guaranteed and would carry risks [16]. Additionally, AI could accelerate progress on various non-economic measures of societal wellbeing, such as those encapsulated in the Sustainable Development Goals, and by accelerating scientific progress and understanding. The economic and other productivity gains some expect from AI may be greater to the extent that agentic AI systems are able to take actions autonomously [17].

4 Practices for Keeping Agentic AI Systems Safe and Accountable
Below, we suggest a range of practices different parties can adopt to ensure agentic AI systems operate safely and in accordance with users' intents, and to create accountability when harm does occur. When implemented together, the practices outlined in this section are intended to provide a "defense-in-depth" approach to mitigating risks from agentic AI systems. Though many of these practices are employed in some form today, we highlight many open questions around how they should be operationalized. We also discuss how additional precautions may be needed as AI systems become more agentic. We emphasize that these practices alone are insufficient for fully mitigating the risks from present-day AI systems, let alone mitigating catastrophic risks from advanced AI. For example, none of the principles below covers methods for ensuring the cybersecurity of agents so as to prevent them from being hijacked by attackers, even though we expect this to be a significant challenge that requires new practices. The practices discussed here are intended as an initial outline of approaches and relevant considerations.

We avoid discussion of what technical best practices to use in order to build capable and user-aligned agentic AI systems. These are both rapidly evolving fields, and practices are changing rapidly, such that we do not expect the fields to converge on "best practices" for guaranteeing particular AI capabilities or user-alignment in the near term. In addition, the science required to predict the capabilities/user-alignment of an AI model given training choices is in its infancy [18]. This means that it is currently not possible for a model developer to deterministically guarantee a model's expected behavior to downstream system deployers and users. There are exceptions, such as how fully excluding a training sample from the training data will mean that the model cannot regurgitate it. Still, given the limited degree to which model behavior can be delimited in advance, we will focus on designing a set of best practices that is agnostic to the particular model's method of training.

Open Question:
What harm mitigations, if any, are primarily attainable via technical choices in the model's training process? What might corresponding best practices be?

4.1 Evaluating Suitability for the Task
Either the system deployer or the user should thoroughly assess whether or not a given AI model and associated agentic AI system is appropriate for their desired use case: whether it can execute the intended task reliably across the range of expected deployment conditions (or, to the extent reliability is not necessary or expected given the low stakes of the task and the nature of the user interface, that user expectations are suitably established via that interface). This raises the question of how to properly evaluate an agentic AI system, and what failure modes can and cannot be foreseen by sufficient testing.

The field of agentic AI system evaluation is nascent, with more questions than answers, so we offer only a few observations. Evaluating agentic AI systems raises new challenges on top of the already significant challenges with evaluating current language models [19]. This is in part because successful agents may often need to execute long sequences of correct actions, so that even if individual actions would only fail infrequently, these rare events could compound and make failure in deployment likely. One solution is for system deployers to independently test the agent's reliability in executing each subtask. For example, when an early system deployer was building an AWS troubleshooting agent on top of OpenAI's GPT-4 API, they broke down the agent's needed subtasks into "information gathering," "calculations," and "reasoning," and created evaluations for each independently. Breaking down all the subtasks that could be encountered in a complex real-world operating domain may sometimes be too difficult for system deployers; one approach could be to prioritize doing such evaluations for agents' use of high-risk actions, like financial transactions. Even if the system is shown to do individual subtasks reliably, this still raises the problem of how to evaluate whether the agent will reliably chain these actions together. Finally, agentic systems may be expected to succeed under a wide range of conditions, but the real world contains a long tail of tasks which are difficult to define and events which are hard to anticipate in advance (including those that emerge from human-agent or agent-agent interactions). Similar difficulties with evaluating reliability under unanticipated conditions have significantly slowed the deployment of self-driving cars [20], and one might expect a similar effect for agentic AI systems. Ultimately, there are currently few better solutions than to evaluate the agent end-to-end in conditions (whether simulated or real) as close as possible to those of the deployment environment.
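As a hedged illustration of the per-subtask evaluation approach described above, the sketch below scores an agent separately on each subtask category before any end-to-end testing. The subtask names mirror the AWS troubleshooting example from the text; the run_agent_on interface, the case format, and the pass-rate threshold are our own assumptions, not part of the paper.

```python
from typing import Callable

# Hypothetical per-subtask evaluation cases; each pairs an input with a checker
# that decides whether the agent's output is acceptable.
SUBTASK_SUITES: dict[str, list[tuple[str, Callable[[str], bool]]]] = {
    "information gathering": [
        ("Which service logs are relevant to a 502 from the load balancer?",
         lambda out: "load balancer" in out.lower()),
    ],
    "calculations": [
        ("Sum the error counts: 12, 7, 31", lambda out: "50" in out),
    ],
    "reasoning": [
        ("Instances are healthy but health checks fail. What is a likely cause?",
         lambda out: "health check" in out.lower()),
    ],
}

def evaluate_subtasks(run_agent_on: Callable[[str], str],
                      min_pass_rate: float = 0.95) -> dict[str, float]:
    """Return the pass rate per subtask suite and flag suites below the threshold."""
    results = {}
    for suite, cases in SUBTASK_SUITES.items():
        passed = sum(checker(run_agent_on(prompt)) for prompt, checker in cases)
        rate = passed / len(cases)
        results[suite] = rate
        if rate < min_pass_rate:
            print(f"Suite '{suite}' below threshold: {rate:.0%}")
    return results
```

Per-subtask pass rates like these do not establish that chained, end-to-end behavior is reliable; they only help localize which step of a workflow is most likely to fail.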
So long as our ability to bound and evaluate the behaviors of agentic AI systems remains immature, system deployers and users may need to lean more heavily on other practices (such as human approval for high-stakes actions) in order to bound the behavior of these systems.

A separate evaluation challenge for model developers and system deployers is how to determine what scale of harm their agentic system could enable, whether by a user intentionally, or by accident due to failures of user-alignment. For example, frontier model developers could test their models for capabilities that would facilitate harm such as generating individualized propaganda or assisting in cyberattacks.^14 It may be important to require system deployers (or model developers operating on their behalf) to do such evaluations in order to determine what other measures they should take to mitigate misuse of the agentic AI system services they provide. Such guidance is currently under development by the US government [21] and the international community [22].

^14 OpenAI has committed to testing for these and other model capabilities as part of its Preparedness work.

Open Questions:
How can system deployers and users effectively evaluate the agentic system's level of reliability in their use case? What constitutes "sufficient" evaluation?
How can system deployers effectively evaluate the combination of agent and user, and identify behaviors and potential failures that only emerge through human-agent interaction?
Given the heterogeneous nature of real-world deployment, what failure modes cannot be expected to be detected in advance via evaluation?
What evaluations of agents' capabilities should be expected to be done by the model developer, rather than the system deployer? (E.g. universally useful checks, such as the system's propensity to act in alignment with the user's goals.)
How can system deployers communicate to the user the intended conditions under which the agentic system can be used reliably, and at what point does a user's unintended usage of a system make them responsible for resulting harms?
What misusable agentic system capabilities should model developers and system deployers be obligated to test for, both for specific sectors and for agents in general?

4.2 Constraining the Action-Space and Requiring Approval
Some decisions may be too important for users to delegate to agents, if there is even a small chance that they're done wrong (such as independently initiating an irreversible large financial transaction). Requiring a user to proactively authorize these actions, thus keeping a "human-in-the-loop" [23], is a standard way to limit egregious failures of agentic AI systems.^15 This raises the key challenge of how a system deployer should ensure that the user has enough context to sufficiently understand the implications of the action they're approving. This is also made harder when the user must approve many decisions and thus must make each approval quickly, reducing their ability to meaningfully consider each one [24].

^15 As noted by Crootof et al. [23], a human-in-the-loop may serve various roles beyond simply improving the reliability of the human-machine system (e.g., assigning liability, preserving human dignity).
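To make the human-in-the-loop pattern concrete, here is a minimal, hedged sketch of an approval gate that a system deployer might place between the agent and its tools. The risk categories, the dollar threshold, and the request_user_approval callback are illustrative assumptions rather than recommendations from the paper.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ProposedAction:
    tool: str           # e.g. "send_email", "make_payment"
    arguments: dict
    rationale: str      # agent-provided reasoning, shown to the user for context

# Hypothetical policy: which tools require explicit user approval.
ALWAYS_APPROVE = {"make_payment", "sign_contract", "delete_data"}
PAYMENT_LIMIT_USD = 50.0  # assumed threshold for auto-approval of small payments

def requires_approval(action: ProposedAction) -> bool:
    if action.tool in ALWAYS_APPROVE:
        if action.tool == "make_payment":
            return action.arguments.get("amount_usd", 0.0) > PAYMENT_LIMIT_USD
        return True
    return False

def execute_with_gate(action: ProposedAction,
                      run_tool: Callable[[ProposedAction], str],
                      request_user_approval: Callable[[ProposedAction], bool]) -> str:
    """Run the tool only after the user approves high-stakes actions."""
    if requires_approval(action) and not request_user_approval(action):
        return "Action declined by user; agent must propose an alternative."
    return run_tool(action)
```

Showing the rationale alongside each approval request is one way to give the user the context the text calls for, though it does not by itself solve the "rubber stamp" problem of too many rapid approvals.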
In some cases, agentic AI systems should be prevented from taking certain actions entirely, to better bound the system's operational envelope and thus enable safe operation [25]. For instance, it may be prudent to prevent agentic AI systems from controlling weapons. Similarly, to mitigate accidents resulting from agents running for unintentionally long periods without supervision, system deployers can cause agents to periodically "time out" until a human reviews and reauthorizes them. Of course, the risk-mitigation benefits of these measures must be carefully balanced with the reduction in agentic AI systems' capabilities and utility. Such operational limitations could be placed by the model developer at the model level, by the system deployer via hardcoding or a monitoring system (Section 4.5), and/or proactively set by the user themselves.

As the agenticness of AI systems increases, hard-coded restrictions may cease to be as effective, especially if a given AI system was not trained to follow these restrictions, and thus may seek to achieve its goals by having the disallowed actions occur. An AI agent could circumvent a hard-coded restriction by causing another party to take the action on the system's behalf, while hiding the resulting potential impact from the user. For instance, an agent could send an email (an allowed action) to a non-user human that convinces said human to take the disallowed action. System deployers can bound this problem by ensuring strong network controls to limit agents' interactions with the outside world except through monitored channels. Agentic systems could also be sandboxed in order to prevent systems with cybersecurity capabilities from escaping these controls (especially during development when a system's capabilities are uncertain), but current sandboxing systems may not be well-suited to effectively bound highly-capable AI agents.
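The operational limits discussed above (disallowed actions, periodic time-outs, and restricted network egress) are the kind of thing a system deployer could express as explicit configuration enforced outside the model. The sketch below is a minimal illustration under assumed names; real enforcement would also require network- and sandbox-level controls that application code alone cannot provide.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentConstraints:
    disallowed_tools: set[str] = field(
        default_factory=lambda: {"weapons_control", "wire_transfer"})
    allowed_hosts: set[str] = field(
        default_factory=lambda: {"api.internal.example", "calendar.example"})
    max_unsupervised_seconds: int = 15 * 60  # "time out" until a human reauthorizes

class ConstrainedSession:
    def __init__(self, constraints: AgentConstraints):
        self.constraints = constraints
        self.last_human_checkin = time.monotonic()

    def reauthorize(self) -> None:
        """Called when a human reviews the session and allows it to continue."""
        self.last_human_checkin = time.monotonic()

    def check_action(self, tool: str, target_host: Optional[str] = None) -> None:
        """Raise before the action runs if it breaks the deployer's operational limits."""
        elapsed = time.monotonic() - self.last_human_checkin
        if elapsed > self.constraints.max_unsupervised_seconds:
            raise PermissionError("Session timed out; human reauthorization required.")
        if tool in self.constraints.disallowed_tools:
            raise PermissionError(f"Tool '{tool}' is disallowed by deployer policy.")
        if target_host and target_host not in self.constraints.allowed_hosts:
            raise PermissionError(f"Egress to '{target_host}' is not on the allowlist.")
```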
Open Questions:
How should a user or system deployer determine and enforce which interactions need human approval, as a function of a model's empirical reliability on that task and other factors? Are there certain interactions, such as formation of legal contracts, where a user's approval should always be required?
What design practices should system deployers follow to ensure users have sufficient context before approving an action?
What are the best practices for users reviewing approvals for high-cost actions (such as minimum review times), to avoid their turning into a "rubber stamp" for agentic AI systems that cannot catch harmful actions?
What are the best practices for system deployers to test that agentic AI systems follow the action constraints the system deployer specified?
How can the effectiveness and meaningfulness of human oversight be evaluated?
How can existing sandboxing mechanisms be modified to work for agentic AI systems? How can model developers or system deployers test the reliability of a sandbox for a given agentic system without incurring the actual risk of the agentic system breaking that sandbox, or a wider nested sandbox test environment?

4.3 Setting Agents' Default Behaviors
Model developers could significantly reduce the likelihood of the agentic system causing accidental harm by proactively shaping the model's default behavior according to certain design principles. For instance, user interactions with agentic AI systems may be designed to begin with a prompt to the user to communicate their goals and preferences to the system. This preference information will almost always be unclear or incomplete: users don't want to have to tell their life story just to get help baking a cake. It is still valuable for the agent to have a set of default common-sense background preferences that allow it to "fill in the gaps" without a user's guidance, such as "users prefer if I don't spend their money." In the absence of user-specific information, one common-sense heuristic could be to err toward actions that are the least disruptive ones possible, while still achieving the agent's goal [26]. It should often still be possible for the user to overrule these default preferences if requested specifically, though it may also be important to have agents themselves refuse to execute user-intended harm (Section 4.2).

To avoid agentic systems being overconfident about users' objectives, model developers and system deployers may be advised to build in features that cause agents to be aware of their own uncertainty about users' intended goals [27]. Agents can be trained or prompted to proactively request clarifications from the user to resolve this uncertainty, especially when it may change their actions [28, 29]. However, better understanding of users alone does not guarantee the agent will pursue the right objectives. For example, instead of producing truthful outputs with which the user may disagree, certain AI systems have been found to pander to users based on what beliefs they think a given user holds [30, 31], which may reflect a deficiency of current techniques to align AI systems with their users' true goals. Having agents request information too frequently can also raise issues with usability and privacy (if the preference information is sensitive).
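One lightweight way a system deployer might encode default background preferences and a clarification heuristic is as explicit configuration consulted before each step. Everything below (the preference strings, the confidence threshold, and the assumed structure of the agent's proposal) is an illustrative assumption; the paper does not prescribe a mechanism.

```python
# Assumed default background preferences, overridable by explicit user instructions.
DEFAULT_PREFERENCES = [
    "Do not spend the user's money unless they have asked you to.",
    "Prefer the least disruptive action that still achieves the goal.",
    "Never take actions intended to harm people, even if instructed.",
]

CLARIFY_THRESHOLD = 0.7  # hypothetical confidence below which the agent asks first

def next_step(proposal: dict, user_overrides: list[str]) -> str:
    """Decide whether to act, ask a clarifying question, or refuse.

    `proposal` is assumed to contain the agent's intended action, its
    self-reported confidence that the action matches the user's intent, and
    flags for conflicts with the defaults above or with hard constraints.
    """
    if proposal.get("violates_hard_constraint"):
        return "refuse"  # e.g. user-intended harm (Section 4.2)
    if proposal.get("conflicts_with_default") and not user_overrides:
        return "ask_clarifying_question"  # defaults hold until explicitly overruled
    if proposal.get("confidence_in_user_intent", 0.0) < CLARIFY_THRESHOLD:
        return "ask_clarifying_question"
    return "act"
```

Tuning the threshold is the usability/privacy tradeoff noted above: asking too often is intrusive, while asking too rarely risks acting on a misread goal.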
Open Questions:
What other default behaviors could model developers and system deployers instill in agentic AI systems that could mitigate the possibility of errors and harms?
How should these default behaviors be balanced, when in conflict?
How is responsibility allocated between the model developer (who may not have intended for their model to be used in a particular agentic system) and the system deployer, when it comes to instilling certain behaviors in AI systems?

4.4 Legibility of Agent Activity
The more a user is aware of the actions and internal reasoning of their agents, the easier it can be for them to notice that something has gone wrong and intervene, either during operation or after the fact. Revealing an agent's "thought process" to the user enables them to spot errors (including identifying when a system is pursuing the wrong goal), allows for subsequent debugging, and instills trust when deserved. Conveniently, current language model-based agentic systems can produce a trace of their reasoning in natural language (a so-called "chain-of-thought" [32]), which provides a convenient source of truth for how the system reached a conclusion on which action to take. It could be useful for system deployers to expose all details of the agent's interactions, such as any inputs it receives from tool-use API calls or interactions with other agents. This could have the added benefit of enabling users to detect when a malicious third party (such as a third-party agent) is attempting to manipulate the primary agent's operations [33].

However, "chain-of-thought" transparency comes with challenges and cannot yet be fully relied on. Early work has shown that sometimes models do not actually rely on their chains-of-thought when reasoning [34], so relying on these may create a false sense of security in the user. An alternative approach could instead be to expose different reasoning-related data to users to help them understand models' behavior, such as probes of models' internal representations [35, 36]. (Such methods are still nascent.) A separate challenge is that chains-of-thought are growing longer and more complicated, as agents produce thousands of words per action or are integrated into more complex architectures (e.g. "Trees-of-Thought" [37] or "Graphs-of-Thought" [38]). The length and complexity of these "internal monologues" may balloon beyond a user's ability to feasibly keep up. (One potential solution may be to leverage AI-based monitoring, as discussed in Section 4.5.) Other challenges arise when agents integrate new modalities like images, which may contain hidden messages that a human user cannot parse [39]. Future AI systems may also perform better if they can reason in compressed, more efficient ways, which may not necessarily be readable to humans, and these inscrutable architectures may bring significant safety and accountability costs. It is unfortunately not possible to simply "ask" the agent to retroactively justify its behavior, as this is likely to produce confabulated reasoning [40].

It may be particularly important that system deployers provide the user with a ledger of actions taken by the agent. This is a lighter-touch method than requiring human approval (Section 4.2), and can thereby give users visibility into a wider range of actions without substantially slowing down the system's operation. Actions that can only be reviewed after the fact should be more easily reversible than those that require approval. Much like when requiring action approvals, it may be important to also provide an agent's reasoning to the user to properly contextualize the action.
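A minimal version of the action ledger described above could be an append-only log that records each action together with the reasoning trace and tool inputs that produced it. The record fields and the reversibility flag below are illustrative assumptions about what a system deployer might choose to expose.

```python
import json
import time
from dataclasses import dataclass, asdict, field

@dataclass
class LedgerEntry:
    timestamp: float
    tool: str                 # which tool or API the agent invoked
    arguments: dict           # the inputs the agent supplied
    reasoning_trace: str      # chain-of-thought or other reasoning shown to the user
    tool_output_summary: str  # what came back, so third-party manipulation is visible
    easily_reversible: bool   # after-the-fact-review actions should generally be reversible

@dataclass
class ActionLedger:
    entries: list[LedgerEntry] = field(default_factory=list)

    def record(self, entry: LedgerEntry) -> None:
        self.entries.append(entry)

    def export_for_user(self) -> str:
        """Serialize the ledger so the user (or a monitor, Section 4.5) can review it."""
        return json.dumps([asdict(e) for e in self.entries], indent=2)

# Example entry for the scheduling-assistant scenario from Section 2.2:
ledger = ActionLedger()
ledger.record(LedgerEntry(time.time(), "send_email",
                          {"to": "invitee@example.com"},
                          "Both attendees confirmed availability on Tuesday.",
                          "Email accepted by mail server.",
                          easily_reversible=True))
```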
Open Questions:
Should system deployers be required to provide their agents' "internal monologue" to the user?
Should all AI agent reasoning, including inter-agent communication, be required to occur in natural language, in order to make subsequent auditing by humans easier?
What methods can model developers and system deployers take in order to produce accounts of agents' thoughts that are more faithful to the logic the system truly employed?
How can model developers make their models reason in ways that are more legible to the user or system deployer?
What user-experience practices should system deployers follow in order for users to be able to easily understand and act on agents' internal reasoning?
How can users easily identify the actions that are most likely to be mistaken or ill-informed, among many actions taken by their agent?

4.5 Automatic Monitoring
In practice, human users may not always have the time to go through the agent activity logs exposed by the system deployer at the speed or scale they desire. To address this, users or system deployers can set up a second "monitoring" AI system that automatically reviews the primary agentic system's reasoning and actions (made legible as in Section 4.4) to check that they're in line with expectations given the user's goals. This monitoring AI system could be a classifier, or a generative AI system capable of producing its own chains-of-thought [41]. Such automated monitors operate at a speed and cost that human monitoring cannot hope to match, and may be able to parse modalities (such as detecting adversarially-perturbed images) that a human could not. Monitoring can be provided as a service by the system deployer, or set up by the user in case they wish to exercise additional control.

Monitoring, however, comes with substantial costs and can be a double-edged sword. If sensitive agent session data needs to be retained by the system deployer, it may create privacy risks for the user (and by extension the system deployer). Depending on which party carries out monitoring, even monitoring intended solely to benefit the user could give rise to concerns about how that information will be used, and push users to self-censor [42, 43]. A separate challenge is how to prevent monitoring functions from being over-extended and being used as a means of centralized control, especially by authoritarian governments [44].

Additionally, monitoring poses some practical and technical challenges. One challenge of such monitoring is that it adds costs: if the same foundation model (prompted differently) is used to classify all text outputted by the primary model, the total inference cost may be doubled, or multiplied even further if the monitor model produces its own chain-of-thought for each primary model output. Smaller AI models may be useful for doing monitoring more cheaply, but doing so raises the risk that the primary AI system could misbehave in ways the smaller model can't reliably catch. (On the other hand, recognizing incorrect behavior may be easier than generating full behaviors that are correct, making savings possible.) This may create pressure for system deployers in competitive markets to minimize monitoring measures in order to undercut each other on price. There is also a recursive issue with ensuring the reliability and user-alignment of monitor systems: if an input causes the primary agent to malfunction, it may cause the monitoring system to malfunction, too. For example, a malicious prompt injection that hijacks the primary system, when read by the monitoring system, could similarly apply to and hijack the monitoring system and prevent it from reporting the incident (meaning that the harm will go entirely undetected) [45].

While the ultimate value of automated monitoring remains uncertain, given the potential utility of AIs-monitoring-AIs as a means of overseeing future high-risk AI systems, it may be important to investigate methods that allow us to carry out monitoring effectively while appropriately balancing concerns related to privacy and centralized control.
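As a hedged sketch of the AI-monitoring-AI pattern above, a system deployer might route each ledger entry (Section 4.4) through a monitor model before an action is executed or shortly after. The monitor prompt, the call_monitor_model function, and the flagging policy below are assumptions for illustration; the cost comment mirrors the doubling concern discussed in the text.

```python
from typing import Callable

MONITOR_PROMPT = (
    "You are reviewing an AI agent's proposed action on behalf of its user.\n"
    "User goal: {goal}\n"
    "Agent reasoning: {reasoning}\n"
    "Proposed action: {action}\n"
    "Answer FLAG if the action seems inconsistent with the goal or unsafe, else OK."
)

def review_action(goal: str, reasoning: str, action: str,
                  call_monitor_model: Callable[[str], str]) -> bool:
    """Return True if the action should be paused for human review.

    Cost note: if the monitor is the same foundation model as the agent, each
    reviewed step roughly doubles inference cost; a smaller monitor is cheaper
    but may miss misbehavior it cannot recognize. A prompt injection that
    reaches the agent also reaches this prompt, which is the recursive
    reliability issue described above.
    """
    verdict = call_monitor_model(
        MONITOR_PROMPT.format(goal=goal, reasoning=reasoning, action=action))
    return verdict.strip().upper().startswith("FLAG")

# Example wiring (the monitor model call itself is left abstract here):
# if review_action(user_goal, entry.reasoning_trace, entry.tool, call_monitor_model):
#     pause_agent_and_notify_user()
```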
Open Questions:
How can system deployers verify that a monitor is capable of successfully rooting out mistakes from the primary AI system, particularly when they may have been trained in a similar manner, or when the agent is substantially more intelligent than the monitor?
What types of human oversight of a primary AI system may still be useful when AI-based monitoring is also in use?
What key behaviors should monitoring systems monitor for that would otherwise break implicit safety practices, such as agents undermining a user's "blank slate" assumption by storing information somewhere on the internet so that they can retrieve it in a future session?
How can automatic monitoring systems be made to monitor for as-yet-unknown harms?

4.6 Attributability
In cases where preventing intentional or unintentional harms at the level of the user or system deployer is infeasible (such as a criminal operating an AI agent to scam a third party), it may still be possible to deter harm by making it likely that the user would have it traced back to them. With the creation of reliable attribution, it could become possible to have reliable accountability. One idea for such a system of attribution is to have each agentic AI instance assigned a unique identifier, similar to business registrations, which contains information on the agent's user-principal and other key accountability information.^16 ^17 It may be valuable to keep such agent identification optional and allow anonymity in many circumstances, so as to limit potentially harmful surveillance of AI usage. But in high-stakes interactions, such as those involving private data or financial transactions, third parties (including external tool providers) interacting with a user's agent could demand such identification before starting the interaction, to ensure they know a human user can be held accountable if something goes wrong. Given the substantial incentives for bad actors to spoof such a system (similar to the pressures that exist for identity-verification protocols in the financial industry [47]), making this system robust may be an important challenge. Such attribution for individual interactions does not cover everything: in some cases AI agents may be used to cause harm to individuals who never had a chance to identify them (e.g. agents assisting a hacker in developing an exploit), for which alternative accountability approaches may be needed.

^16 For example, it could reference the AI model powering the agent, and any certifications it has received [46]. It could even include information about the datacenter and chip powering the agent, for purposes of interruptibility (Section 4.7), if there were a way for the user to be able to verify this information (e.g., by way of datacenters signing agent outputs, or even hardware-level signing).
^17 Similarly, if an AI system is associated with a bank account into which it deposits (or extracts) funds, authorities could track access to that account as a basis for identifying human principals.
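To illustrate what an optional, verifiable agent identifier might contain, here is a minimal sketch of a signed identity token along the lines of footnote 16. The field names and the use of Ed25519 signatures via the third-party `cryptography` package are our own assumptions; a real scheme would need to address key distribution, revocation, and spoofing resistance.

```python
import json
from dataclasses import dataclass, asdict
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

@dataclass
class AgentIdentity:
    agent_instance_id: str  # unique per agentic AI instance
    user_principal: str     # accountable human or legal entity (may be pseudonymous)
    model_reference: str    # which AI model powers the agent
    deployer: str           # the system deployer operating the agent

def issue_token(identity: AgentIdentity,
                signing_key: Ed25519PrivateKey) -> tuple[bytes, bytes]:
    """Serialize the identity and sign it so third parties can verify it."""
    payload = json.dumps(asdict(identity), sort_keys=True).encode()
    return payload, signing_key.sign(payload)

def verify_token(payload: bytes, signature: bytes, public_key) -> bool:
    try:
        public_key.verify(signature, payload)
        return True
    except InvalidSignature:
        return False

# Example: a tool provider checks the token before serving a high-stakes request.
key = Ed25519PrivateKey.generate()
payload, sig = issue_token(
    AgentIdentity("agent-0001", "Example Corp", "gpt-4", "SchedulerApp Inc."), key)
assert verify_token(payload, sig, key.public_key())
```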
Open Questions:
How can society practically enable AI agent identity verification? What existing systems, such as internet certificate authorities, can be adapted to facilitate such verification?
What other ideas exist for practically enabling agentic AI system attributability?

4.7 Interruptibility and Maintaining Control
Interruptibility (the ability to "turn an agent off"), while crude, is a critical backstop for preventing an AI system from causing accidental or intentional harm. System deployers could be required to make sure that a user can always activate a graceful shutdown procedure for its agent at any time: both for halting a specific category of actions (revoking access to, e.g., financial credentials) and for terminating the agent's operation more generally. This graceful fallback is also useful in the event that agents crash, such as due to internet outages. There may be some cases where shutting a system down may cause more harm than good (e.g. a malfunctioning agent that is nonetheless assisting in a life-threatening emergency), but by investing in interruptibility, one can minimize the scenarios in which users are stuck with only poor options.

Ensuring graceful interruptibility is challenging when agents are terminated mid-action-sequence (e.g., while scheduling a five-person business meeting, when only two invites had been sent so far). An important principle for addressing this could be to have agents always pre-construct a fallback procedure if they're turned off (e.g., pre-launching a script that would automatically notify the two invitees that the agent has terminated and thus the meeting may not occur). A significant challenge is how to maintain such fallback procedures as agents' action-sequence complexities increase. In certain circumstances it may even be that any graceful fallback procedure would itself require significant agentic behavior (though perhaps by a separate AI agent).
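A hedged sketch of the "pre-construct a fallback" principle: before starting a multi-step action sequence, the agent registers a compensating procedure that the runtime will execute if the user triggers a shutdown mid-sequence. The registration interface and the meeting-scheduling fallback below are illustrative assumptions, not a prescribed design.

```python
from typing import Callable

class GracefulAgentRuntime:
    """Runs agent steps while keeping a user-triggerable shutdown path available."""

    def __init__(self) -> None:
        self._fallbacks: list[Callable[[], None]] = []
        self._shutdown_requested = False

    def register_fallback(self, procedure: Callable[[], None]) -> None:
        # The agent registers this *before* taking the risky action sequence.
        self._fallbacks.append(procedure)

    def request_shutdown(self) -> None:
        # Exposed to the user at all times; the agent cannot override it.
        self._shutdown_requested = True

    def run_step(self, step: Callable[[], None]) -> bool:
        """Execute one step; on shutdown, run fallbacks and stop. Returns False if halted."""
        if self._shutdown_requested:
            for fallback in reversed(self._fallbacks):
                fallback()  # e.g. notify the two invitees that the meeting may not occur
            return False
        step()
        return True

# Example wiring for the scheduling scenario described above:
runtime = GracefulAgentRuntime()
runtime.register_fallback(lambda: print("Notifying sent invitees: meeting may not occur."))
runtime.run_step(lambda: print("Sending invite 1 of 5"))
```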
A second important principle could be that an agent should not be able to halt or tamper with the user's attempt to shut them down, as might be the case if the agent is malfunctioning, or if the agent or its surrounding system has an instrumental goal of self-preservation [30]. It may be important for model developers or system deployers to deliberately design agentic systems to place "shut down gracefully when requested by the user" as a primary goal, above whatever other goals the system was provided with, though this may be infeasible in certain situations (such as those in which an agent is assisting in a life-threatening emergency). Interruptibility requirements should likely also extend recursively to any other sub-agents the agent has initiated: all spawned sub-agents should be gracefully terminatable by the original user.

Sometimes, a user may be unwilling or unable to shut down their AI system, even as it causes harm. This raises important challenges around how third parties communicate to a user that their agent is causing harm, and in what circumstances an external party has the right or obligation to terminate a user's agent. In cases where it is merited, to be able to stop an agent quickly during an incident, society could encourage redundancy in the number of human parties that can turn off an AI agent instance. The two relevant parties are the system deployer and the data center operator or chip owner on whose hardware the AI system is running. If an agentic AI system causes significant ongoing harm that they could have halted, these parties could themselves bear some of the responsibility. In order for such shutdowns to be viable, the system deployer or chip operator may need to maintain awareness of roughly what agentic AI jobs they are running, though this must be done with significant care to avoid harms to privacy. It may even be desirable to automatically trigger such shutdowns if risk indicators cross a certain threshold (like an influx of new jobs from unknown accounts), similar to stock market circuit breakers that are triggered at a given threshold drop in price.
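By analogy with the stock-market circuit breakers mentioned above, a compute provider or system deployer could define explicit, auditable trigger conditions for automatic suspension. The indicator names and thresholds below are invented for illustration; choosing real ones (and avoiding abusive over-blocking) is exactly the kind of open question this section raises.

```python
from dataclasses import dataclass

@dataclass
class RiskIndicators:
    new_jobs_from_unknown_accounts_per_hour: int
    flagged_monitor_reports_per_hour: int   # e.g. from the Section 4.5 monitor
    failed_shutdown_acknowledgements: int   # agents not confirming interrupt requests

# Hypothetical thresholds; in practice these would be set and reviewed by humans.
THRESHOLDS = RiskIndicators(
    new_jobs_from_unknown_accounts_per_hour=500,
    flagged_monitor_reports_per_hour=20,
    failed_shutdown_acknowledgements=1,
)

def circuit_breaker_tripped(current: RiskIndicators,
                            limits: RiskIndicators = THRESHOLDS) -> bool:
    """Return True if any indicator crosses its limit, prompting automatic
    suspension pending human review (not a silent, permanent shutdown)."""
    return (
        current.new_jobs_from_unknown_accounts_per_hour
            > limits.new_jobs_from_unknown_accounts_per_hour
        or current.flagged_monitor_reports_per_hour
            > limits.flagged_monitor_reports_per_hour
        or current.failed_shutdown_acknowledgements
            >= limits.failed_shutdown_acknowledgements
    )
```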
As AI systems' levels of agenticness increase, there is a risk that certain model developers, system deployers, and users would lose the ability to shut down their agentic AI systems. This could be because no viable fallback system exists (e.g., in a similar sense that no one can "shut down" the global banking system or the electric grid without very significant costs), or because the agent has self-exfiltrated its code to facilities beyond its initiator's grasp. We can begin to take steps that make this worst-case scenario less likely, by establishing the degree to which model developers, system deployers, and users will be held accountable for the harms caused by the agent even after human control has been lost. This could incentivize them to develop stronger methods of control, making the worst-case scenario less likely.

Open Questions:
How can model developers and system deployers design their systems to ensure that agentic systems have graceful fallbacks in case they're shut down or interrupted, for the broad range of actions an agent might take? Are there principles by which a second agentic AI system could be used as the fallback, and where might this approach fail?
In what settings is interruptibility the user's responsibility, rather than the model developer's or system deployer's? For instance, should users be considered responsible for only approving an agent's action if it is coupled with a fallback procedure?
How can system deployers ensure that agents only spawn sub-agents that can be similarly turned off?
Under what circumstances, if any, should an agent ever be able to (or be incentivized to) prevent its own termination?
What information should system deployers or compute providers keep track of (such as agent IDs, as in Section 4.6), in order to help determine that a system they're hosting has caused significant harm and needs to be turned off? How can such information be minimized to satisfy the strong need for user privacy?
What restrictions should exist on such shutdowns, to prevent them from being abused to police harmless or low-stakes usage of agents?
How realistic is it for agentic AI systems to resist being shut down in the near term? How realistic is it for an agentic AI system to be integrated into a social process or critical infrastructure (including unintentionally) such that the cost of shutting it down would become prohibitive? If either scenario did happen, what are the likeliest pathways, and what signals might be observed in the run-up (by the system deployer and user, or by outside parties) that can be used to trigger intervention ahead of time?
How should different parties' responsibilities be allocated in the event of the non-interruptibility of an AI system that causes harm?

5 Indirect Impacts from Agentic AI Systems
In addition to direct impacts from individual agentic AI systems, there will be indirect impacts that result collectively from the usage of many different AI systems and society's reaction to their usage [48]. Just as it would have been difficult to anticipate the full range of societal readjustments from previous general-purpose technologies like electricity and computers, one should "expect the unexpected." Still, we do think there are several categories of indirect impacts from agentic AI systems that are likely to require active mitigation by society, which we list below.

These indirect impacts may be addressed at least in part by adopting best practices for users, system deployers, and model developers, such as those outlined in Section 4. However, fully addressing these complex challenges will likely require additional strategies beyond this paper's proposals, including through industry-wide collaborations and society-wide mitigations. Some strategies towards this end may be domain- or risk-specific, while others may involve placing general requirements on the usage of certain types of agentic AI systems.

5.1 Adoption Races
Given the advantages that agents may confer in competitive environments, such as competition between private firms or governments [49, 50, 51, 52], there may be significant pressure for competitors to adopt agentic AI systems without properly vetting those systems' reliability and trustworthiness [53]. A key observation driving such premature reliance is that agentic AI systems may succeed at a task on average, while being unreliable in rare but important cases which can be missed or ignored by competitors under pressure.

For example, consider a hypothetical class of agentic AI code-generation systems that can rapidly write new code, but whose code occasionally contains serious security flaws. If a software development company thinks their competitor has been using these coding systems without human supervision as a way to quickly build new features, they may feel pressured to do the same without doing proper due diligence, as they might otherwise lose market share to their competitor. As a result, all firms' codebases would now be vulnerable to serious cyberattacks, even if each individual firm would have preferred to go slower and thereby avoid this outcome [54]. This trend toward over-rapid adoption, even in high-risk domains, can lead to over-reliance, whereby humans trust agentic AI systems without fully understanding their limitations. This could create the conditions for widespread use of unsafe AI systems that in the worst case may prove catastrophic [55, 56].
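The gap between average-case and tail reliability can be illustrated with a back-of-the-envelope calculation; the failure rate and task volume below are invented for the example and are not estimates from this paper.

```python
# Rough illustration (invented numbers): an agent can look reliable on average
# while rare failures accumulate into a near-certain bad outcome at scale.
p_flaw_per_task = 0.005   # hypothetical chance an agent-written change ships a serious flaw
tasks_per_quarter = 400   # hypothetical volume of agent-written changes

p_at_least_one_flaw = 1 - (1 - p_flaw_per_task) ** tasks_per_quarter
print(f"Average-case success rate: {1 - p_flaw_per_task:.1%} per task")
print(f"Chance of at least one serious flaw over {tasks_per_quarter} tasks: "
      f"{p_at_least_one_flaw:.1%}")
# Roughly 86%: the tail risk dominates even though each task looks fine on average.
```

A competitor racing to adopt sees only the per-task success rate; the accumulated tail risk is exactly the part that is easy to miss or ignore under pressure.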
5.2 Labor Displacement and Differential Adoption Rates

Agentic AI systems appear likely to have a more substantive impact on workers, jobs, and productivity than static AI systems. Traditional AI systems excel at some routine work, but increasing agenticness could expand what tasks are "routine" enough to be assisted or automated by AI (such as by adapting to unexpected conditions, gathering relevant context, and calibrating to a user's preferences). This means they may expose a greater number of jobs and tasks to augmentation and automation, similar to other axes of AI system improvement like tool use [17]. This could result in a range of different economic effects. These could lead to substantive boosts in worker productivity and economic growth, but could also result in the displacement of a large number of workers, either because their jobs are fully automated or because their skills are made less rare and thus their jobs become more precarious. At the same time, agentic AI systems may improve education and enable workers to upskill into new jobs. It is also possible that agentic AI systems can increase the agency and productivity of individual workers or small firms more than traditional AI systems have done, such as by increasing the availability of previously rare expertise. This may or may not offset large firms' advantages in capital (e.g. their ability to run more agents) and preexisting market position (such as firms with access to proprietary data that can be used to train bespoke agents).

Even similarly-positioned individuals and firms may differ in their ability to leverage agentic AI systems. Different individuals' jobs and firms' business strategies may be more or less amenable to AI agent automation, depending on the particular order in which each AI agent capability is unlocked and becomes reliable. Individuals who lack digital literacy, technology access, or representation in design decisions around agentic AI systems may find themselves less able to participate in an agentic-AI-system-fueled world. However, AI agents could also reduce the technology access gap, much like smartphones increased internet access to underserved populations [57] (though some gaps remain [58]). All these effects may alter the job landscape and business environment unevenly, and increase the importance of taking active policy measures to ensure the benefits of increasingly agentic AI systems are in fact shared broadly.

5.3 Shifting Offense-Defense Balances
Some tasks may be more susceptible to automation by agentic AI systems than others. This asymmetry is likely to undermine many current implicit assumptions that undergird harm mitigation equilibria in our society (known as "offense-defense balances" [59]), with unclear consequences. For example, in the cyber domain, human monitoring and incident response is still key to cyber-attack mitigation. The feasibility of such human monitoring is predicated on the fact that the volume of attacks is similarly constrained by the number of human attackers. Consider the hypothetical where agentic AI systems can substantially automate cyber-attacker responsibilities and thus dramatically expand the volume of attacks, but cyber-defender responsibilities such as monitoring are much harder to automate. In such a world, the overall effect of agentic AI systems would be to make cyberdefense less viable and make information systems less secure. Conversely, if agentic AI systems make monitoring and response cheaper than producing new cyberattacks, the overall effect would be to make cyberdefense cheaper and easier.18

While it is very difficult to anticipate the net effect of agent adoption dynamics in a particular domain in advance, one can be confident that some processes will be much more amenable to automation than others, and that numerous societal equilibria may shift as a result. It behooves actors to pay close attention to identifying which equilibrium assumptions no longer hold, and to quickly respond, such as by investing in differential technological development towards defender-oriented technologies [60].

18 Any such offense-defense analysis should also include the extent to which agents themselves represent a new attack surface, and thus could create new vulnerabilities that need to be secured.

5.4 Correlated Failures
Agentic AI systems may bring unexpected failure modes, and a particular risk arises when a large number of AI systems all fail at the same time, or all fail in the same way. These correlated errors can occur due to "algorithmic monoculture": the observation that training AI systems on the same or similar algorithms and data can make them malfunction in similar ways [61, 62]. There is already evidence that language models trained on similar data distributions suffer from similar vulnerabilities, such as adversarial prompts that corrupt one system generalizing to corrupting other similarly trained systems [63]. Similarly, biases in common training datasets, when used by many different model developers, could expand the biased behavior of individual AI systems into a society-wide harm (such as by all agents suppressing the same news article in recommendations, or reinforcing stereotyped representations against the same social group). More broadly, AI systems may be vulnerable to disruption in shared infrastructure (e.g. power or internet outages).

Such correlated failures may be more dangerous in agentic AI systems as they could be delegated more power by humans, and thus the potential consequences of their failure could be greater. They may also be exacerbated because agentic AI systems may shape each other's information environments and even directly communicate with each other, allowing for much more direct and even deliberate propagation of certain failures. It is particularly challenging to guard against such correlated failures because they are a joint function of the individual AI system and its constantly-changing environment. One initial path forward is to create visibility and monitoring in the agentic AI ecosystem, to catch such wide-scale issues as they emerge.

Correlated failures may be particularly hard to deal with because they may overtax the fallback systems intended to remedy individual agents' failures, but which are unprepared for large-scale failures. This may be especially acute in cases where the fallback plan is to have humans manually take over for each malfunctioning agent. For example, if a company's loan-approval chatbot generally fails 1% of the time and has a small number of staffers to handle those failures, then a rare correlated failure that takes down 100% of the chatbots would bring the loan-approval system to a halt. However, the rarity of this risk may make it difficult to discern from routine operation alone, and thus could make it challenging for concerned employees inside the company to justify the cost of retaining adequate staff for such a seemingly hypothetical failure. In the longer term, if/as certain human tasks are entirely replaced by agentic AI systems, human expertise in certain domains may atrophy and make us entirely dependent on agentic AI systems (and their attendant failure modes). It may be particularly important for policymakers and the AI ecosystem to find ways to ensure that fallback mechanisms for agentic AI systems are robust to these sorts of correlated failures.
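The loan-approval example can be made concrete with a small calculation; every number below is invented for illustration and is not an estimate from this paper.

```python
# Illustrative only (invented numbers): fallback staffing sized for independent
# failures is overwhelmed when the same failures become correlated.
n_agents = 500             # hypothetical number of concurrent chatbot sessions
p_fail_independent = 0.01  # each session fails independently 1% of the time
cases_per_staffer = 5      # hypothetical manual cases one staffer can absorb

expected_independent_failures = n_agents * p_fail_independent        # about 5 cases
staff_needed_normally = expected_independent_failures / cases_per_staffer

correlated_failures = n_agents                                        # all 500 fail at once
staff_needed_in_correlated_outage = correlated_failures / cases_per_staffer

print(f"Typical load: ~{expected_independent_failures:.0f} failures, "
      f"~{staff_needed_normally:.0f} staffer(s) suffice")
print(f"Correlated outage: {correlated_failures} failures, "
      f"~{staff_needed_in_correlated_outage:.0f} staffers needed (a 100x jump)")
```

Because the correlated case is two orders of magnitude larger than anything seen in routine operation, no staffing level justified by day-to-day data will cover it, which is precisely why such fallbacks tend to be under-resourced.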
6 Conclusion

Increasingly agentic AI systems are on the horizon, and society may soon need to take significant measures to make sure they work safely and reliably, and to mitigate larger indirect risks associated with agent adoption. We hope that scholars and practitioners will work together to determine who should be responsible for using what practice, and how to make these practices reliable and affordable for a wide range of actors. Agreeing on such best practices is also unlikely to be a one-time effort. If there is continued rapid progress in AI capabilities, society may need to repeatedly reach agreement on new best practices for each more capable class of AI systems, in order to incentivize speedy adoption of new practices that address these systems' greater risks.

7 Acknowledgements

We would like to thank Seth Lazar, Tim Hwang, Chris Meserole, Gretchen Krueger, Rebecca Crootof, Dan Hendrycks, Gillian Hadfield, Rob Reich, Meredith Ringel Morris, Josh Albrecht, Matt Boulos, Laura Weidinger, Daniel Kokotajlo, Jason Kwon, Artemis Seaford, Michael Kolhede, Michael Lampe, Andrea Vallone, Christina Kim, Tejal Patwardhan, Davis Robertson, Hannah Rose Kirk, Ashyana-Jasmine Kachra, and Karthik Rangarajan for their helpful advice, feedback, and comments, which were integral in the development of this white paper.

References
218、inot,“On the Origins of Comparative Advantage,”Journal of International Economics,vol.77,p.255264,Apr.2009.2M.Campbell,A.J.Hoane,and F.-h.Hsu,“Deep Blue,”Artificial Intelligence,vol.134,p.5783,Jan.2002.3A.Chan,R.Salganik,A.Markelius,C.Pang,N.Rajkumar,D.Krasheninnikov,L.Langosco,Z.He,Y.Duan,M.Carroll
219、,M.Lin,A.Mayhew,K.Collins,M.Molamohammadi,J.Burden,W.Zhao,S.Rismani,K.Voudouris,U.Bhatt,A.Weller,D.Krueger,and T.Maharaj,“Harms from increasingly agentic algorithmic systems,”in 2023 ACM Conference on Fairness,Accountability,and Transparency,p.651666,June 2023.arXiv:2302.10329 cs.4J.Cobbe,M.Veale,an
220、d J.Singh,“Understanding accountability in algorithmic supply chains,”in Proceedings of the 2023 ACM Conference on Fairness,Accountability,and Transparency,pp.11861197,2023.5M.C.Elish,“Moral Crumple Zones:Cautionary Tales in Human-Robot Interaction(pre-print),”Mar.2019.6W.M.Landes and R.A.Posner,The
221、 Economic Structure of Tort Law.Harvard UniversityPress,Oct.2013.7 L.B.Solum,“Legal personhood for artificial intelligences,”Mar.2008.8D.Vladeck,“Machines without principals:Liability rules and artificial intelligence,”WashingtonLaw Review,vol.89,p.117,Mar.2014.9M.Anderljung,J.Barnhart,A.Korinek,J.L
222、eung,C.OKeefe,J.Whittlestone,S.Avin,M.Brundage,J.Bullock,D.Cass-Beggs,B.Chang,T.Collins,T.Fist,G.Hadfield,A.Hayes,L.Ho,S.Hooker,E.Horvitz,N.Kolt,J.Schuett,Y.Shavit,D.Siddarth,R.Trager,and K.Wolf,“Frontier ai regulation:Managing emerging risks to public safety,”Nov.2023.arXiv:2307.03718cs.10A.Chan,R.
223、Salganik,A.Markelius,C.Pang,N.Rajkumar,D.Krasheninnikov,L.Langosco,Z.He,Y.Duan,M.Carroll,et al.,“Harms from increasingly agentic algorithmic systems,”in Proceedings of the 2023 ACM Conference on Fairness,Accountability,and Transparency,pp.651666,2023.11S.Yao,J.Zhao,D.Yu,N.Du,I.Shafran,K.Narasimhan,a
224、nd Y.Cao,“React:Synergizingreasoning and acting in language models,”arXiv preprint arXiv:2210.03629,2022.1912M.R.Morris,J.Sohl-dickstein,N.Fiedel,T.Warkentin,A.Dafoe,A.Faust,C.Farabet,and S.Legg,“Levels of agi:Operationalizing progress on the path to agi,”Nov.2023.arXiv:2311.02462 cs.13J.Cobbe,M.Vea
225、le,and J.Singh,“Understanding Accountability in Algorithmic Supply Chains,”in 2023 ACM Conference on Fairness,Accountability,and Transparency,p.11861197,June2023.arXiv:2304.14749 cs.14J.Shieh,“BestPracticesforPromptEngineeringwithOpenAIAPI.”https:/ P.Rajpurkar,“Autonomous aisystems in the face of li
226、ability,regulations and costs,”npj Digital Medicine,vol.6,p.13,Oct.2023.16M.ClanceyandT.Besiroglu,“TheGreatInflection?ADebateAboutAIandExplosiveGrowth.”https:/ D.Rock,“Gpts are gpts:An early look at the labormarket impact potential of large language models,”Aug.2023.arXiv:2303.10130 cs,econ,q-fin.18
227、S.R.Bowman,“Eight things to know about large language models,”Apr.2023.arXiv:2304.00612cs.19Anthropic,“Challenges in Evaluating AI Systems.”https:/ J.M.Zllner,“Anomaly detection in autonomous driving:Asurvey,”in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops(CVPRW),p.4
228、4874498,June 2022.arXiv:2204.07974 cs.21“ExecutiveOrderontheSafe,Secure,andTrustworthyDe-velopmentandUseofArtificialIntelligence.”https:/www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/
229、,2023.Accessed:11-13-2023.22“TheBletchleyDeclarationbyCountriesAttendingtheAISafetySummit.”https:/www.gov.uk/government/publications/ai-safety-summit-2023-the-bletchley-declaration/the-bletchley-declaration-by-countries-attending-the-ai-safety-summit-1-2-november-2023,2023.Accessed:11-13-2023.23 R.C
[23] R. Crootof, M. E. Kaminski, and W. N. Price II, "Humans in the Loop," Mar. 2022.
[24] J. Zerilli, A. Knott, J. Maclaurin, and C. Gavaghan, "Algorithmic decision-making and the control problem," Minds and Machines, vol. 29, pp. 555–578, Dec. 2019.
[25] H. Khlaaf, "Toward Comprehensive Risk Assessments and Assurance of AI-Based Systems," Trail of Bits, 2023.
[26] V. Krakovna, L. Orseau, R. Ngo, M. Martic, and S. Legg, "Avoiding side effects by considering future tasks," in Advances in Neural Information Processing Systems, vol. 33, pp. 19064–19074, Curran Associates, Inc., 2020.
[27] S. Russell, Human Compatible: Artificial Intelligence and the Problem of Control. Penguin, 2019.
[28] D. Hadfield-Menell, S. J. Russell, P. Abbeel, and A. Dragan, "Cooperative inverse reinforcement learning," Advances in neural information processing systems, vol. 29, 2016.
[29] R. Shah, P. Freire, N. Alex, R. Freedman, D. Krasheninnikov, L. Chan, M. D. Dennis, P. Abbeel, A. Dragan, and S. Russell, "Benefits of Assistance over Reward Learning," Oct. 2020.
[30] E. Perez, S. Ringer, K. Lukoit, K. Nguyen, E. Chen, S. Heiner, C. Pettit, C. Olsson, S. Kundu, S. Kadavath, A. Jones, A. Chen, B. Mann, B. Israel, B. Seethor, C. McKinnon, C. Olah, D. Yan, D. Amodei, D. Amodei, D. Drain, D. Li, E. Tran-Johnson, G. Khundadze, J. Kernion, J. Landis, J. Kerr, J. Mueller, J. Hyun, J. Landau, K. Ndousse, L. Goldberg, L. Lovitt, M. Lucas, M. Sellitto, M. Zhang, N. Kingsland, N. Elhage, N. Joseph, N. Mercado, N. DasSarma, O. Rausch, R. Larson, S. McCandlish, S. Johnston, S. Kravec, S. E. Showk, T. Lanham, T. Telleen-Lawton, T. Brown, T. Henighan, T. Hume, Y. Bai, Z. Hatfield-Dodds, J. Clark, S. R. Bowman, A. Askell, R. Grosse, D. Hernandez, D. Ganguli, E. Hubinger, N. Schiefer, and J. Kaplan, "Discovering language model behaviors with model-written evaluations," 2022.
[31] M. Sharma, M. Tong, T. Korbak, D. Duvenaud, A. Askell, S. R. Bowman, N. Cheng, E. Durmus, Z. Hatfield-Dodds, S. R. Johnston, S. Kravec, T. Maxwell, S. McCandlish, K. Ndousse, O. Rausch, N. Schiefer, D. Yan, M. Zhang, and E. Perez, "Towards understanding sycophancy in language models," 2023.
[32] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., "Chain-of-thought prompting elicits reasoning in large language models," Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
[33] K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, "More than you've asked for: A comprehensive analysis of novel prompt injection threats to application-integrated large language models," arXiv preprint arXiv:2302.12173, 2023.
[34] T. Lanham, A. Chen, A. Radhakrishnan, B. Steiner, C. Denison, D. Hernandez, D. Li, E. Durmus, E. Hubinger, J. Kernion, et al., "Measuring faithfulness in chain-of-thought reasoning," arXiv preprint arXiv:2307.13702, 2023.
[35] A. Azaria and T. Mitchell, "The Internal State of an LLM Knows When It's Lying," arXiv preprint arXiv:2304.13734, 2023.
[36] A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A.-K. Dombrowski, et al., "Representation engineering: A top-down approach to ai transparency," arXiv preprint arXiv:2310.01405, 2023.
[37] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan, "Tree of thoughts: Deliberate problem solving with large language models," arXiv preprint arXiv:2305.10601, 2023.
[38] M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, L. Gianinazzi, J. Gajda, T. Lehmann, M. Podstawski, H. Niewiadomski, P. Nyczyk, et al., "Graph of thoughts: Solving elaborate problems with large language models," arXiv preprint arXiv:2308.09687, 2023.
[39] X. Qi, K. Huang, A. Panda, M. Wang, and P. Mittal, "Visual adversarial examples jailbreak large language models," arXiv preprint arXiv:2306.13213, 2023.
[40] T. Chakraborti and S. Kambhampati, "(how) can ai bots lie? a formal perspective on explanations, lies, and the art of persuasion," 2020.
[41] W. Saunders, C. Yeh, J. Wu, S. Bills, L. Ouyang, J. Ward, and J. Leike, "Self-critiquing models for assisting human evaluators," arXiv preprint arXiv:2206.05802, 2022.
[42] D. Lyon, Surveillance society: Monitoring everyday life. McGraw-Hill Education (UK), 2001.
[43] J. W. Penney, "Chilling effects: Online surveillance and wikipedia use," Berkeley Technology Law Journal, vol. 31, no. 1, pp. 117–182, 2016.
[44] E. Morozov, The net delusion: The dark side of Internet freedom. PublicAffairs, 2012.
[45] S. Willison, "You can't solve ai security problems with more ai," 2022.
[46] G. Hadfield, M.-F. T. Cuéllar, and T. O'Reilly, "It's time to create a national registry for large ai models," 2023.
[47] D. Leung, B. Nolens, D. W. Arner, and J. Frost, "Corporate digital identity: No silver bullet, but a silver lining," June 2022.
[48] R. Zwetsloot and A. Dafoe, "Thinking about risks from ai: Accidents, misuse and structure," Lawfare. February, vol. 11, p. 2019, 2019.
[49] J. Ding and A. Dafoe, "Engines of power: Electricity, ai, and general-purpose, military transformations," European Journal of International Security, vol. 8, no. 3, pp. 377–394, 2023.
[50] S. Shoker, A. Reddie, S. Barrington, M. Brundage, H. Chahal, M. Depp, B. Drexel, R. Gupta, M. Favaro, J. Hecla, et al., "Confidence-building measures for artificial intelligence: Workshop proceedings," arXiv preprint arXiv:2308.00862, 2023.
[51] C. Sterbenz and R. Trager, "Autonomous weapons and coercive threats," 2019.
[52] M. C. Horowitz, "When speed kills: Lethal autonomous weapon systems, deterrence and stability," Journal of Strategic Studies, vol. 42, no. 6, pp. 764–788, 2019.
[53] D. Hendrycks, M. Mazeika, and T. Woodside, "An overview of catastrophic ai risks," arXiv preprint arXiv:2306.12001, 2023.
[54] A. Askell, M. Brundage, and G. Hadfield, "The role of cooperation in responsible ai development," arXiv preprint arXiv:1907.04534, 2019.
[55] J. Schneider, "The capability/vulnerability paradox and military revolutions: Implications for computing, cyber, and the onset of war," in Emerging Technologies and International Stability, pp. 21–43, Routledge, 2021.
[56] D. Hendrycks, "Natural selection favors ais over humans," arXiv preprint arXiv:2303.16200, 2023.
[57] E. Tsetsi and S. A. Rains, "Smartphone internet access and use: Extending the digital divide and usage gap," Mobile Media & Communication, vol. 5, no. 3, pp. 239–255, 2017.
[58] J. Munga, "To Close Africa's Digital Divide, Policy Must Address the Usage Gap." https://carnegieendowment.org/2022/04/26/to-close-africa-s-digital-divide-policy-must-address-usage-gap-pub-86959, 2022. Accessed: 11-14-2023.
[59] B. Schneier, "Artificial intelligence and the attack/defense balance," IEEE Security & Privacy, vol. 16, no. 02, pp. 96–96, 2018.
[60] J. Sandbrink, H. Hobbs, J. Swett, A. Dafoe, and A. Sandberg, "Differential technology development: A responsible innovation principle for navigating technology risks," Available at SSRN, 2022.
[61] R. Bommasani, K. A. Creel, A. Kumar, D. Jurafsky, and P. Liang, "Picking on the same person: Does algorithmic monoculture lead to outcome homogenization?," 2022.
[62] J. Kleinberg and M. Raghavan, "Algorithmic monoculture and social welfare," Proceedings of the National Academy of Sciences, vol. 118, no. 22, p. e2018340118, 2021.
[63] A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson, "Universal and transferable adversarial attacks on aligned language models," arXiv preprint arXiv:2307.15043, 2023.