goo.gle/LearnLM
2024-05-14

Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach

Irina Jurenka*,1, Markus Kunesch*,1, Kevin McKee,1, Daniel Gillick,1, Shaojian Zhu,1, Sara Wiltberger,1, Shubham Milind Phal1, Katherine Hermann1, Daniel Kasenberg,1, Avishkar Bhoopchand1, Ankit Anand1, Miruna Pîslar1, Stephanie Chan,1, Lisa Wang,1, Jennifer She1, Parsa Mahmoudieh1, Aliya Rysbek1, Wei-Jen Ko3, Andrea Huber1, Brett Wiltshire1, Gal Elidan,2, Roni Rabin2, Jasmin Rubinovitz,4, Amit Pitaru4, Mac McAllister3, Julia Wilkowski3, David Choi8, Roee Engelberg2, Lidan Hackmon2, Adva Levin2, Rachel Griffin5, Michael Sears5, Filip Bar6, Mia Mesar3, Mana Jabbour3, Arslan Chaudhry1, James Cohan3, Sridhar Thiagarajan1, Nir Levine1, Ben Brown1, Dilan Gorur,1, Svetlana Grant1, Rachel Hashimoshoni3, Laura Weidinger1, Jieru Hu1, Dawn Chen3, Kuba Dolecki3, Canfer Akbulut1, Maxwell Bileschi1, Laura Culp1, Wen-Xin Dong3, Nahema Marchal1, Kelsie Van Deman4, Hema Bajaj Misra3, Michael Duah5, Moran Ambar2, Avi Caciularu2, Sandra Lefdal1, Chris Summerfield7, James An1, Pierre-Alexandre Kamienny1, Abhinit Mohdi3, Theofilos Strinopoulous3, Annie Hale5, Wayne Anderson5, Luis C. Cobo1, Niv Efron,2, Muktha Ananda3, Shakir Mohamed1, Maureen Heymans3, Zoubin Ghahramani1, Yossi Matias2, Ben Gomes3 and Lila Ibrahim1

*Equal contributions, Technical lead, Research lead, Workstream lead; 1Google DeepMind, 2Google Research, 3Google, 4Google Creative Lab, 5Arizona State University, 6Lund University, 7University of Oxford, 8Anthropic, work carried out while employed at Google DeepMind
6、epMindA major challenge facing the world is the provision of equitable and universal access to quality education.RecentadvancesingenerativeAI(genAI)havecreatedexcitementaboutthepotentialofnewtechnologiesto offer a personal tutor for every learner and a teaching assistant for every teacher.The full e
7、xtentof this dream,however,has not yet materialised.We argue that this is primarily due to the difficultieswith verbalising pedagogical intuitions into gen AI prompts and the lack of good evaluation practices,reinforced by the challenges in defining excellent pedagogy.Here we present our work collab
8、oratingwith learners and educators to translate high level principles from learning science into a pragmaticset of seven diverse educational benchmarks,spanning quantitative,qualitative,automatic and humanevaluations;and to develop a new set of fine-tuning datasets to improve the pedagogical capabil
9、ities ofGemini,introducing LearnLM-Tutor.Our evaluations show that LearnLM-Tutor is consistently preferredover a prompt tuned Gemini by educators and learners on a number of pedagogical dimensions.Wehope that this work can serve as a first step towards developing a comprehensive educational evaluati
10、onframework,and that this can enable rapid progress within the AI and EdTech communities towardsmaximising the positive impact of gen AI in education.1.IntroductionThe roughly 70 year history of Artificial Intelligence(AI)has been one of paradigm shifts:fromsymbolic systems,to Bayesian approaches,to
11、 deep learning,and in the last few years,generative AI(gen AI)large foundational models trained on huge swaths of media available on the internet togain an impressive set of general capabilities,whereby they are(most of the time)able to providea useful response to any user prompt or enquiry.Each par
12、adigm shift brought with it a unique setof hopes,opportunities,and challenges.Yet the current gen AI era is unprecedented:AI is moreaccessible than ever(because it only requires prompting through natural language),more capablethan ever,and appears to be improving faster than ever.Questions naturally
13、 arise about how toharness this technology for maximal social benefit.Corresponding author(s):learnlm-tech- 2024 Google DeepMind.All rights reservedTowards Responsible Development of Generative AI for Education:An Evaluation-Driven ApproachWhat does this error mean MatchingChar.java:14:error:int can
[Figure 1 graphic: four panels titled "LearnLM-Tutor Development" (a development loop of participation, model improvements, automatic evaluations, human evaluations, and deployment), "Evaluation: Teacher preferences", "Deployment: ASU Study Hall" (an example conversation in which LearnLM-Tutor helps a learner interpret the Java error "MatchingChar.java:14: error: int cannot be dereferenced", reason about the range of an else-if branch, and correct their use of the charAt() method), and "Participation: Learner feedback", with the quote "I would describe it as a helpful friend that knows a lot about one subject that can help you learn the class." from a LearnLM-Tutor Study Hall user.]
Figure 1 | LearnLM-Tutor Development: overview of our approach to responsible development of gen AI for education. Bold arrows show the development flow, dotted arrows the information flow. Our approach starts and ends with participation. We start by answering the questions of "who are we trying to help?", "what do they care about?", "who are all the relevant stakeholders?", and bring them into our development process. This informs the prioritisation of our model improvements work, and the development of our comprehensive evaluation benchmarks. These further inform model improvements (and each other) through a fast automatic evaluations-based and a slower human evaluations-based iteration loop. Finally, we use the deployment of our models to real users to further inform our research and development work, and to feed back into the participation stage. We use this approach to develop LearnLM-Tutor, a conversational AI tutor. Evaluation (teacher preferences): one of seven evaluation benchmarks introduced in this report. It shows that educators prefer LearnLM-Tutor over prompted [1] base Gemini 1.0 on the majority of measured pedagogical attributes. Deployment (ASU Study Hall): example conversation between LearnLM-Tutor and an ASU Study Hall student enrolled in the Introduction to Programming course. Participation (learner feedback): an interview quote from an ASU Study Hall student who has used LearnLM-Tutor during their course. We use interviews to get qualitative feedback on the efficacy and safety of the tutor.

One of the key challenges facing the world is the lack of universal and equitable access to quality education [2]. Education is a key economic driver [3] and a facilitator of upward social mobility [4]; however, even before the COVID-19 pandemic, 53% of all ten-year-old children in low- to middle-income countries were experiencing learning poverty [5], and 40% of US school district leads described their teacher shortages as "severe" or "very severe" [6]. The long-standing problems with educational attainment and teacher retention have been further exacerbated by the pandemic, disproportionately affecting those from less privileged backgrounds [5, 6].

The rise in gen AI that followed the pandemic has been met with mixed reactions. On the one hand, it appears to hold some promise to democratise access to knowledge and education: students are early adopters and top users of the technology [7], and gen AI is dominating the EdTech landscape [8]. On the other hand, several concerns have been raised about the misuse of this technology in educational settings [7, 9]. For example, the gen AI models that power most of the latest EdTech systems are not explicitly optimised for pedagogy. Instead, models are trained to be "helpful" [10–14], but this specific definition of helpfulness may often be at odds with pedagogy and learning. For example, students can easily get direct answers to homework assignments instead of working through them for themselves to get the intended practice. The availability of what appears to be "expert" information by prompting a gen AI model for an answer also gives students an illusion of mastery before it has been achieved, which may eventually lead to problems in the workplace [9, 15].

This report describes our first steps towards optimising gen AI for educational use cases. In particular, we focus on 1:1 conversational tutoring, and propose a comprehensive evaluation protocol for this use case. We focus on conversational tutoring because we believe that it is one of the most impactful and general use cases, and because it requires the integration of many important educational capabilities into a single system. An excellent conversational AI tutor has the potential to enhance the educational experience of both learners (by providing them with instant feedback and adapting to their individual needs) and teachers (by multiplying their impact and lightening their workload). We focus on evaluation, because it is clear that a shared framework across (and even within) learning science (see Section 3.1), EdTech (see Section 3.2), and AI for Education (see Section 4.2) is lacking, and such a framework would likely enable progress more than any single product. Furthermore, effective measures of pedagogical success are a prerequisite for optimising AI solutions, which need such signals for "hill-climbing". Our main contributions are the following:
1. We describe our approach to responsible development of AI for education (Figure 1), which is informed by the ethics and policy literature [16–26]. We emphasise a participatory (Section 2) and multidisciplinary approach to research, bringing together experts in pedagogy, cognitive science, AI, engineering, ethics, and policy, as well as the ultimate stakeholders, students and teachers, to translate insights from learning science into pragmatic and useful pedagogical improvements of Gemini 1.0 [10] for education.

2. We introduce LearnLM-Tutor, a new text-based gen AI tutor based on Gemini 1.0, further fine-tuned for 1:1 conversational tutoring (Section 3), and show that we improve its education-related capabilities over a prompt tuned Gemini 1.0.

3. We develop a comprehensive suite of seven pedagogical benchmarks (quantitative and qualitative, and using both human and automatic evaluations; Figure 2) intended for assessing the performance of conversational AI tutors from various angles. As a case study, we apply these evaluations to a prompt tuned [1] Gemini 1.0 and LearnLM-Tutor, providing a portfolio of evidence for pedagogical progress. We also discuss examples of more targeted evaluations and describe how we use them to develop specific educational capabilities for LearnLM-Tutor, like evaluative practice (Section 8.1) and feedback on procedural homework problems (Section 8.2). Our comprehensive approach goes beyond addressing the more common question of "Does it work?" (quantitative research), to also include "How and why does it work?" (qualitative research) and "Will it work for everyone?" (participatory research), in line with the recommendations in Foster et al. [21].

4. Finally, we discuss the limitations, as well as the safety, ethical, and policy implications of our work. Our approach to ethics and safety goes beyond the common gen AI guidelines, as we develop education-specific interventions (Section 9).

[Figure 2 graphic: a taxonomy tree. Data collection choices: participants (real learners, role-playing participants, researchers), single- or multi-turn, unguided or scenario-guided, learner proficiency (expert, novice). Ratings choices: evaluation type (human, automatic), rater perspective (educators, learners), evaluation scope (single turn, conversation level), comparative evaluations (one-at-a-time, side-by-side). The seven benchmarks placed within it: Language Model Evaluations (6.1), ASU interviews (7), Pedagogy scores (6.2), Turn-level pedagogy (5.2), Subjective learner feedback (5.1), Side-by-side pedagogy (5.4), and Conversation-level pedagogy (5.3).]

Figure 2 | Overview of the evaluation taxonomy introduced in Section 4.3.2 that underpins the seven pedagogical evaluation benchmarks introduced in this report. Each benchmark is unique in its place within the taxonomy and comes with its own benefits and challenges. Together, these different benchmarks provide a more comprehensive view on the pedagogical capabilities of gen AI tutors. Numbers in brackets represent section numbers describing each particular benchmark.

As a community, we are just at the beginning of a long journey towards building gen AI technology capable enough to meaningfully contribute to universal and equitable access to quality education [2]. Hence, we hope that this report is seen as an invitation to stakeholders in research, EdTech, ethics, policy, and education, to provide feedback on our early work, and to come together to establish common guidelines, benchmarks, and working principles to steer our joint work on the responsible development of transformational AI for education.¹

¹ While we are working on making our educational benchmarks accessible to the community, please reach out to us via email if you have any immediate suggestions or feedback, or via this form for a more formal research collaboration.

2. Participatory approach

This section details the participatory elements that helped shape this project, including the design of our evaluative approach, and our goals in developing LearnLM-Tutor. We firmly believe that responsible development of educational AI systems requires engaging learners, educators, policymakers, and academic researchers [27], to ensure that the resulting systems align with their needs, values, and
aspirations [28, 29]. We utilise diverse participatory research methods, including workshops, co-design exercises, semi-structured interviews, and user studies, in a collaborative and iterative development process.² In this report each participant is assigned a numerical identifier (P1 through P116). This includes participants from our workshops (P1-P94), initial interviews (P95-P97), co-design activities (P98-P106), and user studies described in Section 7 (P107-P116).

² This report describes previously unpublished work; see Tombazzi et al. [30] for a three-part article series on AI and the Future of Learning by The RSA and Google DeepMind.

2.1. Participatory workshops: Imagining and critiquing the future of education and AI

We conducted two participatory workshops in the UK: one with learners, primarily university students coming from diverse academic backgrounds (n = 60), and another with educators, mainly high school teachers specialising in STEM subjects (n = 34). The choice of the participant demographics was dictated by practical considerations. We realise that future work is needed to expand our reach to broader communities, since learners in the UK and other WEIRD³ countries likely encounter fewer barriers to accessing gen AI tools, and perspectives on AI in education likely differ substantially across cultural contexts.

³ Western, Educated, Industrialised, Rich, Democratic (WEIRD) countries [31] are often over-represented in psychological studies, despite not being representative of the global population.

Following established best practices for participatory workshops [32], we employed structured activities to foster interaction, collaborative learning, and group cohesion (see Section B.1 for more details). Participants were divided into small groups of five to eight individuals and engaged in two key exercises:

- Grounding exercise: This activity explored participants' educational experiences, revealing current needs, challenges, and potential areas for improvement regarding gen AI tools.
- Speculative design: This exercise encouraged participants to envision a scenario involving a learner facing various challenges. Through collaborative brainstorming, they explored how AI and social factors could exacerbate or mitigate these challenges.

These workshops highlighted current challenges in education: learners struggle with time management, cognitive overload, and demotivation when they perceive their learning materials as irrelevant; while educators struggle to provide personalised attention and feedback in classroom settings.

Personalised tutoring, by AI or humans, was valued by both learners and educators. Tutors are especially effective when they have knowledge of the learner and can adapt their approach accordingly. Learners felt more comfortable seeking clarifications from AI tutors than human tutors, perceiving AI tutors as less formal and less likely to induce fears of judgement. A shared limitation of both human and AI tutors was their lack of familiarity with the nuances of particular syllabi or exam board requirements.

Learners in the workshop were often strong adopters of gen AI. While aware of its limitations, they tended to be happy to work around them. Educators were more sceptical, citing worries about hallucinations, the potential for cheating, and the lack of adaptation to the learner's level and cognitive load in gen AI's "wall-of-text" responses. Both groups saw immediate benefits of gen AI tools, such as generating practice questions, critiquing and generating ideas, and summarising content.

A shared vision for the future of education emerged, emphasising the role of personalised AI tutors in enabling flexible, cross-disciplinary, and relevant learning opportunities. Additionally, virtual and augmented reality technologies were seen as beneficial through enhanced immersion. Educators desired real-time feedback and actionable insights from AI tools to improve teaching. They also cautioned against a future where learners become dependent on AI and lose their autonomy. When asked if they felt threatened by AI, educators expressed confidence that there would always be a role for humans in the process of teaching and viewed gen AI as a positive tool to assist them, freeing up more time for meaningful interactions with their students.
2.2. Understanding learning experiences: Initial interviews and Wizard-of-Oz sessions

To initiate our iterative participatory design process for LearnLM-Tutor, we conducted an exploratory series of user-centred studies involving both learners and educators. We enrolled three adult learners with an intrinsic interest in Python coding into the Codecademy "Learn Python 3" course, to develop a better understanding of the learning experience and needs of potential users. During the first weeks of the course, these learners participated in a series of semi-structured interviews and "Wizard-of-Oz" prototyping sessions. During the sessions, members of the research team simulated the role of an AI tutor through a chat interface, engaging in 1:1 interactions with each learner as if they were interacting with a fully functional AI system. In parallel, we conducted individual interviews with six teachers and academics specialising in the intersection of AI and learning science. These interviews aimed to capture educators' perspectives on the potential benefits and challenges of gen AI tutors in educational settings. These participatory design activities provided us with initial insights into user experiences, expectations, and challenges. They informed the key focus areas identified for the early development of LearnLM-Tutor and shaped the design of the turn-based evaluations described in Section 5.2.

Learners noted several main challenges with online courses: the learner's lack of assumed prerequisite knowledge, not being able to follow explanations due to missing details or logical steps, difficulty concentrating on long video lectures without doing exercises, and needing more help navigating the course materials. When doing practice problems, learners reported needing help breaking down the task into manageable chunks and diagnosing errors in their solutions; they reported that the tools they used could only point out the error, rather than how to diagnose it. Learners also wanted an AI tutor to have access to the same learning materials as them, use short communications that guide them in small steps, and give them frequent assessments of their knowledge. They did not want the tutor to give away too much information as they reported feeling pride in doing things themselves. They also wanted the tutor to be encouraging and constructive in its feedback, responsive and kind, proactive in soliciting questions from the learners, and always available.

From our conversations with the educators we have derived the following principles that apply to both human and AI tutors (see Section B.2 for additional principles that are only relevant to AI tutors):

- Do not give away solutions prematurely. Encourage learners to come up with solutions.
- Make explanations easy to understand, for example by making connections to the real world.
- Be encouraging. Celebrate learner progress and embrace mistakes as learning opportunities.
- Recognise when learners are struggling, and proactively check in with them.
- Ask questions to determine learner understanding and misunderstanding.
- Explain step-by-step, and deconstruct to teach thought processes.
2.3. Lessons from ShiffBot: Co-design activities

Another participatory effort that informed the development of LearnLM-Tutor is ShiffBot⁴, an educational AI experiment [33] that uses a "start with one" approach, a co-design framework centring on a single person with the goal of developing AI technology that can be impactful for them and their community. It then generalises from that starting point. The "start with one" approach aligns with participatory practices from contextual inquiry [34] and user-centred design [35], actively including the participant as a partner and stakeholder in the development process. By collaborating with a single participant, the broader research team gained a deep, contextualised understanding of the challenges and needs that can emerge in real-user settings.

⁴ ShiffBot is part of Google Lab Sessions, a series of experimental collaborations with innovators.

The participant for the ShiffBot project was Daniel Shiffman, an educator, NYU professor, and YouTube creator who teaches programming. The ShiffBot project aimed to explore possible ways that gen AI could provide value to learners and educators. Through a set of interviews with Daniel and his students, as well as classroom observations, the ShiffBot team developed the following set of guiding principles for AI development:

- Do not just give away the answers. Instead, help the learner discover their own answers. Then help them take their next steps.
- Aim to return appropriate credible resources.
- Be a safe space to make mistakes.
- See what the student sees: screen, code, and error messages.
- The bot will not always get it right. We should learn from the mistakes.

Working with Daniel made it clear that he valued a tight integration of the AI tutor with his learning materials. In Daniel's case, this involved integrating ShiffBot as a Chrome extension that works inside the web-based p5.js code editor that Daniel uses in the classroom when he teaches and in his YouTube learning videos. Because of the specific syntax of p5.js, it was important to bring retrieval augmented generation (RAG) to ShiffBot to ground its answers on the relevant parts of Daniel's video lectures, and refer his students to those videos instead of directly giving away an answer that relies purely on the underlying knowledge of the Gemini 1.0 model powering ShiffBot. Furthermore, the team worked on making ShiffBot adopt Daniel's particular (successful) teaching style and use an encouraging tone that creates a feeling of safety.
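To make the grounding idea concrete, here is a minimal sketch of the retrieval step behind a RAG setup of the kind described above. It is not the ShiffBot implementation: the transcript snippets, the word-overlap scoring function, and the prompt template are illustrative stand-ins for a production embedding model, vector store, and prompt.

```python
"""Toy retrieval-augmented grounding over lecture-transcript chunks (illustrative only)."""

# Hypothetical transcript chunks, each tagged with its source video and timestamp.
TRANSCRIPT_CHUNKS = [
    {"video": "p5.js Basics #3", "time": "04:12",
     "text": "In p5.js the draw function runs in a loop, once per frame."},
    {"video": "p5.js Basics #5", "time": "10:45",
     "text": "createCanvas sets the width and height of the sketch window."},
    {"video": "p5.js Basics #7", "time": "02:30",
     "text": "Use mousePressed to respond to a single mouse click event."},
]


def score(query: str, text: str) -> int:
    """Crude relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query: str, k: int = 2) -> list:
    """Return the k transcript chunks most relevant to the learner's question."""
    return sorted(TRANSCRIPT_CHUNKS, key=lambda c: score(query, c["text"]), reverse=True)[:k]


def build_grounded_prompt(question: str) -> str:
    """Compose a tutor prompt that cites retrieved lecture material instead of
    answering purely from the model's parametric knowledge."""
    chunks = retrieve(question)
    context = "\n".join(f'- ({c["video"]} @ {c["time"]}) {c["text"]}' for c in chunks)
    return (
        "You are a coding tutor. Ground your hints in the excerpts below and "
        "point the learner to the cited video rather than giving the full answer.\n"
        f"Lecture excerpts:\n{context}\n\nLearner question: {question}"
    )


if __name__ == "__main__":
    print(build_grounded_prompt("Why does my draw loop never stop running?"))
```

The design point is the one made in the text: the retrieved excerpts let the tutor point back to Daniel's own material, rather than substituting the model's generic knowledge for it.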
The participatory approach resulted in a chatbot that offered helpful suggestions, provided relevant examples, and guided students through coding challenges, all using a teaching style that resembled Daniel's. The iterative development process, informed by input from Daniel and his students, ensured that ShiffBot aligned with the needs and preferences of the target audience, while also identifying the limits of the current technology to inform its future improvements. In the interviews with the research team, his students indicated that ShiffBot provided them with meaningful assistance. Learner feedback included: "What I like about ShiffBot is that it doesn't disrupt the learning process. Doesn't just give the answer." P99; "ShiffBot is useful in understanding other people's code and also useful in cleaning up code." P100; and "Having used ShiffBot for a few days now, I do think it's quite handy to have it by my side, and actually encourages me to walk myself through my own sketch, and practice how to explain my thinking process more solidly!" P101

LearnLM-Tutor development adopted the guiding principles from the ShiffBot experiment, including the focus on grounded interactions, with the only exception of trying to copy Daniel's personality and teaching style.
3. Improving Gemini for education

This section surveys our work on enabling productive pedagogical behaviour in a language-based gen AI model⁵. We begin by framing our contributions with respect to related prior work in learning science, EdTech and AI research. We then describe a set of fine-tuning datasets we have developed to improve Gemini 1.0 for education, and introduce intermediate model versions trained on different subsets of these datasets showing varying degrees of pedagogical improvements. These models are numbered from earliest to latest in development, 0 to 4, where model 4 is LearnLM-Tutor. They are used to validate our evaluation methodology introduced in the subsequent sections, which is the primary focus of this report.

⁵ While Gemini 1.0 and other state of the art gen AI models support multi-modal capabilities, this report focuses exclusively on text-based educational use cases.

3.1. Lack of universal best pedagogical practices: lessons from learning science

Optimising an AI system for any goal requires a concomitant ability to measure progress. While learning and teaching strategies have been studied across many disciplines, defining (and subsequently quantifying) universal pedagogical principles remains a challenge. As critically noted by Slavin [36], educational research lags behind much of modern science, to the point where at the "dawn of the 21st century, educational research is finally entering the 20th century".

One reason why it has been hard to establish a common set of recommended pedagogical practices is related to the fragmentation of educational research across many disciplines. Even within the same discipline, many studies highlight different interventions or strategies with little overlap: Koedinger et al. [27] synthesised a list of thirty independent instructional principles after reviewing just nine primary sources. The resulting theories are often based on inconclusive evidence [37], and their translation to practice is often difficult or unclear [27, 38, 39]. Furthermore, most cognitive and learning science research tends to be done with small homogeneous populations [27], limited to specific narrow educational contexts, like subject domain, difficulty level, or prior learner knowledge [27], and typically conducted in WEIRD countries [40], which makes the findings hard to generalise. Studied interventions also come with variable implementation parameters (e.g. the time spacing between practices, the ratio of examples to questions) and can be combined in different ways, resulting in a combinatorial explosion in possible, often context-dependent, pedagogical strategies [27] that is hard to explore manually, let alone measure (see Figure 3, left).
3.2. Lack of transparency and common evaluation practices: lessons from EdTech

From the earliest mechanical teaching machines by Pressey (1924) and Skinner (1954) [41], to the first digital Computer Assisted Instruction (CAI) systems [42, 43] and the more modern Intelligent Tutoring Systems (ITSs) [44–66], education has always been an important application for the latest computing technology. From the earliest instantiations, these systems tended to follow a similar blueprint. They assume that the learner is interacting with the tutoring system without any assistance from a human teacher, and the tutoring system guides the learner through a pre-defined set of learning materials with some level of adaptation to the learner's progress (e.g., choosing the difficulty of the next practice problem based on how well the learner did on the previous ones), and some level of timely feedback (e.g., at the step or solution level) [41, 44, 48].

Under the hood, ITSs tend to be rule-based expert systems [67–70], the predominant AI paradigm in the 1970-1980s. Although expert systems have many positive qualities, they have largely been replaced by deep learning in recent years due to difficulties with scale and generality inherent in the paradigm [71, 72]. These limitations of expert systems also lead to the most common criticisms of ITSs (see Section C for further discussion).

Despite initial excitement about the potential of ITSs to revolutionise education [73, 74], and their broad adoption [18, 75], it remains unclear if they can impact teaching and learning in a meaningful way [17, 76]: evidence of their effectiveness is mixed [17, 21, 77, 78], and the underlying evaluation protocols have come under criticism [79, 80] (see Section C.1 for more details). Indeed, no guidance exists on the best evaluation practices for EdTech (including ITSs) [17, 81–83]. The available evaluation protocols tend to be expensive, time consuming, and flawed [84], so are often neglected. There is also little transparency around the research that led to the creation of the technology [21]. Altogether, these conditions place an undue burden on educators, who are already overworked and often lack the necessary digital skills, to evaluate the strengths and limitations of EdTech solutions on an informal basis [17, 80, 85]. While AI literacy programs⁶ are an important step to help educators form more informed decisions on the value of new technology, EdTech needs better evaluation practices to bridge the gap between technology creators and users.

⁶ E.g. Experience AI (Raspberry Pi Foundation and Google DeepMind) and Generative AI for Educators (MIT and Grow with Google).
3.3. Generative AI in education

Deep learning has become the predominant paradigm in AI since the publication of the seminal AlexNet paper [86] in computer vision. It has removed the dependency on humans to provide structured knowledge to AI by enabling AI systems to discover structure from data on their own during training. Over the last 12 years, AI researchers have seen many examples of "the bitter lesson" that data and scale tend to trump carefully crafted rules or representations [87]. The latest shift to the gen AI era is a particularly striking demonstration of this lesson. The transformer architecture [88] has reached a level of performance and generality never before seen in AI, mostly through scaling up to more data and compute⁷. Although there has been a lot of excitement about the potential impact of the recent gen AI technology in education, and a number of gen AI-based tutors have emerged [89–105], the full extent of this potential has not materialised just yet. A recent review of gen AI tutoring systems found that "dialog tutoring has largely remained unaffected by these advances" [106].

⁷ While data and scale have been largely responsible for improvements in "pre-trained" models, the supervised fine-tuning process, in which these models are adapted to specific tasks or behaviours through a slight modification of their parameters using example demonstrations of desired behaviours, has so far moved in the opposite direction, requiring less but better quality demonstration data.

Out of the box, gen AI models have a remarkable ability to understand user queries expressed in natural language and generate responses that synthesise relevant information from across the internet (used in the gen AI pre-training) to answer in a helpful and harmless way. However, by default, these models do not typically behave like human tutors. Such default behaviour can be modified in two ways: prompting or fine-tuning (through supervised and/or reinforcement learning). We will discuss the difficulties of both approaches that have affected the pace of progress in gen AI for education, as well as our own efforts in these directions.
3.3.1. Prompting

Prompting is the easiest and most popular way to adjust the behaviour of gen AI (25/33 papers presented at the recent NeurIPS 23 workshop on Generative AI for Education used prompt engineering [107]). All it requires is for the EdTech designer to write a set of instructions in natural language on what good tutoring behaviours look like, for example: "Start by introducing yourself to the student as their AI-Tutor who is happy to help them with any questions. Only ask one question at a time. First, ask them what they would like to learn about. Wait for the response." [1, 108].
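As a deliberately simplified illustration of this style of behaviour specification, the sketch below assembles a rule-style tutoring prompt and the role/content message list that would be handed to a generic chat-completion API. The rules and the message format are illustrative placeholders, not the prompt used for the models in this report.

```python
"""Illustrative prompt-based tutor configuration (not the prompt used in this report)."""

# Declarative tutoring rules: every desired behaviour (and its exceptions)
# has to be spelled out in natural language, much like an expert system.
TUTOR_RULES = [
    "Introduce yourself as the student's AI tutor and ask what they want to learn.",
    "Only ask one question at a time, and wait for the student's response.",
    "Do not give away the full solution; offer a hint and ask a guiding question.",
    "Exception: if the student has already answered correctly, confirm and move on.",
    "Adjust vocabulary to the grade level the student reports.",
]


def build_messages(student_turn: str) -> list:
    """Package the rules as a system prompt plus the latest student message,
    in the generic role/content format used by most chat-style APIs."""
    system_prompt = "You are a patient tutor. Follow these rules:\n" + "\n".join(
        f"{i + 1}. {rule}" for i, rule in enumerate(TUTOR_RULES)
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": student_turn},
    ]


if __name__ == "__main__":
    for message in build_messages("Can you just tell me the answer to question 3?"):
        print(message["role"].upper(), message["content"], sep=": ")
```

The brittleness discussed next stems from exactly this structure: every pedagogical judgement has to be anticipated and written down as another rule.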
The prompting approach, however, has a number of limitations. Most importantly, it requires explicit specification of what good tutoring behaviours look like in natural language. This involves enumerating what should be done and when, what should be avoided and when, all the possible exceptions to the rules, etc. This makes prompted gen AI-based tutors similar to ITSs: while gen AI is more general and faster to build (based on an existing foundation model), in the end both are limited by declarative knowledge of what the best educational practices look like. However, as discussed in Section 3.1, as a community we have not come even close to fully exploring the search space of optimal pedagogical strategies, let alone operationalising excellent pedagogy beyond the surface level into a prompt.

We spent some time trying to elicit pedagogical behaviour via prompting. In some cases, this worked well, for example when instructing the model to ask a user for their grade level and responding with age-appropriate vocabulary. However, we found that most pedagogy is too nuanced to be explained with prompting. Furthermore, prompting produced unreliable and inconsistent results, because there are limits to how much it can push the behaviour of gen AI away from the core principles ingrained into it during the pre-training and instruction tuning phases of its development (see Section D for a discussion of these limitations in the educational context). Such inconsistent performance is incompatible with providing reliable standards of pedagogy for all learners throughout the entire learning journey. Hence, we decided to turn to fine-tuning for more deeply embedded pedagogical behaviour, and only rely on prompting to adjust more superficial characteristics and user preferences.
3.3.2. Fine-tuning

If prompting can be roughly seen as the modern, more capable generalisation of expert systems, its alternative, fine-tuning, which typically includes stages of supervised fine-tuning (SFT) followed by Reinforcement Learning from Human Feedback (RLHF), brings the full power of the deep learning paradigm, i.e. learning from data, to the table. While far less computationally intensive than the standard pre-training phase, fine-tuning can still be costly to perform on models with many billions of parameters [101], which explains why it is less explored in the gen AI for education literature compared to prompting. However, fine-tuning (RL in particular) may enable AI to capture some of the intuition and reasoning that humans use in effective teaching, leveraging backpropagation to search the vast space of pedagogical possibilities discussed in Section 3.1.

In our current work, models 0–4 are fine-tuned via SFT over all parameters of a base model (PaLM 2.0 [109] for models 0–3 and Gemini 1.0 [10] of comparable size for model 4; see Section E for further implementation details). While reward modeling and RL are crucial (and in our opinion the most promising) ingredients to building high-quality gen AI tutors, we have thus far focused only on SFT (and the requisite creation of behaviour cloning data). Of course, this puts our models at a serious disadvantage in evaluations against the base models, which include both SFT and (non-pedagogical) RL, and we plan to incorporate RL in the future (see Section F for a discussion of the challenges that come with eliciting human preferences to support RL for educational use cases).

It is worth mentioning that base models (PaLM 1.0 [110], PaLM 2.0 [109], Gemini 1.0 [10], and now Gemini 1.5 [111]) are improving rapidly. Each new model holds more knowledge, can perform more tasks more accurately, and is more controllable via prompting, so the task of improving them with respect to a particular set of behaviours, like pedagogy, is constantly evolving. While model 3 far outperformed PaLM 2.0 across many of our metrics, the gap between model 4 (which basically differs from model 3 only in the base model it adapts) and prompt tuned Gemini 1.0 is much smaller. Our ultimate goal may not be the creation of a new pedagogical model, but to enable future versions of Gemini to excel at pedagogy under the right circumstances.

[Figure 3 graphic. Left panel: a schematic contrasting regions of "optimal pedagogy" with regions of "no pedagogical value". Right panel: the fine-tuning datasets (Human tutoring, Gen AI role-play, GSM8k dialogue, Golden conversations) placed along the axes "Utility for pedagogy training (higher is better)" and a data spectrum from fully synthetic to fully human.]

Figure 3 | Left: illustration of the arguments made in Section 3.1. Hypothetically all pedagogical behaviour can be visualised as a complex manifold lying within a high-dimensional space of all possible learning contexts (e.g. subject type, learner preferences) and pedagogical strategies and interventions (some of which may only be available in certain contexts). Only small parts of this manifold may be considered as optimal pedagogy, and such areas are hard to discover due to the complexity of the search space. Right: no ideal dataset exists for pedagogy, so we experimented with a mixture of datasets, each covering a small slice of pedagogical contexts and strategies, each with its own strengths and weaknesses, each involving varying levels of human input and effort, and each being an imperfect (to varying degrees) approximation of what may be considered as good pedagogy (see Section 3.4 for more details).

Successful fine-tuning has two prerequisites: enough high-quality data (provided by researchers in the SFT case, or self-generated by the learning agent through exploration in the RL case) and a good measure of success. This was the key to many modern success stories in AI, from AlphaGo [112] to AlphaFold [113]. However, neither are available in the education domain. This section addresses the lack of high-quality pedagogical data to enable education-related SFT, while the lack of a good measure of success is discussed in subsequent sections.

Human tutoring data is scarce [94, 98, 100, 101, 106], with only four datasets openly available [114–117] to our knowledge, all of which suffer from limitations, such as a lack of grounding information, low tutoring quality, small dataset size, and noisy classroom transcriptions [89, 94]. Furthermore, most human tutoring data is focused only on language learning [100, 106]. Recently, researchers have started to use synthetic data generation to produce better quality and higher quantities of tutor dialogue data, but so far this has not resulted in a strong performance gain for the fine-tuned models [104].

To address the shortage of SFT data, we created our own datasets, following three main requirements: first, our data should adhere to the principles developed through the participatory studies described in Section 2. For example, the interactions should be grounded in lesson materials that are shared between the tutor and the learner (for the purpose of the report, we primarily ground our interactions in educational YouTube videos), and should demonstrate pedagogical abilities such as identifying mistakes, providing useful feedback and hints, and promoting engagement through active learning. Second, it should include multi-turn conversations with a variety of hypothetical learners across a wide range of topics. Long conversations are crucial to demonstrate how the model should adjust its behaviour in light of an evolving dialogue. Third, our data should demonstrate appropriate pedagogical responses with respect to the current limitations of text-based gen AI (see Sections D and G).

[Table 1 body not recoverable from the extraction; its columns list the fine-tuning datasets: Human tutoring, Gen AI role-play, GSM8k dialogue, Golden conversations, and Safety.]

Table 1 | Breakdown of datasets used for fine-tuning the 0–4 models, where model 4 is our best tutor model, LearnLM-Tutor. Different models used different versions and different weights of these datasets. Model 2 was trained on 10% of the Golden conversations, and for model 4 training we up-weighted the Golden conversations. Models 0–3 were fine-tuned over the PaLM 2.0 [109] base model, while model 4 was fine-tuned over Gemini 1.0 [10].
3.4. Our SFT datasets

In this section, we describe the datasets we created. Fine-tuning data is often classified as either synthetic (generated by an algorithm) or human (written by a human expert). Synthetic data is often seen as easier to obtain but of worse quality than human data. We believe that the ultimate goal of SFT data is to demonstrate as much of the "optimal pedagogy" from within the high-dimensional space of all possible pedagogical strategies as possible (Figure 3, left). Since such a dataset of perfect tutoring does not exist (even the most talented human teachers are unlikely to demonstrate such perfect behaviour), approximations have to be obtained. These approximations fall on a spectrum between fully synthetic (almost never possible because there is always a human who ultimately designs what good synthetic data should look like, thus injecting human influence) to fully human-created (e.g. recorded conversations between a human learner and human teacher). This section describes the datasets used in each of the milestone models described in this report (see Table 1) and where they fall on this spectrum (see Figure 3, right).

Human tutoring: We collected a dataset of conversations between human learners and educators by pairing them through a text-based chat interface and paying for their time. Although this data provides demonstrations of human pedagogy, it has a number of limitations. It is not targeted to any specific pedagogical behaviour, contains off-topic discussion related to the task and setting (e.g., "looks like our time is up"), and is of uneven quality overall (see Section L for more details).

Gen AI role-play: To demonstrate specific pedagogical behaviour, we developed a role-playing framework, in which gen AI models play both tutor and learner. Each was provided with a set of states and strategies relevant to their roles through static prompts, along with dynamic prompting to help them respond to the selected state in the counterpart. For example, when the learner model selects the "make mistake" state and generates a flawed solution, this state would be inserted into the tutor prompt to help the tutor model identify and correct the mistake. While the resulting data is synthetic, the hand-engineered framing (human intervention) produced by the dynamic prompting and the injection of privileged information about the internal state of the learner into the tutor resulted in a reasonably consistent (if sometimes stilted) pedagogical dialogue over very long conversations. This was further improved through manual filtering and editing by the researchers.
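The sketch below illustrates the shape of such a role-play loop. It is not the framework used for LearnLM-Tutor: the learner states, the prompt templates, and the `call_model` stub (standing in for a real gen AI call) are all illustrative.

```python
"""Illustrative role-play data generation loop (tutor and learner both played by a gen AI model)."""
import random

LEARNER_STATES = ["ask question", "make mistake", "answer correctly", "express confusion"]


def call_model(prompt: str) -> str:
    """Stand-in for a call to a gen AI model; returns a placeholder turn."""
    return f"<model response to: {prompt[:60]}...>"


def learner_turn(topic: str, state: str) -> str:
    """The learner model is statically prompted with its role and dynamically
    prompted with the sampled behavioural state."""
    return call_model(
        f"You are a student learning about {topic}. "
        f"In your next message, act out this state: {state}."
    )


def tutor_turn(topic: str, learner_message: str, learner_state: str) -> str:
    """The tutor model receives privileged information about the learner's hidden
    state (e.g. that a mistake was made on purpose), which helps it respond with
    the intended pedagogical move."""
    return call_model(
        f"You are a patient tutor teaching {topic}. The learner's hidden state is "
        f"'{learner_state}'. If it is 'make mistake', identify and gently correct "
        f"the error in: {learner_message}"
    )


def generate_dialogue(topic: str, num_exchanges: int = 3) -> list:
    """Produce one synthetic tutor/learner conversation for later filtering and editing."""
    dialogue = []
    for _ in range(num_exchanges):
        state = random.choice(LEARNER_STATES)
        learner_msg = learner_turn(topic, state)
        dialogue.append({"role": "learner", "state": state, "text": learner_msg})
        dialogue.append({"role": "tutor", "text": tutor_turn(topic, learner_msg, state)})
    return dialogue


if __name__ == "__main__":
    for turn in generate_dialogue("fractions"):
        print(turn)
```

The key design choice mirrored here is the injection of the learner's sampled state into the tutor prompt, which is what keeps the synthetic dialogue pedagogically consistent over long conversations.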
GSM8k dialogue: Another attempt to create high-quality synthetic data involved converting GSM8k [118] word problems and associated step-by-step solutions (we used the "Socratic" version of the dataset) into learner/tutor conversations, an adaptation of "dialogue in-painting" [119]. Each tutor turn consists of the "Socratic" version of the next solution step, while a prompted gen AI model produces a response (as in the role-playing framework, we sample a behavioural state that allows for both correct and incorrect learner turns). To improve flow and pedagogy across turns, we used another prompted model to rewrite the original suboptimally worded conversation. This dataset is synthetic in the sense that each learner and tutor turn was written or edited by gen AI, but by conditioning on human-written step-by-step solutions, we have much greater assurance of correctness.
Golden conversations: Since SFT typically benefits from the highest possible quality data, we worked with teachers to write a small number of conversations that explicitly demonstrate all the pedagogical behaviours we wanted the model to learn. We developed a rubric that included a learning scenario or lesson as context, a minimal learner persona, and a set of behaviours to include (e.g., adjust the level of explanation based on feedback from the learner, suggest an appropriate quiz question). Writing these conversations is labour intensive, and we used gen AI to help brainstorm dialogue snippets or write specific tutor responses (synthetic component) that were then edited to improve quality and pedagogy.

Safety: We also created a pedagogy-specific safety fine-tuning dataset, described in Section 9.3.

We are calling special attention to the interplay between the more synthetic (Gen AI role-play and GSM8k dialogue) and the more human (Golden conversations) data generation because of how crucial this was in eliciting good pedagogical behaviour through fine-tuning. We found that the more human examples were used to demonstrate the stylistic attributes (e.g. appropriate encouragement, when to pause, how to give proactive guidance), while the more synthetic examples helped fill more substantive gaps (e.g. how to identify and correct mistakes). One of the reasons why conversations between human tutors and human students (Human tutoring) were of limited value is because of the substantial gap between how a human tutor behaves and what we expect from an AI tutor (see Section G). On the opposite end of the spectrum, fully synthetic data without human intervention cannot have enough useful pedagogical signal to be useful.

4. Measuring Pedagogy in Gen AI

Before evaluating education-specific improvements of LearnLM-Tutor over the prompt tuned Gemini 1.0, we first discuss whether our interventions resulted in any performance regressions in general accuracy. We then provide an overview of existing pedagogical evaluations from the gen AI literature, before describing our own approach to measuring pedagogy in gen AI tutors.

4.1. Accuracy on education-related benchmarks

We checked whether our fine-tuning interventions resulted in any regressions in accuracy of LearnLM-Tutor compared to base Gemini 1.0. To this end, we ran existing education-related benchmarks including MMLU [120], MATH [121], HellaSwag [122], and HumanEval [123], and safety benchmarks including RealToxicityPrompts [124] and BBQ [125] with LearnLM-Tutor using exactly the same setups that were used for Gemini et al. [10]. The results of LearnLM-Tutor reproduce the performance of Gemini Pro [10], for example an MMLU score of 0.72 and a MATH score of 0.33.

While this is a necessary criterion for demonstrating that there are no performance regressions, it is not sufficient, as the model might be taken out of the fine-tuning data distribution back into the pre-training distribution of the base model in these few-shot prompting settings. We therefore also evaluated the performance of LearnLM-Tutor and Gemini 1.0 in the pedagogical conversation context by measuring the accuracy of the individual turns produced by these models. We found no significant differences between the prompt tuned [1] Gemini 1.0 and LearnLM-Tutor scores in terms of human turn-level accuracy evaluations in the open-ended grounded conversation setting (described in Section 5), with 96% of Gemini 1.0 and 93% of LearnLM-Tutor turns containing factual information rated as "Fully verified" (p = 0.13, Welch's t-test; see Section H for more details).
169、ed using domain-agnostic metrics which act as a proxy for how coherentand human-like the generated responses are(e.g.,BLEU 126,BERTScore 127,Rouge 128,DialogRPT 129),but which are not designed to measure pedagogy or other education-specificcapabilities 89,98100,103,106.Such metrics also often assume
170、 that there is a ground truthanswer that the model response should match.However,there are many ways to respond to the samelearner query with potentially equal pedagogical value,so a single“optimal”answer is impossibleto define 98,103,130.Many metrics are also easy to trick;for example,always respon
171、ding with“Hello”can score highly 131,and adding a“teacher:”prefix can increase scores 100.A promisingnew approach to fast evaluations of gen AI tutors could be to use another gen AI for“critique”132.Recently,Chevalier et al.104proposed using such gen AI critics to evaluate the presentation andcorrec
172、tness of the statements generated by a gen AI tutor.We are not aware of any group using suchcritics for pedagogical evaluations.An alternative to automatic evaluations described above is using human experts to evaluatepedagogical performance.Interactive human evaluations are known to be important 91
173、,133,134and tend to correlate better with user satisfaction 133.However,access to pedagogical experts isnot easy,so typically studies use either very few experts(10)9799 or the evaluation is done bystudy authors 103,which can both lead to biases.Furthermore,there is no agreed-upon protocolfor runnin
174、g pedagogical human evaluations.The most commonly used human evaluation framework(Tack and Piech98)asks human raters to compare the responses of two tutors in the context ofthe same dialogue snippet.The comparison is done along three dimensions:replying like a teacher,understanding of the student,an
175、d helpfulness.These dimensions are based on Demszky et al.135and are important dimensions to evaluate,but they do not capture the full richness of pedagogy.An important test of any gen AI tutor is whether it actually improves the learning outcomes ofreal students.Very few studies have run such evalu
176、ations,as most of them use paid raters to act aslearners 102.Evaluations with real students are typically done with a small number of participantsand in controlled experimental lab settings,which limits their validity 101.A notable exceptionis Liu et al.105,who embedded a gen AI tutor into a CS50 MO
177、OC course and made it availableto millions of real students.However,the use of the tutor had to be heavily throttled due to costconsiderations,and the results reported so far are limited in scope and come from a small number ofon-campus students.The difficulties in evaluating gen AI tutors mean that
178、 research groups are evaluating their genAI tutors using their own metrics 89,92,93,96,97,101105,which makes different approacheshard to compare(the BEA 2023 Shared Task 99 is a notable exception).There is a well-recognisedneed to develop better evaluation metrics suited to AI in education 79,99,100
179、,106,107.However,Tack et al.99conclude that we are a long way from achieving the precise,valid,and automatedpedagogical evaluations needed for progress in AI for education.14Towards Responsible Development of Generative AI for Education:An Evaluation-Driven Approach4.3.Our approachInthissection,wedi
180、scussourapproachtonarrowingdownthevastspaceofallthepossiblepedagogicalstrategies(Section 3.1)and translating it into an evaluation rubric.We include discussion of themany pragmatic questions we considered,such as implementation difficulty,cost,validity,and otherfeasibility concerns.4.3.1.Pedagogy ru
4.3.1. Pedagogy rubrics

Alongside the principles described in Section 2, we combined further insights from our participatory sessions with literature reviews to create a high-level pedagogy rubric, which we then translated into measurable tutor behaviours by working together with teachers as expert advisers. The high-level pedagogical principles we prioritised are: encourage active learning (the learner should manipulate information through discussion, practice, and creation, instead of passively absorbing information [136–139]), manage cognitive load (the tutor should present information in multiple modalities, structure it well, and segment it into manageable chunks [140]), deepen metacognition ("thinking about thinking", which enables learners to generalise their skills beyond a single context [141–143]), motivate and stimulate curiosity (as this leads to self-efficacy and lifelong learning [144, 145]), and adapt to learners' goals and needs (by assessing the current state and the goals, and making a plan to bridge the gap [146]). Each high-level pedagogical principle was translated into different measurable items used in different benchmarks (see Table 2 for automatic language model evaluation, Table 10 for conversation-level human evaluation, and Table 13 for turn-level human evaluation). These items took various forms, e.g. differing in the wording of the questions and in the level of granularity at which each high-level principle was broken down, while still designed to measure the same principle. This was to assess whether measuring the same pedagogical capability through different lenses provides a consistent answer, and also due to practical considerations (e.g. a different approach needs to be taken when asking a human or a gen AI critic to assess the same pedagogical principle). This is our first attempt at defining a pedagogical rubric, and we plan to iterate, improve, and expand it in the future.
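One way to operationalise such a rubric is to keep each high-level principle together with the measurable items derived from it, so that the same principle can be rendered as a question for a human rater or for a gen AI critic. The sketch below uses the five principles named above, but the example items are illustrative placeholders, not the items in Tables 2, 10, and 13.

```python
"""Illustrative representation of a pedagogy rubric (principles mapped to measurable items)."""
from dataclasses import dataclass, field


@dataclass
class Principle:
    name: str
    description: str
    # Example yes/no items a rater (human or gen AI critic) could answer per conversation.
    items: list = field(default_factory=list)


RUBRIC = [
    Principle("Encourage active learning",
              "The learner manipulates information through discussion, practice, and creation.",
              ["Did the tutor ask the learner to attempt a step before revealing it?"]),
    Principle("Manage cognitive load",
              "Information is well structured and segmented into manageable chunks.",
              ["Were explanations broken into short, sequential steps?"]),
    Principle("Deepen metacognition",
              "The tutor prompts 'thinking about thinking'.",
              ["Did the tutor ask the learner to reflect on their own reasoning?"]),
    Principle("Motivate and stimulate curiosity",
              "Feedback supports self-efficacy and further exploration.",
              ["Did the tutor acknowledge progress and suggest a follow-up question?"]),
    Principle("Adapt to the learner's goals and needs",
              "The tutor assesses the current state and plans towards the learner's goal.",
              ["Did the tutor adjust difficulty after the learner's responses?"]),
]

if __name__ == "__main__":
    for principle in RUBRIC:
        print(principle.name, "->", principle.items[0])
```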
188、rent trade-off profiles.We aimed for this set of benchmarks to provide a comprehensiveview on the pedagogy performance of AI tutors.They were designed to be diverse and to traverseall nodes of the proposed taxonomy.Future work should do a more systematic investigation of howeach node in the taxonomy
affects the validity and effectiveness of the resulting benchmark. This taxonomy is described in more detail here (a compact sketch of the same axes in code follows the list):

Data collection: Participants. To evaluate a gen AI tutor, we need to collect its responses in learning conversations. Who should interact with the tutor in these conversations?
- Real learners: strong validity; hard to recruit; no control over tutor usage; ethically hard to justify testing sub-optimal gen AI.
- Role-playing participants: questionable validity; easy to recruit; potential bias; always available; give informed consent and are paid to test.
- Researchers: questionable validity; always available.

Data collection: Single- or multi-turn. Should we collect single conversation turns individually, or many turns simultaneously?
- Single-turn: low validity (tutoring is inherently multi-turn); easier to create data.
- Multi-turn: strong validity; hard to create data.

Data collection: Unguided or scenario-guided. When role-playing participants simulate multi-turn conversations, should they be given guidance to structure their interactions with the tutor?
- Unguided: the participant may actually try to learn about something that interests them (greater validity); higher risk of short or lazy interactions; may not cover all scenarios of interest.
- Scenario-guided: the proposed structure may go against the role-playing participant's intrinsic motivation (less validity); some guardrails against bad data quality; can be designed to cover a range of situations.

Data collection: Learner proficiency. Assuming paid participants are used to simulate learning interactions, should they be experts or novices in the subject they are studying with the tutor?
- Expert: more trust in their evaluation of responses; can simulate interactions on complex topics; not actually learning; lower validity (may not ask naive questions).
- Novice: less likely to doubt tutor responses; only data on beginner topics; may actually be learning; higher validity in terms of basic interactions.

Ratings: Evaluation type. Should tutor responses be rated by humans or automated strategies?
- Human: better validity; expensive; slow.
- Automatic: not always accurate; cheap; fast.

Ratings: Rater perspective. Learners and educators have different perspectives on what makes a good tutor response 147,148. While learners may be the direct users of gen AI tutors, educators decide whether to incorporate them into their teaching or recommend them to learners.
- Learners: easier to recruit; cannot always judge pedagogy and accuracy.
- Educators: harder to recruit; best validity of pedagogical judgements.

Ratings: Evaluation scope. When evaluating multi-turn pedagogical conversations, should raters judge each tutor turn individually, or the entire conversation holistically?
- Single turn: less cognitive load; can be done by less expert raters; not everything can be judged at turn level.
- Conversation level: more cognitive load; requires expert pedagogical raters; potential to capture deeper pedagogy.

Ratings: Comparative evaluations. When comparing gen AI tutors, should we evaluate each on its own using common benchmarks, or should we compare them directly side-by-side?
- One-at-a-time: faster/cheaper; harder to calibrate ratings; rater bias.
- Side-by-side: slower/more expensive; more calibrated; order bias.
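The taxonomy above is essentially a small design space. Purely as an illustration (this structure is not part of the report; the names and layout are invented), it can be captured in a few lines of Python so that the trade-offs of a planned benchmark can be looked up programmatically:

```python
# Illustrative restatement of the evaluation-design taxonomy above as a Python
# data structure. Not from the report; names and structure are hypothetical.
from dataclasses import dataclass

@dataclass
class Axis:
    question: str    # the design question this axis answers
    options: dict    # option name -> list of trade-off notes

EVALUATION_TAXONOMY = {
    "data_collection/participants": Axis(
        question="Who should interact with the tutor?",
        options={
            "real_learners": ["strong validity", "hard to recruit",
                              "no control over tutor usage",
                              "ethically hard to justify testing sub-optimal gen AI"],
            "role_playing_participants": ["questionable validity", "easy to recruit",
                                          "potential bias", "always available",
                                          "give informed consent, paid to test"],
            "researchers": ["questionable validity", "always available"],
        }),
    "ratings/evaluation_type": Axis(
        question="Human or automatic ratings?",
        options={
            "human": ["better validity", "expensive", "slow"],
            "automatic": ["not always accurate", "cheap", "fast"],
        }),
    # ...the remaining axes (turn structure, guidance, learner proficiency,
    # rater perspective, evaluation scope, comparative setup) follow the same pattern.
}

# Example: look up the trade-offs of automating the ratings.
print(EVALUATION_TAXONOMY["ratings/evaluation_type"].options["automatic"])
```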
5. Human evaluations

In this section, we present the results of our human evaluations comparing LearnLM-Tutor to base prompt-tuned 1 Gemini 1.0. Interactions with human participants represent the gold standard for evaluation in responsible AI development; simulations cannot fully capture the complexities of real-world settings 149-152. Human participants allow us to observe authentic user behaviour and system responses within the context of dynamic, goal-oriented conversations. They can reveal issues that simulations might miss. Engaging with human participants is also crucial for promoting inclusion and representation in the development process 149. On the other hand, human evaluations suffer from limited sample sizes due to the expense and slow nature of recruiting pedagogical experts and collecting their judgements using cognitively demanding rubrics. Furthermore, special care needs to be taken to iterate over the rater instructions and the data collection pipelines to ensure the validity, consistency and calibration of the collected human rater judgements. All of these factors tend to lead to limited statistical significance of human evaluation results, which we also found to be the case.
However, we see our results as signs of progress towards imbuing the Gemini 1.0 base model with additional pedagogical capabilities. We prioritised responsible design and conduct across all studies, following guidelines from research ethics 153 (see Section I for details of our human evaluation).

5.1. Unguided conversations: Subjective learner feedback

Figure 4 | Welch's t-test (with Holm-Bonferroni adjustment) effect sizes comparing the learner scores between Gemini 1.0 (n = 33) and LearnLM-Tutor (n = 27). Dark indicates significance (p < 0.05).

Learners first engaged in a 45-minute unguided (open-ended) session with a provided AI tutor through a chat interface. The tutoring session was grounded in an academic YouTube video, which they could select from a list, on maths, CS, biology, chemistry, literature, history or other subjects, like public speaking (see Section J.1 for the data collection details). They were then asked seven questions to assess their perception of the tutor. Learners rated LearnLM-Tutor higher than the Gemini 1.0 tutor in most categories (Figure 4). However, we only achieved statistical significance for one of them: learners felt more confident about applying what they had learnt with LearnLM-Tutor in the future by themselves.
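To make the statistical procedure behind Figure 4 concrete, the sketch below shows one way to compute Welch's t-tests with a Holm-Bonferroni adjustment and a standardised effect size in Python. It is a minimal illustration, not the authors' analysis code: the rating data, the question labels, and the use of Cohen's d as the effect size are assumptions.

```python
# Minimal sketch (not the authors' code) of an analysis in the style of Figure 4:
# Welch's t-tests per survey question, Holm-Bonferroni correction across
# questions, and a standardised effect size. The data below are synthetic.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
questions = ["helpfulness", "engagement", "confidence_to_apply"]     # hypothetical labels
gemini_scores = {q: rng.integers(1, 8, size=33) for q in questions}    # n = 33 learners
learnlm_scores = {q: rng.integers(1, 8, size=27) for q in questions}   # n = 27 learners

def cohens_d(a, b):
    """Standardised mean difference (positive favours `b`)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(b) - np.mean(a)) / np.sqrt(pooled_var)

pvalues, effect_sizes = [], []
for q in questions:
    # Welch's t-test: does not assume equal variances in the two groups.
    res = stats.ttest_ind(learnlm_scores[q], gemini_scores[q], equal_var=False)
    pvalues.append(res.pvalue)
    effect_sizes.append(cohens_d(gemini_scores[q], learnlm_scores[q]))

# Holm-Bonferroni controls the family-wise error rate across the questions.
reject, p_adjusted, _, _ = multipletests(pvalues, alpha=0.05, method="holm")
for q, d, p, sig in zip(questions, effect_sizes, p_adjusted, reject):
    print(f"{q}: d = {d:+.2f}, adjusted p = {p:.3f}, significant = {bool(sig)}")
```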
5.2. Turn-level pedagogy: teacher feedback

We asked expert pedagogical raters to review and rate the unguided conversations from our learner study (Section 5.1). For each tutor turn, they determined whether one of nine suggested pedagogical "moves" was appropriate and desired in the conversational context (see Table 13 for the breakdown of questions). If the answer was "yes", they were asked whether the response followed the desired pedagogical principle (see Section J.2 for details).

Figure 5 | Welch's t-test effect sizes (with Holm-Bonferroni adjustment) comparing the turn-level expert rater scores evaluating the pedagogical quality of Gemini 1.0 and LearnLM-Tutor across different pedagogy dimensions. Dark indicates significance (p < 0.05). See Section J.2 for details on what each pedagogical dimension refers to and the tutor turn counts used in these calculations.
Figure 5 shows a similar pattern of results to those reported by the learners who interacted with the AI tutors first-hand (Section 5.1). LearnLM-Tutor is seen as significantly better than base Gemini 1.0 at promoting engagement in the learners. While not statistically significant, LearnLM-Tutor appears to be rated worse than Gemini 1.0 at speaking encouragingly. While more investigation is needed to understand this effect, our current hypothesis is that this is either because LearnLM-Tutor did not go through an RL stage, and hence is not optimised for user preferences unlike the prompted Gemini 1.0 baseline, or it is a side-effect of our choice to make LearnLM-Tutor respond in shorter messages that get to the point faster to avoid the "wall-of-text" effect that educators and some learners described in our participatory workshops as leading to undesirable cognitive load (Section 2). Indeed, LearnLM-Tutor messages were 297.6 tokens long on average, compared to 423.0 for the Gemini 1.0 tutor. There were some other interesting trends that were not statistically significant: for example, LearnLM-Tutor was rated better at identifying mistakes, but worse at identifying successes. This may be because we targeted overcoming the built-in sycophancy of gen AI (see Section D), which could have resulted in an unintended decline in LearnLM-Tutor celebrating the learners' successes. See Figure 6 for a snippet of one of the unguided conversations with the obtained turn-level pedagogical ratings.
[Figure 6 shows a snippet of an unguided conversation about a public-speaking lesson between a learner and LearnLM-Tutor, with each tutor turn annotated against turn-level pedagogical dimensions such as explains concepts, guides student, identifies goal, inspires interest, promotes engagement, speaks encouragingly, identifies mistakes, identifies successes, and monitors motivation.]

Figure 6 | Example of turn-level pedagogical ratings for a snippet of one of the unguided conversations collected with LearnLM-Tutor. Each tutor turn is rated by 3+ educators, and the majority rating is shown. "N/a" means that either fewer than 3 ratings are available, or that 3+ educators agree that a particular pedagogical dimension is not appropriate in the given conversational context.

5.3. Conversation-level pedagogy: teacher feedback

We ran a number of guided conversation-collection experiments in which graduate-level experts interacted with two different tutors, role-playing as learners within their academic domain of expertise (biology, CS, maths, history, English, chemistry, or physics). The conversations with both A
223、I tutorswere grounded in the same educational video and a corresponding scenario,which specified thelearners persona,goal in the conversation(e.g.understanding how sound can be a wave,for a physicsvideo on travelling waves),and other details(see Figure 17c).These pairs of conversations were thenrate
224、d by pedagogical expert raters.First,each individual conversation in the pair was rated againsta pedagogy rubric(see Table 10).In all of these rating experiments,the rubric was applied at theconversation level,as opposed to the turn-level ratings described in the previous sections.Figure 7|Paired t-
test effect sizes (with Holm-Bonferroni adjustment) comparing pairs of conversation-level ratings of Gemini 1.0 and LearnLM-Tutor. Dark indicates statistical significance (p < 0.05); n > 100, with the exceptions of Adapts To Affect (n = 38), Unstuck (n = 51), and Guides Mistake Discovery (n = 44). A full description of each question can be found in Table 10.

Figure 7 shows the effect sizes of the difference in ratings between pairs of prompted Gemini 1.0 and LearnLM-Tutor conversations on the same scenario. On average, the LearnLM-Tutor conversations were preferred to Gemini 1.0 on all attributes in the pedagogy rubric, except for No Con
227、tradiction(“Thetutor does not contradict earlier parts of the conversation”).The differences are statistically significantfor Asks Questions(“The tutor makes the student think by asking questions where appropriate”),and Openings(“The tutor keeps the conversation going by giving the student openings
228、to engage”),19Towards Responsible Development of Generative AI for Education:An Evaluation-Driven Approachboth measures of active learning,further corroborating turn-level teacher feedback which showedthat LearnLM-Tutor is better at promoting engagement(Figure 5).Despite the lack of statisticalsigni
229、ficance,the large effect sizes suggest that LearnLM-Tutor has a better ability to encourage activelearning(Active Engagement,Guides to Answer,Asks Questions,Openings),motivate(Stimulates Interest,Adapts to Affect),adapt(Leveling,Unstuck),and manage the learners cognitive load(Analogies).5.4.Side-by-
side pedagogy: teacher feedback

As part of the same study, we also asked raters to rank pairs of conversations with prompted Gemini 1.0 and LearnLM-Tutor that had been elicited with the same scenario. The rankings were according to five broad criteria, including an adapted version of the most widely used human evaluation questions from the GenAI for Education literature 98 ("In which conversation was the tutor most like an excellent human tutor?", "In which conversation did the tutor seem to better understand the student?" and "In which conversation did the tutor better help the student?"; see Table 11 for the question overview). Average preference rankings are presented in Figure 8. The preference for LearnLM-Tutor over Gemini 1.0 was statistically significant (Wilcoxon signed-rank test, p < 0.05) for 4 out of the 5 categories. On accuracy, there was no preference, consistent with the results presented in Section 4.1. (A minimal sketch of this ranking test follows Figure 8.)
Figure 8 | Average pairwise conversation rankings between Gemini 1.0 and LearnLM-Tutor for five high-level comparison statements. Dark indicates statistical significance (p < 0.05) using a Wilcoxon signed-rank test (n = 189).
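As an illustration of the side-by-side analysis, the sketch below applies a Wilcoxon signed-rank test to paired preference data. The encoding of each rater preference as +1/-1 and the synthetic proportions are assumptions made for the example, not a description of the actual rating pipeline.

```python
# Illustrative sketch of a paired preference analysis in the style of Figure 8.
# The data and the +1/-1 preference encoding are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# For each of n = 189 conversation pairs, one preference per criterion:
# +1 if LearnLM-Tutor's conversation was ranked higher, -1 if Gemini 1.0's was.
n_pairs = 189
preferences = rng.choice([-1, 1], size=n_pairs, p=[0.4, 0.6])

# Wilcoxon signed-rank test on the paired differences (zero differences,
# i.e. exact ties, would be discarded by the default zero_method).
res = stats.wilcoxon(preferences, alternative="two-sided")
print(f"statistic = {res.statistic:.1f}, p = {res.pvalue:.4f}")
```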
5.5. Progress over time

We also show evidence of progress over time in Table 15 and Figure 19 in the Supplementary Materials, which compare turn-level and conversation-level ratings obtained from pedagogical experts between earlier versions of LearnLM-Tutor (versions 0-3) and the latest version (version 4). These results show clear progress in turn-level pedagogy, as well as progress on all of the conversation-level pedagogy criteria with the exception of Manageable Chunks, Guides to Answer ("The tutor does not give away answers too quickly"), and Expresses Uncertainty. The regression in Guides to Answer is in direct contrast to a significant improvement in Questions Appropriately, which is naturally opposed. Over time we steered the model to exhibit Guides to Answer behaviour less, after receiving feedback that earlier models would unnecessarily ask questions of users, slowing their learning and leading to frustration.

6. Automatic Evaluations

While human evaluation is the gold standard for assessing model quality, it suffers from being time-consuming, expensive, and difficult to scale 132,154. To address these limitations, we introduce automatic evaluations (auto-evals) as a complementary approach.

[Figure 9 illustrates the setup with an example for the task "Adapt to learner's level": the prompt contains the exchange Tutor: "What is the square root of 91 in decimal?", Student: "The square root of 91 is 10", with optional additional information available only to the critic (correct answer: "The square root of 91 is 9.54"; lesson materials, e.g. a video transcript). The evaluated AI tutor replies "You are right!" and the AI critic assigns a score of 0.0. Task specifications include: stay on topic, point out the mistake, guide towards answer, point out misconceptions, promote active engagement, and adapt to learner's level.]

Figure 9 | Schematic illustration of the language model evaluations. For each pedagogy dimension we define a particular task specification. Each task consists of a dataset of prompts, where each sample from the dataset contains the prompt that will be given to the evaluated AI tutor, and optionally additional information that is given to the AI critic. Each AI critic also gets a particular task-specific prompt. These critics are then asked to score the AI tutor samples.
6.1. Language Model Evaluations (LME)

Inspired by the success of large language models (LLMs) as judges in various domains 104,155,156, we propose a framework leveraging LLM-based critics to automatically assess tutor responses across a range of qualitative educational criteria (see Figure 9). Our automatic evaluation framework consists of a task specification (see Table 2 for an overview) and, for each task, a dataset of input prompts and a critic LLM conditioned on a task-specific prompt (see Section K for more details). A minimal sketch of this critic loop is given after Table 2.

Pedagogy Dimension | Metrics
Manage cognitive load | Stay on topic
Encourage active learning | Do not reveal the answer; guide towards the answer; promote active engagement
Deepen metacognition | Identify and address misconceptions
Motivate and stimulate curiosity | Communicate with positive tone; respond appropriately to explicit affect cues
Adapt to the learner's goals and needs | Adapt to the learner's level

Table 2 | Examples of LME metrics along several dimensions of pedagogy.
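To make the LME setup concrete, the sketch below shows the general shape of a critic-based evaluation loop. It is a hypothetical illustration only: `generate_tutor_response` and `score_with_critic` would be calls to the evaluated tutor model and the critic LLM, and the prompt wording is invented rather than taken from the report.

```python
# Hypothetical sketch of a critique-based auto-eval loop in the spirit of
# Figure 9. The model-calling functions and prompt text are placeholders.
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Optional

@dataclass
class EvalSample:
    conversation: str                 # prompt shown to the evaluated AI tutor
    privileged_info: Optional[str]    # e.g. correct solution, shown only to the critic

CRITIC_PROMPT = (
    "You are grading an AI tutor. Task: {task}.\n"
    "Conversation so far:\n{conversation}\n"
    "Additional information (not visible to the tutor): {privileged_info}\n"
    "Tutor response to grade:\n{response}\n"
    "Return a score between 0.0 (fails the task) and 1.0 (fully satisfies it)."
)

def run_task(task: str,
             dataset: list[EvalSample],
             generate_tutor_response: Callable[[str], str],
             score_with_critic: Callable[[str], float]) -> float:
    """Average critic score of the tutor over one task-specific dataset."""
    scores = []
    for sample in dataset:
        response = generate_tutor_response(sample.conversation)
        critic_input = CRITIC_PROMPT.format(
            task=task,
            conversation=sample.conversation,
            privileged_info=sample.privileged_info or "none",
            response=response,
        )
        scores.append(score_with_critic(critic_input))
    return mean(scores)

# Example: a single sample from a "Point out the mistake" style task.
dataset = [EvalSample(
    conversation="Tutor: What is the square root of 91 in decimal?\n"
                 "Student: The square root of 91 is 10",
    privileged_info="Correct answer: the square root of 91 is 9.54",
)]
```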
While prompting gen AI to generate pedagogically valid tutor responses is hard (as discussed in Section 3.3.1), we find that prompting gen AI to evaluate pedagogical dimensions (for critique-based auto-evaluations) is more successful. This is partly because evaluation may be an easier task in general 132, and partly because we break down pedagogy into specific dimensions, so that each critic only needs to evaluate a very specific capability in response to a dataset of prompts targeted at eliciting that capability. Our LLM critics also get access to privileged information (e.g. the correct solution when judging whether an AI tutor can correctly identify a learner mistake). Finally, we can leverage much larger and more capable LLMs for evaluations, which would not be feasible due to cost and latency considerations in a user-facing system.

Defining clear pedagogy tasks and creating pedagogy datasets that capture the nuances of good teaching is still a complex endeavour, introducing additional layers of difficulty beyond the typical issu
248、es of noisy metrics and imperfect critic judgement inherent to automated evaluation.Furthermore,while in theory critic LLMs offer a scalable and efficient approach to evaluating tutor models,in practicetheir development presents several challenges.For example,capturing the nuances of pedagogicalgoal
249、s or certain subjective aspects of effective tutoring,such as empathy and encouragement,withina critic prompt can be challenging.The resulting prompt ambiguity may lead to inaccurate orinconsistent critic evaluations.Critic prompts may also overfit to the validation set used during theirdevelopment,
250、and may fail to generalise to new,more subtly pedagogically flawed model responsesor evaluation scenarios.We believe that understanding the rationale behind the LLM critic scores iscrucial for building trust in the evaluation process and ensuring actionable insights,and is an importantdirection for
251、future work.While perfect critique-based evaluation accuracy remains a distant goal,we find that this automatic evaluation approach is still useful in practice and is essential for makingrapid model development progress by offering quick insights into the pedagogical capabilities of theAI tutor,as d
escribed next.

Context | Tutor Response | Critic Score
Task: Adapt to Learner's Goals and Needs. Learner: "I want to know more about money supply. I already know the basics. I want to learn what's beyond AP economics." | LearnLM-Tutor: "Sure, I can help with that. Let's test your understanding of the equation of exchange. Can you tell me the formula for the equation of exchange?" | 1.0
(same context) | Gemini 1.0: "Great! So you know that the money supply is controlled by the central bank, which uses monetary policy to influence economic activity?" | 0.0
Task: Identify and address misconceptions. Learner: "Dissolving salt in water gives us a negatively charged sodium ion and a positively charged Chlorine ion." | LearnLM-Tutor: "Actually, sodium loses its one valence electron to chlorine, giving sodium a positive charge and chlorine a negative charge. What else do you know about this process?" | 1.0
(same context) | Gemini 1.0: "Great! It sounds like you have a good understanding of the basics of how salt dissolves in water. Can you explain to me in your own words what happens when salt dissolves in water?" | 0.0

Table 3 | Examples of AI tutor responses on the auto-eval pedagogy tasks along with their critic scores.

6.1.1. Results
The development of LearnLM-Tutor, from version 0 to version 4, was primarily guided by iterative improvements based on the automatic evaluation metrics for the pedagogical tasks. To ensure that these improvements translated to actual gains in pedagogical effectiveness, we conducted a human evaluation at the end of each training phase. The average scores from the automatic evaluations showed a surprisingly strong correlation with human judgements, as illustrated in Figure 10a. This correlation, along with the clear trend of improvement from version 0 to version 4, underscores the value of automatic evaluations for rapid progress, while also highlighting the importance of human evaluation as a final arbiter of pedagogical effectiveness.
We compared the pedagogical capabilities of our fine-tuned model, LearnLM-Tutor, with prompted Gemini 1.0 across various categories (see Section 8 for further examples of auto-evals targeting more specific educational capabilities). Table 3 presents a number of qualitative examples of tutor-
generated responses from both LearnLM-Tutor and Gemini 1.0 with their respective critic judgements on a few of our auto-evaluation tasks. The LLM critic scores of model responses averaged across the evaluation dataset are shown in Figure 10b.

Figure 10 | LME auto-evaluation results. (a) The average pedagogy auto-eval scores appear to track the average turn-based human pedagogy scores. (b) Critic-assigned scores for responses generated by the prompted Gemini 1.0 (base model) and our fine-tuned LearnLM-Tutor model, across different pedagogy metrics.

Compared to Gemini 1.0, LearnLM-Tutor scored higher on actively engaging learners with the learning materials ("Promote active engagement"), reflecting the core pedagogical principles incorporated during its fine-tuning process and our human evaluation findings in Section 5
262、.Furthermore,when presented with our dataset of incorrect answers and flawedreasoning,LearnLM-Tutor demonstrated a superior capacity to pinpoint the specific mistakes andprovide tailored feedback or explanations(“Point out misconceptions”).LearnLM-Tutor also receivedhigher average critic scores on p
263、roviding step-by-step guidance towards the correct answer(“Guidetowards answer”),and was able to steer the conversation back to the topic of the lesson better thanGemini 1.0(“Stay on topic”),which is an important attribute identified through our participatoryworkshops to help learners maintain focus
264、 and minimise distractions.These results suggest thatfine-tuning can enhance several capabilities that are essential for effective tutoring over and aboveeven strong prompt engineering 1 used for Gemini 1.0(also supported by the human evaluationspresented in Section 5).6.2.Scoring human pedagogy wit
h gen AI tutors

This section proposes another approach to fast evaluation of pedagogy in gen AI. Unlike the approach described in Section 6.1, which provides a detailed breakdown of the tutor performance along the different pedagogical dimensions, the approach proposed here is based on the intuition that as AI tutors develop a better understanding of effective pedagogy, human pedagogical dialogue should become increasingly likely under the distribution learned by these models.

To test this hypothesis we calculated the token-length-normalised log-probability of each tutor message in the Human tutoring data described in Section 3.4, and normalised it by the token-length-normalised log-probability of statistically similar non-pedagogical conversations (see Section L for more details). Unlike the metrics described in Section 4.2, which measure how generally human-like a model sample is (without a focus on pedagogy), the newly proposed approach attempts to discount general non-pedagogical fluency by normalising against it. While the metrics described in Section 4.2 measure how similar a particular sample from the model is to a particular instance of a human pedagogical response, the newly proposed approach directly measures the log-probability of pedagogical tutor turns under the model.
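The sketch below illustrates one way such a score could be computed, assuming access to a model's per-token log-probabilities. It is not the report's implementation: the helper function is hypothetical, the exact normalisation is detailed in Section L, and whether the normalisation is a ratio or a difference of log-probabilities is not specified here, so the sketch uses a difference.

```python
# Illustrative sketch (see assumptions above) of a length-normalised pedagogy
# score: log-probability of a human tutor turn under the model, normalised by
# that of a statistically similar non-pedagogical message.
from typing import Sequence

def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Token-length-normalised log-probability of one message."""
    return sum(token_logprobs) / len(token_logprobs)

def normalised_pedagogy_score(
    tutor_turn_logprobs: Sequence[float],
    control_turn_logprobs: Sequence[float],
) -> float:
    """Higher values mean the model finds the pedagogical turn relatively more
    likely than a matched non-pedagogical control turn."""
    return mean_logprob(tutor_turn_logprobs) - mean_logprob(control_turn_logprobs)

# Example with made-up per-token log-probabilities scored under some model:
pedagogical_turn = [-1.2, -0.8, -2.1, -0.5]      # human tutor message
control_turn = [-2.0, -1.9, -2.4, -2.2, -1.7]    # similar non-pedagogical message
print(normalised_pedagogy_score(pedagogical_turn, control_turn))
```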
Figure 11 | The proposed automatic evaluation measure appears to agree with the human turn-level pedagogy evaluation scores described in Section 5.2.

Figure 11 suggests that the pedagogical utterances from human teachers are more likely under LearnLM-Tutor compared to its weaker predecessors 8. Additionally, the proposed measure appears to track the human turn-based pedagogy scores well, providing a degree of validation. Furthermore, LearnLM-Tutor appears to understand human pedagogy significantly better than the prompted base Gemini 1.0 from which it was fine-tuned (t = 2.05, p = 0.04). Table 4 shows some qualitative examples of the different conversational snippets extracted from the full dialogue context that was scored by the models, and the
ir respective normalised pedagogy scores. Note that the pedagogical conversations that we used in this section suffer from a number of issues (e.g. some turns are presented out of order due to the real-time nature of the human messaging, and some messages describe personal experiences of the human tutors; see Section L for more details) that make them unsuitable for training AI tutors (as demonstrated by the sub-optimal pedagogy of the version 0 and version 1 models). However, since there is no reason to expect that the different models are affected differently by these issues, we believe that this data can be safely used to compare t
he performance of different AI tutors.

Conversation | Tutor Model | Score
Learner: "A lot of the time I found it hard to just not read off the presentation board, like you say above about clearly labeling and transitioning between each section; would you say having some notes in hand would be better in this case?" Tutor: "Having some notes in hand can be a helpful approach to strike a balance between staying on track with your presentation and avoiding the pitfall of reading directly from the presentation board." | LearnLM-Tutor | 3.45
(same snippet) | LearnLM-Tutor (version 3) | -0.05
(same snippet) | LearnLM-Tutor (version 2) | -0.6
(same snippet) | Gemini 1.0 | 1.52
Tutor: "You're on an amazing streak!" Tutor: "One last one" Learner: "thank you! kk!" Tutor: "What's -4 raised to the power of 5?" | LearnLM-Tutor | 3.41
(same snippet) | LearnLM-Tutor (version 3) | 1.98
(same snippet) | LearnLM-Tutor (version 2) | 1.82
(same snippet) | Gemini 1.0 | 1.55

Table 4 | Qualitative examples of how different tutor models score different snippets of pedagogical conversations between a human learner and a human tutor. Conversation presents the last few turns of the conversational dialogue, with the emphasised script indicating the tutor turn that was actually scored by the different AI tutor models. Score refers to the Normalised Pedagogy Score that roughly indicates how likely each model regards the scored utterance (higher is better).

8 Versions 0 and 1 were trained on the data used to perform this evaluation and hence had to be excluded from the analysis; thus, only results from versions 2 and 3 are shown.
278、perform this evaluation and hence had to be excluded from the analysis;thus,only results from2and3are shown.24Towards Responsible Development of Generative AI for Education:An Evaluation-Driven Approach(a)Video overview mode.(b)Course overview mode.Figure 12|HallMate Chrome extension integrated into
279、 the ASU StudyHall CSE110 course.7.Learning from real-world interactions:The ASU Study Hall programAll of the human-and auto-evaluations described in Sections 5 and 6 provided a consistent signal thatLearnLM-Tutor improved over Gemini 1.0 on a number of pedagogical dimensions.To understandhow learne
280、rs would use LearnLM-Tutor in a formal,real-world academic setting,we turned back to aparticipatory approach and partnered with Arizona State University(ASU)to integrate LearnLM-Tutorinto ASUs Study Hall.Study Hall is a partnership between ASU,Crash Course,and YouTube thatoffers a pathway to college
281、 credit,and is accessible to learners of all ages and backgrounds.StudyHall,with its open enrollment and no prerequisites,attracts a diverse group of learners from ages 14to 72,from first-time college students building confidence,to career-minded professionals seekingnew skills.The broad appeal and
282、universal access of Study Hall provides a unique opportunity to testinnovative teaching methods and explore the role of AI tutors in supporting learner success.For this study,we embedded LearnLM-Tutor into a Chrome extension chatbot named HallMate.Students in the Study Hall course CSE 110(Introducti
283、on to Programming)were offered HallMate asa study tool available in their Google Classroom.HallMate operated in two modes:video explanationmode,and course overview mode(see Figure 12).Whenever a video lecture was present,HallMatewas grounded on the transcript of that video and was able to guide the
284、learners deeper into the videocontent and recommend other related course videos.Otherwise,HallMate was grounded on courseoverview materials and could direct learners to relevant content,provide scaffolded homework help,and assist with time management and broad learning strategies.Of the=113 learners
Of the n = 113 learners from CSE110 who opted in to use HallMate, 74 interacted with HallMate, with usage variable amongst learners. See Figure 13 (and Figure 1) for example snippets of student conversations with HallMate. Semi-structured interviews 157 with n = 10 learners who actively used HallMate revealed that they primarily use
286、d it to help write,understand,and debug code(8/10 interviewees)and tounderstand computer science definitions and concepts(6/10 of interviewees).Most learners preferredHallMate over other course support options(9/10 interviewees),primarily for its instant responses.Three of these learners also apprec
287、iated HallMates anonymity,compared to seeking help from peersor instructors.Half of the learners(5/10 interviewees)felt they needed to fact-check HallMate orthat its responses were not trustworthy.Three learners did not feel confident that they could continuelearning to code without access to HallMa
288、te,with one learner saying they would have given up onthe course sooner without it.While most learners did not disclose any emotional connections fromtheir interactions,one learner reported a deep emotional engagement with HallMate.Comments of25Towards Responsible Development of Generative AI for Ed
289、ucation:An Evaluation-Driven ApproachI dont understand wrapper classOk,lets start with a few questions to make sure were on the same page.Can you tell me what a primitive data type is?a variable that stores data Yes,thats right.A primitive data type is a basic data type that is built into the progra
290、mming language.Examples of primitive data types include integers,floating-point numbers,characters,and booleans.Can you tell me what a class is?yesCan you give me a little bit more detail about what you mean?noThats ok.A class is a blueprint for creating objects.It defines the characteristics and be
291、haviors of the objects that can be created from it.For example,we could create a class called Car that defines the characteristics of a car,such as its make,model,and year.We could also define behaviors of a car,such as driving,stopping,and turning.Once we have created a class,we can create objects
292、from that class.Each object will have its own set of characteristics and behaviors,but they will all share the same characteristics and behaviors defined in the class.Does that make sense?LearnerLearnerLearnerLearnerprintfOk,what about printf?Are you having trouble with a particular error,or do you
293、just want to learn more about how it works?reached end of file while parsingCan you share the code snippet where youre seeing this error?That will help me understand whats going on.check if a string is a palindromeWhat needs to happen inside the method body in order to determine whether a string is
294、a palindrome?LearnerLearnerLearnerLearnerDo you have any thoughts on how you might do that?count the string length and test the first character to the last characterLearnerThats a good start!One way to do this would be to count the string length and then test the first character to the last characte
295、r,and so on.If all the characters match,then the string is a palindrome.Another way to do this is to use two pointers,one at the beginning of the string and one at the end.You can then move the pointers towards each other,comparing the characters as you go.If all the characters match,then the string
296、 is a palindrome.public static String isPalindrome(String var1)boolean test=false;return test;LearnLM-TutorLearnLM-TutorLearnLM-TutorLearnLM-TutorLearnLM-TutorLearnLM-TutorLearnLM-TutorLearnLM-TutorLearnLM-TutorFigure 13|Conversations between ASU Study Hall Introduction to Programming learners and L
297、earnLM-Tutor.these learners indicated that HallMate may serve as a safe place to ask questions and get help:“There were points where I was like,I am done.I cant do this anymore.But then I wouldgo to Hallmate,and I would be like,Hey is there something wrong with my problem?And it would help me figure
298、 it out step-by-step.”P107Weekly meetings with Study Hall faculty provided a forum to gather feedback on HallMatesperformance,identify areas for improvement,and collaboratively enhance the tools effectiveness forfuture cohorts.This ongoing dialogue helped the development of LearnLM-Tutor to remain r
299、esponsiveto the needs and perspectives of both learners and educators.Based on the learner interviews andfaculty feedback,future improvements to Hallmate include:continuing to improve the pedagogy ofHallmate,aligning to ASU faculty preferences(e.g.,pointing to resources or providing pseudocodewhen a
300、 learner asks a quiz question);providing onboarding support for learners unfamiliar withchatbots;improving grounding in course material;and providing additional guardrails and help inthe case of learners sharing that they are in distress.8.Evaluating particular educational capabilitiesApart from the
301、 holistic evaluations of the pedagogical effectiveness of gen AI tutors described in theprevious sections,sometimes it is useful to have more targeted evaluations that shed light on how thetutors perform in particular phases of a conversational learning session.In this section we describetwo case st
udies of developing such evaluations: one for the evaluative practice phase of the mastery loop, and the other measuring the quality of tutor feedback when working with a learner on procedural homework problems.

(a) A: "This is easy, it is Le Havre." F: "You are correct! By the way, although Le Havre is the largest city, Rouen is the largest metropolis."
(b) A: "Rouen, with nearly half a million people." F: "Absolutely, as a metropolis, Rouen is the largest in Normandy."
(c) A: "I am not sure, but believe it is Rouen." F: "Close but not exactly, Rouen is the largest metropolis but not the largest city."
(d) A: "I am not sure about city vs. metropolis but, if I remember correctly, Rouen is the largest city." F: "Great job distinguishing between a city and a metropolis, but Rouen is actually the largest metropolis while Le Havre is the largest city."

Figure 14 | Possible answer (A) and feedback (F) combinations in an evaluative practice session on the geography of Normandy in response to the question "What is the largest city in Normandy?". Note that Le Havre is the largest city in Normandy, while Rouen is the largest metropolis.

8.1. Evaluative practice

Knowledge as
306、sessment is a crucial part of the learning process and one of the most talked aboutcapabilities during the teacher workshop described in Section 2.In order to do well,it requires acomplex dialog interaction between the learner and the tutor.Consider,for example,several possibleanswer and feedback pa
irs in an evaluative practice session on the geography of Normandy shown in Figure 14, in response to the question "What is the largest city in Normandy?". These different examples highlight several challenges and opportunities that come up during interactive evaluative practice:

- There can be multiple correct, conflicting answers. This seeming contradiction is resolved by the content in the learner's answer and/or tutor feedback (e.g. explicit mentioning of metropolis).
- There can be multiple and conflicting assessments of the same answer, depending on the level of detail in the learner response and the rigidity of the tutor (compare e.g. (b) and (c)).
- An answer that is strictly wrong (e.g. example (d)) can in fact be a minor mistake if the learner reveals strong understanding of the domain (e.g. the explicit distinguishing of city and metropolis).
- An answer need not necessarily be correct or incorrect. It can be, e.g., a partial or close answer.
- The learner can convey additional information in the response which can lead the tutor to be more or less forgiving, such as uncertainty (as in example (c)).
- Dynamic feedback provides opportunities for complementing with enrichment, e.g. the "By the way..." statement in example (a).

The above is not a comprehensive list, and more difficult questions can lead to still more intricacies of evaluation and feedback. Indeed, this complexity is why the vast majority of previous automated evaluative experiences are limited to rigid forms of multiple choice or short (often single word) answer questions.
With the power of modern gen AI, we can embrace this flexibility and allow for evaluations of conceptual understanding based on open-ended questions.

8.1.1. Automated Metrics

We now describe the automated metrics used to measure the quality of the evaluative practice experience, followed by human evaluation metrics. (A sketch of how the last two metrics can be computed follows the list.)

- Pedagogical conversation flow. Used to assess the extent to which our model follows the evaluative practice schema of question, answer, appropriate feedback, and so on.
- Conversational adaptability. Used to measure how well the model adapts to the user's specific request. It is based on the score returned by a gen AI model that is prompted with the following chain-of-thought approach: "Break down the user's request into separate statements, and score the extent to which these statements are acknowledged in the bot's response."
- Feedback quality. Used to measure the quality of the model's feedback to the user's answer to the question. Since this requires actually knowing the right answer, this metric is applied not to new conversations but rather to a hand-labelled evaluation set where each user answer is given one of four labels: Correct, Incorrect, Partially correct, and Irrelevant. Our tutor model responses are generative and do not come in the form of these four labels. Thus, to measure the performance of our model, we used a trained assessment extraction model that "translates" the feedback of the model into these classes. We then compare the extracted class and compute the overall precision and recall metrics.
- Question difficulty. Used to measure the average and range of question difficulties generated by the model to ensure varied quizzes. We rely on Bloom's taxonomy 158 to map questions to the level of cognitive effort required to answer them: 1) Remember, 2) Understand, 3) Apply, 4) Analyse, 5) Evaluate, 6) Create. The metric is computed using a gen AI model prompted to extract and predict Bloom's taxonomy for each question.
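As a concrete illustration of the last two metrics, the sketch below computes per-class precision and recall for the feedback-quality labels and the average and range of Bloom's-taxonomy levels for a set of questions. The labels and levels are made-up stand-ins for the outputs of the assessment-extraction model and the prompted gen AI classifier described above.

```python
# Illustrative computation of the feedback-quality and question-difficulty
# metrics described above. Inputs are made up; in practice they would come
# from the assessment-extraction model and the prompted Bloom classifier.
from sklearn.metrics import precision_recall_fscore_support

LABELS = ["Correct", "Incorrect", "Partially correct", "Irrelevant"]

# Hand-labelled ground truth vs. labels extracted from the tutor's feedback.
gold = ["Correct", "Incorrect", "Partially correct", "Correct", "Incorrect"]
extracted = ["Correct", "Incorrect", "Correct", "Correct", "Partially correct"]

precision, recall, _, _ = precision_recall_fscore_support(
    gold, extracted, labels=LABELS, zero_division=0)
for label, p, r in zip(LABELS, precision, recall):
    print(f"{label:>17}: precision = {p:.2f}, recall = {r:.2f}")

# Question difficulty: Bloom's taxonomy level (1 = Remember ... 6 = Create)
# predicted for each generated quiz question.
bloom_levels = [2, 3, 1, 4, 2, 5]
print(f"mean difficulty = {sum(bloom_levels) / len(bloom_levels):.2f}, "
      f"range = {min(bloom_levels)}-{max(bloom_levels)}")
```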
8.1.2. Non-Pedagogical Human Evaluation

We rely on a pool of generalist human raters that receive th
319、e task of conducting an evaluative practiceconversation given an initial prompt and instructions about their goal and expected behaviour.Theythen interact separately with two different models based on the same learning scenario.After bothconversations,raters respond to a series of questions on each
320、of the models as well as an overallside-by-side question to decide which model was preferable.The evaluation questions ask raters toassign a score on a five-point scale using the following criteria:Accomplish goal;Helpfulness;Ease ofuse;Engagingness;Reponse Length;Overall Conversation Quality.8.1.3.
321、Pedagogical Expert Human EvaluationWe rely on a pool of pedagogical experts(two per example,with an optional third rater in case ofa tie)to collect deeper feedback on the pedagogical value of the evaluative practice experience.Inthis setup the raters review two evaluative practice conversations abou
322、t the same topic that weregenerated by the generalist human raters mentioned above.The pedagogical raters respond to aseries of questions about the pedagogical value of each conversation,as well as an overall side-by-sidequestion to decide which model was preferable.The evaluative questions ask rate
rs to assign a score on a 3-point scale on the following criteria:

- Accuracy: overall accuracy, question accuracy, feedback accuracy
- Helpfulness and relevance: question and feedback relevance, feedback helpfulness
- Question set quality: to what extent is the question set well formulated?
- Conversational quality: engagingness, response length, context usage, unexpected behaviour
- Overall: which conversation was better as a tutoring conversation?

8.1.4. Results

Using a broad set of "Quiz me about X" (or similar intent) prompts, we compared the performance of base Gemini 1.0 and our fine-tuned tutor LearnLM-Tutor to carry out an evaluative practice experience.
Evaluation type | Metric | Gemini 1.0 | LearnLM-Tutor
Automated | Pedagogical conversation flow | 52% | 80%
Automated | Conversational adaptability | 89% | 87%
Automated | Feedback quality - correct recall | 71% | 82%
Automated | Feedback quality - incorrect recall | 69% | 71%
Automated | Question difficulty | 1.77 | 2.04
Generalist rater | Overall win/loss ratio | 1 | 2.13
Generalist rater | Accomplish goal | 73% | 86%
Generalist rater | Helpfulness | 73% | 86%
Generalist rater | Ease | 70% | 88%
Generalist rater | Engagingness | 77% | 91%
Generalist rater | Response length | 72% | 89%
Pedagogical rater | Overall win/loss ratio | 1 | 2.11
Pedagogical rater | Accuracy | 63% | 67%
Pedagogical rater | Helpfulness and relevance | 65% | 77%
Pedagogical rater | Conversational quality | 54% | 66%
Pedagogical rater | Question set quality | 42% | 46%

Table 5 | Results of evaluative practice evaluations for all three types of evaluations.

Table 5 shows the breakdown of results for all three evaluation types, including the win/loss ratio of LearnLM-Tutor relative to Gemini 1.0. As demonstrated by the automated metrics, Learn
328、LM-Tutor is better in its ability to maintain the pedagogical experience,improving feedback quality andaverage question difficulty,while only slightly degrading the models adaptability.Human raters(both pedagogical experts and generalists)preferred the fine-tuned evaluative practice experienceoveral
329、l at over 2:1 ratio compared to Gemini 1.0,and rated it higher along the other evaluated axes.8.2.Feedback on procedural homework problemsThis section describes how we evaluated LearnLM-Tutors ability to provide conversational feedbackon procedural homework problems,such as maths word problems.Proce
330、dural problems often haveone or few correct solution(s)and require a series of steps a student must perform to reach thatsolution.Despite significant gains in mathematical and multi-hop reasoning as tracked by the commonbenchmarks 121,159161,the performance of AI tutors in providing conversation bas
331、ed feedbackon procedural problems is still inadequate as tutoring is more difficult than just solving a problem itself.When tutoring a student,an AI tutor has to not only solve a presented procedural problem correctly,butalso evaluate the learners(potentially partially correct)solution,identifying a
332、ny misconceptions.TheAI tutor must allow for multiple possible problem solving strategies from the learner,while providinga consistent explanation that a learner can understand.This is at odds with the tendency of gen AImodels to change their solutions to a given problem multiple times within a sing
333、le conversation 162.Additionally,the AI tutor must not exhibit the sycophantic tendencies of LLMs 163 to give properfeedback on mistakes.Existing benchmarks do not evaluate these capabilities.To track progress on improving the quality of LearnLM-Tutors performance on providing feedbackto learner-attempted procedural problems,we developed the following set of progressively harderautomated evaluatio