
IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE

A Survey of Embodied AI: From Simulators to Research Tasks

Jiafei Duan, Samson Yu, Hui Li Tan, Hongyuan Zhu, and Cheston Tan

Abstract—There has been an emerging paradigm shift from the era of "internet AI" to "embodied AI," where AI algorithms and agents no longer learn from datasets of images, videos or text curated primarily from the internet. Instead, they learn through interactions with their environments from an egocentric perception similar to humans. Consequently, there has been substantial growth in the demand for embodied AI simulators to support various embodied AI research tasks. This growing interest in embodied AI is beneficial to the greater pursuit of Artificial General Intelligence (AGI), but there has not been a contemporary and comprehensive survey of this field. This paper aims to provide an encyclopedic survey for the field of embodied AI, from its simulators to its research. By evaluating nine current embodied AI simulators with our proposed seven features, this paper aims to understand the simulators in their provision for use in embodied AI research and their limitations. Lastly, this paper surveys the three main research tasks in embodied AI, namely visual exploration, visual navigation and embodied question answering (QA), covering the state-of-the-art approaches, evaluation metrics and datasets. Finally, with the new insights revealed through surveying the field, the paper will provide suggestions for simulator-for-task selections and recommendations for the future directions of the field.

Index Terms—Embodied AI, computer vision, 3D simulators.

I. INTRODUCTION

RECENT advances in deep learning, reinforcement learning, computer graphics and robotics have garnered growing interest in developing general-purpose AI systems. As a result, there has been a shift from "internet AI" that focuses on learning from datasets of images, videos and text curated from the internet, towards "embodied AI" which enables artificial agents to learn through interactions with their surrounding environments. Embodied AI is the belief that true intelligence can emerge from the interactions of an agent with its environment [1]. But for now, embodied AI is about incorporating traditional intelligence concepts from vision, language, and reasoning into an artificial embodiment to help solve AI problems in a virtual environment.

Manuscript received May 9, 2021; revised September 30, 2021 and November 8, 2021; accepted December 4, 2021. This work was supported in part by the Agency for Science, Technology and Research (A*STAR), Singapore under its AME Programmatic Funding Scheme under Award No. #A18A2b0046 and in part by the National Research Foundation, Singapore under its NRF-ISF Joint Call under Award No. NRF2015-NRF-ISF001-2541. (Corresponding author: Jiafei Duan.) Jiafei Duan is with the Nanyang Technological University of Singapore, School of Electrical and Electronics Engineering, Singapore 639798 (e-mail: duanjiafei@hotmail.sg). Samson Yu is with the Singapore University of Technology and Design, Singapore 487372 (e-mail: samsonyu@mymail.sutd.edu.sg). Hui Li Tan, Hongyuan Zhu, and Cheston Tan are with the Institute for Infocomm Research, A*STAR, Singapore 138632 (e-mail: hltan@i2r.a-star.edu.sg; zhuh@i2r.a-star.edu.sg; cheston-tan@i2r.a-star.edu.sg). Digital Object Identifier 10.1109/TETCI.2022.3141105

The growing interest in embodied AI has led to significant progress in embodied AI simulators that aim to faithfully replicate the physical world. These simulated worlds serve as virtual testbeds to train and test embodied AI frameworks before deploying them into the real world. These embodied AI simulators also facilitate the collection of task-based datasets [2], [3], which are tedious to collect in the real world, as it requires an extensive amount of manual labor to replicate the same setting as in the virtual world. While there have been several survey papers in the field of embodied AI [4]-[6], they are mostly outdated as they were published before the modern deep learning era, which started around 2009 [7]-[10]. To the best of our knowledge, there is only one survey paper on the evaluation of embodied navigation [11].

To address the scarcity of contemporary comprehensive survey papers on this emerging field of embodied AI, we propose this survey paper on the field of embodied AI, from its simulators to research tasks. This paper covers the following nine embodied AI simulators that were developed over the past four years: DeepMind Lab [12], AI2-THOR [13], CHALET [14], VirtualHome [15], VRKitchen [16], Habitat-Sim [17], iGibson [18], SAPIEN [19], and ThreeDWorld [20]. The chosen simulators are designed for general-purpose intelligence tasks, unlike game simulators [21] which are only used for training reinforcement learning agents. These embodied AI simulators provide realistic representations of the real world in computer simulations, mainly taking the configurations of rooms or apartments that provide some form of constraint to the environment. The majority of these simulators minimally comprise a physics engine, a Python API, and an artificial agent that can be controlled or manipulated within the environment.

Embodied AI simulators have given rise to a series of potential embodied AI research tasks, such as visual exploration, visual navigation and embodied QA. We will focus on these three tasks since most existing papers [11], [22], [23] in embodied AI either focus on these tasks or make use of modules introduced for these tasks to build models for more complex tasks like audio-visual navigation. These three tasks are also connected in increasing complexity. Visual exploration is a very useful component in visual navigation [22], [24] and is used for realistic situations [25], [26], while embodied QA further involves complex QA capabilities that build on top of vision-and-language navigation.
Since language is a common modality and visual QA is a popular task in AI, embodied QA is a natural direction for embodied AI. These three tasks discussed in this paper have been implemented in at least one of the nine proposed embodied AI simulators. However, Sim2Real [27]-[29] and robotics in the physical world will not be covered in this paper.

TABLE I: Summary of embodied AI simulators. Environment: game-based scene construction (G) and world-based scene construction (W). Physics: basic physics features (B) and advanced physics features (A). Object Type: dataset driven environments (D) and object assets driven environments (O). Object Property: interact-able objects (I) and multi-state objects (M). Controller: direct Python API controller (P), virtual robot controller (R) and virtual reality controller (V). Action: navigation (N), atomic action (A) and human-computer interaction (H). Multi-Agent: avatar-based (AT) and user-based (U). The seven features can be further grouped under three secondary evaluation features: realism, scalability and interactivity.

These simulators are selected based on the embodied AI simulators from the Embodied AI Challenge in the annual Embodied AI workshop [30] at the Conference on Computer Vision and Pattern Recognition (CVPR). The research tasks are then sourced from direct citations of these simulators.

To this end, we will provide a contemporary and comprehensive survey of embodied AI simulators and research through reviewing the development of the field from its simulators to its research tasks. In Section I, this paper outlines the overview structure of this survey paper. In Section II, this paper benchmarks nine embodied AI simulators to understand their provision for realism, scalability and interactivity, and hence their use in embodied AI research. Next, based upon the simulators, in Section III, this paper surveys the three main research tasks in embodied AI, namely visual exploration, visual navigation and embodied question answering (QA), covering the state-of-the-art approaches, evaluation, and datasets. Lastly, this paper will establish interconnections between the simulators, datasets and research tasks, and existing challenges in embodied AI simulators and research, in Section IV.
This survey paper provides a comprehensive look into the emerging field of embodied AI and further unveils new insights and challenges of the field. Furthermore, through this paper, we seek to help AI researchers in selecting the ideal embodied AI simulators for their research tasks of interest.

II. SIMULATORS FOR EMBODIED AI

In this section, the backgrounds of the embodied AI simulators are presented in the supplementary material, and the features of the embodied AI simulators are compared and discussed in Section II-A.

Fig. 1. Comparison between game-based scenes (G) and world-based scenes (W). Game-based scenes (G) are constructed from 3D object assets, while world-based scenes (W) are constructed based off real-world scans of the environment.

A. Embodied AI Simulators

This section presents the backgrounds of the nine embodied AI simulators: DeepMind Lab, AI2-THOR, SAPIEN, VirtualHome, VRKitchen, ThreeDWorld, CHALET, iGibson, and Habitat-Sim. Readers can refer to the supplementary material for more details on the respective simulators. In this section, the paper comprehensively compares the nine embodied AI simulators based on seven technical features. Referencing [13], [20], [31], these seven technical features are selected as the primary features to evaluate the embodied AI simulators, as they cover the essential aspects required to accurately replicate the environment, interactions and state of the physical world, hence providing suitable testbeds for testing intelligence with embodiment. Referring to Table I, the seven features are: Environment, Physics, Object Type, Object Property, Controller, Action, and Multi-Agent.

1) Environment: There are two main methods of constructing the embodied AI simulator environment: game-based scene construction (G) and world-based scene construction (W). Referring to Fig. 1, the game-based scenes are constructed from 3D assets, while world-based scenes are constructed from real-world scans of the objects and the environment. A 3D environment constructed entirely out of 3D assets often has built-in physics features and object classes that are well-segmented when compared to a 3D mesh of an environment made from real-world scanning.
The clear object segmentation for the 3D assets makes it easy to model them as articulated objects with movable joints, such as the 3D models provided in PartNet [32]. In contrast, the real-world scans of environments and objects provide higher fidelity and a more accurate representation of the real world, facilitating better transfer of agent performance from simulation to the real world. As observed in Table I, most simulators other than Habitat-Sim and iGibson have game-based scenes, since significantly more resources are required for world-based scene construction.

Fig. 2. Comparison between basic physics features (B) such as rigid-body dynamics and collision, and advanced physics features (A), which include cloth, soft-body, and fluid physics.

2) Physics: A simulator has to construct not only realistic environments but also realistic interactions, between agents and objects or between objects and objects, that model real-world physics properties. We study the simulators' physics features, which we broadly classify into basic physics features (B) and advanced physics features (A). Referring to Fig. 2, basic physics features include collision, rigid-body dynamics, and gravity modelling, while advanced physics features include cloth, fluid, and soft-body physics. As most embodied AI simulators construct game-based scenes with in-built physics engines, they are equipped with the basic physics features. On the other hand, simulators like ThreeDWorld, where the goal is to understand how a complex physics environment can shape the decisions of the artificial agent in the environment, are equipped with more advanced physics capabilities. For simulators that focus on interactive navigation-based tasks, basic physics features are generally sufficient.
43、peis the dataset driven environment,where the objects are mainlyfrom existing object datasets such as the SUNCG 33 dataset,the Matterport3D dataset 34 and the Gibson dataset 35.Thesecond type is the asset driven environment,where the objectsare from the net such as the Unity 3D game asset store.Adif

44、ference between the two sources is the sustainability of theobject dataset.The dataset driven objects are more costly tocollect than the asset driven objects,as anyone can contributeto the 3D object models online.However,it is harder to ensureFig.3.Comparison between dataset driven environment(D)whi

45、ch are con-structed from 3D objects datasets and object assets driven environment(O)areconstructed based 3D objects obtain from the assets market.Fig.4.Comparison between direct Python API controller(P),robotics em-bodiment(R)which refers to real-world robots with a virtual replica and lastlythe vir

46、tual reality controller(V).the quality of the 3D object models in the asset driven objectsthan in the dataset driven objects.Based on our review,thegame-based embodied AI simulators are more likely to obtaintheir object datasets from asset stores,whereas the world-basedsimulators tend to import thei

47、r object datasets from existing 3Dobject datasets.4)ObjectProperty:Somesimulatorsonlyenableobjectswithbasicinteractivitysuchascollision.Advancedsimulatorsenableobjects with more fine-grained interactivity such as multiple-state changes.For instance,when an apple is sliced,it willundergo a state chan

48、ge into apple slices.Hence,we catego-rize these different levels of object interaction into simulatorswith interact-able objects(I)and multiple-state objects(M).Referring to Table I,a few simulators,such as AI2-THOR andVRKitchen,enablemultiplestatechanges,providingaplatformfor understanding how obje

49、cts will react and change their stateswhen acted upon in the real world.5)Controller:Referring to Fig.4,there are different typesof controller interface between the user and simulator,fromdirect Python API controller(P)and virtual robot controller(R)tovirtualrealitycontroller(V).Roboticsembodimental

50、lowsforvirtualinteractionofexistingreal-worldrobotssuchasUniversalRobot 5(UR5)and TurtleBot V2,and can be controlled directlyusing a ROS interface.The virtual reality controller interfacesprovide more immersive human-computer interaction and fa-cilitate deployment using their real-world counterparts

51、.For in-stance,simulators such as iGibson and AI2-THOR,which areprimarily designed for visual navigation,are also equipped withvirtualrobotcontrollerforeaseofdeploymentintheirreal-worldcounterpartssuchasiGibsonsCastro36andRoboTHOR37respectively.Authorized licensed use limited to:ASTAR.Downloaded on


6) Action: There are differences in the complexity of an artificial agent's action capabilities in the embodied AI simulators, ranging from only being able to perform primary navigation manoeuvres to higher-level human-computer actions via virtual reality interfaces. This paper classifies them into three tiers of robotics manipulation: navigation (N), atomic action (A) and human-computer interaction (H). Navigation is the lowest tier and is a common feature in all embodied AI simulators [38]. It is defined by the agent's capability of navigating around its virtual environment. Atomic action provides the artificial agent with a means of performing basic discrete manipulation of an object of interest and is found in most embodied AI simulators. Human-computer interaction is the result of the virtual reality controller, as it enables humans to control virtual agents to learn and interact with the simulated world in real time [16]. Most of the larger-scale navigation-based simulators, such as AI2-THOR, iGibson and Habitat-Sim, tend to have navigation, atomic action and ROS support [13], [17], [35], which enable them to provide better control and manipulation of objects in the environment while performing tasks such as Point Navigation or Object Navigation. On the other hand, simulators such as ThreeDWorld and VRKitchen [16], [20] fall under the human-computer interaction category as they are constructed to provide a highly realistic physics-based simulation and multiple state changes. This is only possible with human-computer interaction, as human-level dexterity is needed when interacting with these virtual objects.

7) Multi-Agent: Referring to Table I, only a few simulators, such as AI2-THOR, iGibson and ThreeDWorld, are equipped with a multi-agent setup, as current research involving multi-agent reinforcement learning is scarce. In general, the simulators need to be rich in object content before there is any practical value in constructing such multi-agent features, which are used for both adversarial and collaborative training [39], [40] of artificial agents. As a result of this lack of multi-agent supported simulators, there have been fewer research tasks that utilize the multi-agent feature in these embodied AI simulators. Multi-agent reinforcement learning based training is still currently mostly done in OpenAI Gym environments [41]. There are two distinct multi-agent settings. The first is the avatar-based (AT) multi-agents in ThreeDWorld [20], which allows for interaction between artificial agents and simulation avatars.
The second is the user-based (U) multi-agents in AI2-THOR [13], which can take on the role of a dual learning network and learn from interacting with other artificial agents in the simulation to achieve a common task [42].

B. Comparison of Embodied AI Simulators

Based on the seven features and a study from the Allen Institute for Artificial Intelligence [31] on embodied AI, we propose a secondary set of evaluation features for the simulators. It comprises three key features: realism, scalability and interactivity, as shown in Table I. The realism of the 3D environments can be attributed to the environment and physics of the simulators. The environment models the real world's physical appearance, while the physics models the complex physical properties within the real world. Scalability of the 3D environments can be attributed to the object type. Expansion can be done via collecting more 3D scans of the real world for the dataset driven objects or purchasing more 3D assets for the asset driven objects. Interactivity is attributed to object property, controller, action and multi-agent.

Fig. 5. A pyramid hierarchical structure of the various embodied AI research tasks with increasing complexity of tasks.

Based on the secondary evaluation features, the seven primary features from Table I, and Fig. 6, simulators which possess all three of the secondary features (e.g. AI2-THOR, iGibson and Habitat-Sim) are more well-received and widely used for a diverse range of embodied AI research tasks. Furthermore, a comprehensive quantitative comparison is made for all the embodied AI simulators to compare the environment configuration and the technical performance of each simulator. The environment configuration feature is very much dependent on the applications suggested by the creators of the simulators, while other features like technical specification and rendering performance are largely due to the simulation engine used for its creation. AI2-THOR has the largest environment configurations compared to the other simulators, while Habitat-Sim and iGibson are the top two performers in graphic rendering performance. The benchmark of quantitative performance shown in Table II further demonstrates the superiority and complexity of these three embodied AI simulators. These comparisons of the embodied AI simulators further reinforce the importance of the seven primary evaluation features and the three secondary evaluation features that the paper has established to help select the ideal simulator for the research task.

III. RESEARCH IN EMBODIED AI

In this section, we discuss the various embodied AI research tasks that depend on the nine embodied AI simulators surveyed in the previous section. There are multiple motivations for the recent increase in embodied AI research. From a cognitive science and psychology perspective, the embodiment hypothesis [1] suggests that intelligence arises from interactions with an environment and as a result of sensorimotor activity [66]. Intuitively, humans do not learn solely through the "internet AI" paradigm, where most experiences are randomized and passive (i.e. externally curated). Humans also learn through active perception, movement, interaction and communication. From an AI perspective, current research tasks in embodied AI allow for greater generalization to unseen environments [44] for robotic functions like mapping and navigation, and greater robustness to sensor noise as compared to classical methods, due to the learning involved. Embodied AI also enables flexibility and possibly greater performance, since various modalities like depth, language [59] and audio [67] can be easily integrated through learning-based approaches.

Fig. 6. Connections between embodied AI simulators and research. (Top) Nine up-to-date embodied AI simulators. (Middle) The various embodied AI research tasks arising from the nine embodied AI simulators. The red colored research tasks are grouped under the visual navigation category, while the rest of the yellow colored tasks are the other research categories. (Bottom) The evaluation datasets used in the evaluation of the research tasks in one of the nine embodied AI simulators.

TABLE II: Comparison of embodied AI simulators in terms of environment configuration, simulation engine, technical specification, and rendering performance.

The three main types of embodied AI research tasks are visual exploration, visual navigation and embodied QA. We will focus on these three tasks since most existing papers in embodied AI either focus on these tasks or make use of modules introduced for these tasks to build models for more complex tasks like audio-visual navigation. The tasks increase in complexity as they advance from exploration to QA. We will start with visual exploration before moving to visual navigation and finally embodied QA.
Each of these tasks makes up the foundation for the next task(s), forming a pyramid structure of embodied AI research tasks as shown in Fig. 5, further suggesting a natural direction for embodied AI. We will highlight important aspects for each task, covering the summary, the methodologies, the evaluation metrics, and the datasets. These task details are found in Table III.

TABLE III: Summary of embodied AI research tasks. Evaluation metrics: amount of targets visited (ATV), downstream tasks (D), success weighted by path length (SPL), success rate (SR), path length ratio (PLR), oracle success rate (OSR), trajectory/episode length (TL/EL), distance to success / navigation error (DTS/NE/dT), goal progress (GP/dΔ), oracle path success rate (OPSR), smallest distance to target at any point in an episode (dmin), percentage of episodes in which the agent ends navigation for answering before the maximum episode length (%stop), percentage of questions for which the agent terminates in the room containing the target object (%rT), percentage of questions where the agent enters the room containing the target object at least once (%re), intersection over union for the target object (IoU), hit accuracy based on IoU (hT), mean rank of the ground-truth answer in QA predictions (MR) and QA accuracy (ACC).

A. Visual Exploration

In visual exploration [24], [68], an agent gathers information about a 3D environment, typically through motion and perception, to update its internal model of the environment [11], [22], which might be useful for downstream tasks like visual navigation [24], [25], [69]. The aim is to do this as efficiently as possible (e.g. with as few steps as possible). The internal model can take forms like a topological graph map [26], semantic map [46], occupancy map [45] or spatial memory [70], [71]. These map-based architectures can capture geometry and semantics, allowing for more efficient policy learning and planning [45] as compared to reactive and recurrent neural network policies [72]. Visual exploration is usually done either before or concurrently with navigation tasks. In the first case, visual exploration builds the internal memory as priors that are useful for path-planning in downstream navigation tasks. The agent is free to explore the environment within a certain budget (e.g. a limited number of steps) before the start of navigation [11]. In the latter case, the agent builds the map as it navigates an unseen test environment [48], [73], [74], which makes it more tightly integrated with the downstream task. In this section, we build upon existing visual exploration survey papers [22], [24] to include more recent works and directions.

In classical robotics, exploration is done through passive or active simultaneous localisation and mapping (SLAM) [24], [45] to build a map of the environment. This map is then used with localization and path-planning for navigation tasks. SLAM is very well-studied [75], but the purely geometric approach has room for improvements. Since such methods rely on sensors, they are susceptible to measurement noise [24] and would need extensive fine-tuning. On the other hand, learning-based approaches that typically use RGB and/or depth sensors are more robust to noise [24], [44].
Furthermore, learning-based approaches in visual exploration allow an artificial agent to incorporate semantic understanding (e.g. object types in the environment) [45] and to generalize its knowledge of previously seen environments to help with understanding novel environments in an unsupervised manner. This reduces reliance on humans and thus improves efficiency. Learning to create useful internal models of the environment in the form of maps can improve the agent's performance [45], whether it is done before (i.e. with unspecified downstream tasks) or concurrently with downstream tasks. Intelligent exploration would also be especially useful in cases where the agent has to explore novel environments that dynamically unfold over time [76], such as rescue robots and deep-sea exploration robots.

1) Approaches: The non-baseline approaches in visual exploration are typically formalized as partially observed Markov decision processes (POMDPs) [77]. A POMDP can be represented by a 7-tuple (S, A, T, R, Ω, O, γ) with state space S, action space A, transition distribution T, reward function R, observation space Ω, observation distribution O and discount factor γ ∈ [0, 1]. In general, these approaches can be viewed as particular reward functions in the POMDP [22].
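As a concrete illustration of this formalism, the sketch below (not taken from the surveyed works; the field layout and default discount are assumptions) packages the 7-tuple in code, with the reward function left as the component that each exploration approach customizes:

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

# Illustrative container for the POMDP 7-tuple (S, A, T, R, Omega, O, gamma).
# Each exploration approach (curiosity, coverage, reconstruction) amounts to
# a different choice of the reward function R within this formalism.
@dataclass
class POMDP:
    states: Sequence[Any]                            # S: state space
    actions: Sequence[Any]                           # A: action space
    transition: Callable[[Any, Any, Any], float]     # T(s' | s, a)
    reward: Callable[[Any, Any], float]              # R(s, a)
    observations: Sequence[Any]                      # Omega: observation space
    observation_fn: Callable[[Any, Any], float]      # O(o | s', a)
    gamma: float = 0.99                              # discount factor in [0, 1]
```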

a) Baselines: Visual exploration has a few common baselines [22]. For random-actions [17], the agent samples from a uniform distribution over all actions. For forward-action, it always chooses the forward action. For forward-action+, the agent chooses the forward action, but turns left if it collides. For frontier-exploration, it visits the edges between free and unexplored spaces iteratively using a map [24], [78].

b) Curiosity: In the curiosity approach, the agent seeks states that are difficult to predict. The prediction error is used as the reward signal for reinforcement learning [79], [80]. This focuses on intrinsic rewards and motivation rather than external rewards from the environment, which is beneficial in cases where external rewards are sparse [81]. There is usually a forward-dynamics model that minimises the loss L(ŝ_{t+1}, s_{t+1}), where ŝ_{t+1} is the predicted next state if the agent takes action a_t in state s_t, while s_{t+1} is the actual next state that the agent ends up in.
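A minimal sketch of this intrinsic-reward idea, assuming PyTorch and arbitrary embedding dimensions (illustrative only, not the implementation used in the cited works):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy forward-dynamics model: predicts the next state embedding from the
# current state embedding and a one-hot action. Dimensions are arbitrary.
class ForwardDynamics(nn.Module):
    def __init__(self, state_dim: int = 128, action_dim: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, state_dim),
        )

    def forward(self, state: torch.Tensor, action_onehot: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action_onehot], dim=-1))

def curiosity_reward(model: ForwardDynamics,
                     s_t: torch.Tensor,
                     a_t: torch.Tensor,
                     s_next: torch.Tensor) -> torch.Tensor:
    """Intrinsic reward = prediction error of the forward-dynamics model.
    The same error is also the loss used to train the model."""
    s_pred = model(s_t, a_t)
    return F.mse_loss(s_pred, s_next, reduction="none").mean(dim=-1)
```

In practice the prediction is usually made on learned feature embeddings rather than raw pixels, and the policy is optimized with an algorithm such as PPO, as noted next.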

Practical considerations for curiosity have been listed in recent work [79], such as using Proximal Policy Optimization (PPO) for policy optimisation. Curiosity has also been used to generate more advanced maps like semantic maps in recent work [43]. Stochasticity poses a serious challenge in the curiosity approach, since the forward-dynamics model can exploit stochasticity [79] to obtain high prediction errors (i.e. high rewards). This can arise due to factors like the "noisy-TV" problem or noise in the execution of the agent's actions [81]. One proposed solution is the use of an inverse-dynamics model [68] that estimates the action a_{t-1} taken by the agent to move from its previous state s_{t-1} to its current state s_t, which helps the agent understand what its actions can control in the environment. While this method attempts to address stochasticity due to the environment, it may be insufficient in addressing stochasticity that results from the agent's actions. One example is the agent's use of a remote controller to randomly change TV channels, allowing it to accumulate rewards without progress. To address this more challenging issue specifically, a few methods have been proposed recently. Random Network Distillation [82] is one method that predicts the output of a randomly initialized neural network, as the answer is a deterministic function of its inputs. Another method is Exploration by Disagreement [81], where the agent is incentivised to explore the part of the action space which has the maximum disagreement or variance between the predictions of an ensemble of forward-dynamics models. The models converge to the mean, which reduces the variance of the ensemble and prevents it from getting stuck in stochasticity traps.
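Roughly, the disagreement signal described above could look like the following sketch (illustrative only; the ensemble members are assumed to be forward-dynamics models such as the one sketched earlier, and the tensor shapes are assumptions):

```python
import torch

def disagreement_reward(ensemble, s_t: torch.Tensor, a_t: torch.Tensor) -> torch.Tensor:
    """Intrinsic reward = variance across an ensemble of forward-dynamics models.

    `ensemble` is any iterable of models mapping (state, action) -> predicted
    next-state embedding. For deterministic transitions the trained models
    agree, so the reward vanishes; for irreducibly stochastic transitions the
    models converge to the mean prediction, which again shrinks the variance
    and keeps the agent out of stochasticity traps.
    """
    preds = torch.stack([m(s_t, a_t) for m in ensemble], dim=0)  # (K, B, D)
    return preds.var(dim=0).mean(dim=-1)                          # (B,)
```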

c) Coverage: In the coverage approach, the agent tries to maximise the amount of targets it directly observes. Typically, this would be the area seen in an environment [22], [24], [44]. Since the agent uses egocentric observations, it has to navigate around possibly obstructive 3D structures. One recent method combines classic and learning-based methods [44]. It uses analytical path planners with a learned SLAM module that maintains a spatial map, to avoid the high sample complexities involved in training end-to-end policies. This method also includes noise models to improve physical realism for generalisability to real-world robotics. Another recent work is a scene memory transformer, which uses the self-attention mechanism adapted from the Transformer model [83] over the scene memory in its policy network [72]. The scene memory embeds and stores all encountered observations, allowing for greater flexibility and scalability as compared to a map-like memory that requires inductive biases.

d) Reconstruction: In the reconstruction approach, the agent tries to recreate other views from an observed view. Past work focuses on pixel-wise reconstructions of 360° panoramas and CAD models [84], [85], which are usually curated datasets of human-taken photos [45]. Recent work has adapted this approach for embodied AI, which is more complex because the model has to perform scene reconstruction from the agent's egocentric observations and the control of its own sensors (i.e. active perception). In one recent work, the agent uses its egocentric RGB-D observations to reconstruct the occupancy state beyond visible regions and aggregates its predictions over time to form an accurate occupancy map [45]. The occupancy anticipation is a pixel-wise classification task where each cell in a local area of V x V cells in front of the camera is assigned probabilities of being explored and occupied. As compared to the coverage approach, anticipating the occupancy state allows the agent to deal with regions that are not directly observable.
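A toy sketch of the occupancy-anticipation output just described (the network layers, the grid size V and the two-channel explored/occupied layout are assumptions for illustration, not the cited architecture):

```python
import torch
import torch.nn as nn

# Predicts, for a V x V grid of cells in front of the camera, the probability
# that each cell is (explored, occupied), from an egocentric RGB-D frame.
class OccupancyAnticipation(nn.Module):
    def __init__(self, v: int = 64):
        super().__init__()
        self.v = v
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, 3, stride=2, padding=1), nn.ReLU(),  # 4-channel RGB-D input
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )
        self.decoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, v * v * 2),
        )

    def forward(self, rgbd: torch.Tensor) -> torch.Tensor:
        logits = self.decoder(self.encoder(rgbd))
        # Two probability maps per cell: "explored" and "occupied".
        return torch.sigmoid(logits).view(-1, 2, self.v, self.v)
```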

Another recent work focuses on semantic reconstruction rather than pixel-wise reconstruction [22]. The agent is designed to predict whether semantic concepts like "door" are present at sampled query locations. Using a K-means approach, the true reconstruction concepts for a query location are the J nearest cluster centroids to its feature representation. The agent is rewarded if it obtains views that help it predict the true reconstruction concepts for sampled query views.

2) Evaluation Metrics: The basic metric is the amount of targets visited. Different types of targets are considered, such as area [44], [86] and interesting objects [72], [87]. The area visited metric has a few variants, such as the absolute coverage area in m² and the percentage of the area explored in the scene.

a) Impact on downstream tasks: Visual exploration performance can also be measured by its impact on downstream tasks like visual navigation. This evaluation metric category is more commonly seen in recent works. Examples of downstream tasks that make use of visual exploration outputs (i.e. maps) include Image Navigation [26], [73], Point Navigation [11], [44] and Object Navigation [53], [54], [56]. More details about these navigation tasks can be found in Section III-B.

3) Datasets: For visual exploration, some popular datasets include Matterport3D and Gibson V1. Matterport3D and Gibson V1 are both photorealistic RGB datasets with useful information for embodied AI like depth and semantic segmentations. The Habitat-Sim simulator allows for the usage of these datasets with extra functionalities like configurable agents and multiple sensors. Gibson V1 has also been enhanced with features like interactions and realistic robot control to form iGibson. However, more recent 3D simulators like those mentioned in Section II can all be used for visual exploration, since they all offer RGB observations at the very least.

B. Visual Navigation

In visual navigation, an agent navigates a 3D environment to a goal with or without external priors or natural language instruction. Many types of goals have been used for this task, such as points, objects, images [88], [89] and areas [11]. We will focus on points and objects as goals for visual navigation in this paper, as they are the most common and fundamental goals. They can be further combined with specifications like perceptual inputs and language to build towards more complex visual navigation tasks, such as Navigation with Priors, Vision-and-Language Navigation and even Embodied QA. In point navigation [49], the agent is tasked to navigate to a specific point, while in object navigation [38], [52], the agent is tasked to navigate to an object of a specific class.

Classic navigation approaches [90] are usually composed of hand-engineered sub-components like localization, mapping [91], path planning [92], [93] and locomotion. Visual navigation in embodied AI instead aims to learn these navigation systems from data, so as to reduce case-specific hand-engineering and hence ease integration with downstream tasks, such as question answering [23], that perform better with data-driven learning methods. There are also hybrid approaches [44] that aim to combine the best of both worlds. As previously mentioned in Section II, learning-based approaches are more robust to sensor measurement noise as they use RGB and/or depth sensors and are able to incorporate semantic understanding of an environment. Furthermore, they enable an agent to generalize its knowledge of previously seen environments to help understand novel environments in an unsupervised manner, reducing human effort.

Along with the increase in research in recent years, challenges have also been organised for visual navigation in the fundamental point navigation and object navigation tasks to benchmark and accelerate progress in embodied AI [38]. The most notable challenges are the iGibson Sim2Real Challenge, the Habitat Challenge [36] and the RoboTHOR Challenge. For each challenge, we will describe the 2020 version, which is the latest as of this paper. In all three challenges, the agent is limited to egocentric RGB-D observations. For the iGibson Sim2Real Challenge 2020, the specific task is point navigation. 73 high-quality Gibson 3D scenes are used for training, while the Castro scene, the reconstruction of a real-world apartment, is used for training, development and testing. There are three scenarios: the environment is free of obstacles, contains obstacles that the agent can interact with, and/or is populated with other moving agents. For the Habitat Challenge 2020, there are both point navigation and object navigation tasks.
Gibson 3D scenes with the Gibson dataset splits are used for the point navigation task, while 90 Matterport3D scenes with the 61/11/18 training/validation/test house splits specified by the original dataset [11], [34] are used for the object navigation task. For the RoboTHOR Challenge 2020, there is only the object navigation task. The training and evaluation are split into three phases. In the first phase, the agent is trained on 60 simulated apartments and its performance is validated on 15 other simulated apartments. In the second phase, the agent is evaluated on four simulated apartments and their real-world counterparts, to test its generalisation to the real world. In the last phase, the agent is evaluated on 10 real-world apartments. In this section, we build upon existing visual navigation survey papers [11], [23], [74] to include more recent works.

1) Categories: Point Navigation has been one of the foundational and more popular tasks [44] in recent visual navigation literature. In point navigation, an agent is tasked to navigate to any position within a certain fixed distance from a specific point [11]. Generally, the agent is initialized at the origin (0, 0, 0) in an environment, and the fixed goal point is specified by 3D coordinates (x, y, z) relative to the origin/initial location [11]. To complete the task successfully, the artificial agent would need to possess a diverse range of skillsets such as visual perception, episodic memory construction, reasoning/planning, and navigation. The agent is usually equipped with a GPS and compass that allow it to access its location coordinates, and implicitly its orientation relative to the goal position [17], [49]. The target's relative goal coordinates can either be static (i.e. given only once, at the beginning of the episode) or dynamic (i.e. given at every time-step) [17].
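As a small illustration of this setup (an assumed coordinate convention, not any particular simulator's API), a goal that is specified once in world coordinates can be re-expressed relative to the agent at every step from its GPS position and compass heading:

```python
import numpy as np

def goal_relative_to_agent(goal_xyz: np.ndarray,
                           agent_xyz: np.ndarray,
                           agent_heading_rad: float) -> np.ndarray:
    """Convert a world-frame goal into the agent's egocentric frame.

    `agent_heading_rad` is the compass heading about the vertical (y) axis.
    With a static goal this is what lets the agent keep an updated
    direction-to-goal as it moves; with a dynamic goal the simulator would
    supply the result directly at every time-step."""
    delta = goal_xyz - agent_xyz                       # world-frame offset
    cos_h, sin_h = np.cos(-agent_heading_rad), np.sin(-agent_heading_rad)
    # Rotate the horizontal (x, z) components into the agent's frame.
    x_rel = cos_h * delta[0] - sin_h * delta[2]
    z_rel = sin_h * delta[0] + cos_h * delta[2]
    return np.array([x_rel, delta[1], z_rel])
```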

More recently, in view of imperfect localization in indoor environments, the Habitat Challenge 2020 has moved on to the more challenging task [47] of RGB-D-based online localization without the GPS and compass.

There have been many learning-based approaches to point navigation in recent literature. One of the earlier works [74] uses an end-to-end approach to tackle point navigation in a realistic autonomous navigation setting (i.e. an unseen environment with no ground-truth maps and no ground-truth agent poses) with different sensory inputs. The base navigation algorithm is Direct Future Prediction (DFP) [94], where relevant inputs such as the color image, depth map and actions from the four most recent observations are processed by appropriate neural networks (e.g. convolutional networks for sensory inputs), concatenated, and passed into a two-stream fully connected action-expectation network. The outputs are the future measurement predictions for all actions and future time steps. The authors also introduce the Belief DFP (BDFP), which is intended to make the DFP's black-box policy more interpretable by introducing an intermediate map-like representation in future measurement prediction. This is inspired by the attention mechanism in neural networks, and by successor representations [95], [96] and successor features [97] in reinforcement learning. Experiments show that the BDFP outperforms the DFP in most cases, and that classic navigation approaches generally outperform learning-based ones with RGB-D inputs. [98] provides a more modular approach: SplitNet's architecture consists of one visual encoder and multiple decoders for different auxiliary tasks (e.g. egomotion prediction) and the policy. These decoders aim to learn meaningful representations. With the same PPO algorithm [99] and behavioral cloning training, SplitNet can outperform comparable end-to-end methods in previously unseen environments.

Another work presents a modular architecture for simultaneous mapping and target-driven navigation in indoor environments [48]. In this work, the authors build upon MapNet [71] to include a 2.5D memory with semantically-informed features and train an LSTM for the navigation policy. They show that this method outperforms a learned LSTM policy without a map [100] in previously unseen environments.

With the introduction of the Habitat Challenge in 2019 and its standardized evaluation, dataset and sensor setups, the more recent approaches have been evaluated with the Habitat Challenge 2019. The first work comes from the team behind Habitat, and uses the PPO algorithm, the actor-critic model structure and a CNN for producing embeddings for visual inputs. A follow-up work provides an "existence proof" that near-perfect results can be achieved for the point navigation task for agents with a GPS, a compass and a huge number of learning steps (2.5 billion steps, as compared to Habitat's first PPO work with 75 million steps) in unseen environments in simulation [47]. Specifically, the best agent's performance is within 3-5% of the shortest-path oracle. This work uses a modified PPO with Generalized Advantage Estimation [101] that is suited for distributed reinforcement learning in resource-intensive simulated environments, namely Decentralized Distributed Proximal Policy Optimization (DD-PPO). At every time-step, the agent receives an egocentric observation (depth or RGB), gets embeddings with a CNN, utilizes its GPS and compass to update the target position to be relative to its current position, and finally outputs the next action and an estimate of the value function.
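The per-step loop just described can be sketched roughly as follows; this is a hypothetical outline, with `cnn`, `policy_head` and `value_head` standing in for components that are not specified here:

```python
import torch

def navigation_step(obs_image: torch.Tensor,
                    rel_goal: torch.Tensor,
                    cnn, policy_head, value_head):
    """One decision step of an actor-critic point-navigation agent.

    `rel_goal` is a (B, 3) tensor with the goal expressed relative to the
    agent's current pose, obtained from its GPS and compass (see the earlier
    coordinate sketch). The policy head gives the next-action distribution
    and the value head gives the critic's estimate of the value function."""
    visual_embedding = cnn(obs_image)                       # (B, D) CNN features
    features = torch.cat([visual_embedding, rel_goal], dim=-1)
    action_logits = policy_head(features)                   # next action
    value_estimate = value_head(features)                   # value estimate
    return action_logits, value_estimate
```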

The experiments show that the agents continue to improve for a long time, and the results nearly match those of a shortest-path oracle. The next work aims to improve on this resource-intensive work by increasing sample and time efficiency with auxiliary tasks [49]. Using the same DD-PPO baseline architecture from the previous work, this work adds three auxiliary tasks: action-conditional contrastive predictive coding (CPC|A) [102], inverse dynamics [68] and temporal distance estimation. The authors experiment with different ways of combining the representations. At 40 million frames, the best performing agent reaches the previous work's performance 5.5X faster and even improves on it.

The winner of the Habitat Challenge 2019 for both the RGB and the RGB-D tracks [44] provides a hybrid solution that combines both classic and learning-based approaches, since end-to-end learning-based approaches are computationally expensive. This work incorporates learning in a modular fashion into a "classic navigation pipeline," thus implicitly incorporating the knowledge of obstacle avoidance and control in low-level navigation. The architecture consists of a learned Neural SLAM module, a global policy, a local policy and an analytical path planner. The Neural SLAM module predicts a map and an agent pose estimate using observations and sensors. The global policy always outputs the target coordinates as the long-term goal, which is converted to a short-term goal using the analytic path planner. Finally, a local policy is trained to navigate to this short-term goal. The modular design and the use of analytical planning help to reduce the search space during training significantly.

Object Navigation is one of the most straightforward, yet most challenging, tasks in embodied AI. Object navigation focuses on the fundamental idea of navigating to an object specified by its label in an unexplored environment [38]. The agent is initialized at a random position and is tasked to find an instance of an object category within that environment. Object navigation is generally more complex than point navigation, since it not only requires many of the same skillsets, such as visual perception and episodic memory construction, but also semantic understanding. These are what make the object navigation task much more challenging, but also more rewarding to solve.

The task of object navigation can be demonstrated or learnt through adaptation, which helps to generalize navigation in an environment without any direct supervision. One work [51] achieves this through a meta-reinforcement learning approach, in which the agent learns a self-supervised interaction loss that helps to encourage effective navigation. Unlike conventional navigation approaches, in which the agents freeze the learning model during inference, this work allows the agent to adapt itself in a self-supervised manner and to adjust or correct its mistakes afterwards. This approach prevents an agent from making too many mistakes before realizing them and making the necessary corrections. Another method is to learn the relationships between objects before executing the planning of navigation. This work [53] implements an object relation graph (ORG), which is not built from external prior knowledge but rather is a knowledge graph built during the visual exploration phase. The graph consists of object relationships such as category closeness and spatial correlations.

Navigation with Priors focuses on the idea of injecting semantic knowledge or priors, in the form of multimodal inputs such as knowledge graphs or audio input, to aid the training of navigation tasks for embodied AI agents in both seen and unseen environments. Past work [57] that integrates human priors of knowledge into a deep reinforcement learning framework has shown that an artificial agent can tap into human-like semantic/functional priors to aid it in learning to navigate and find unseen objects in an unseen environment. For example, when looking for an item of interest such as an apple in the kitchen, humans tend to search in logical locations. This knowledge is encoded in a graph network and trained upon in a deep reinforcement learning framework. Other examples use human priors such as humans' ability to perceive and capture correspondences between an audio signal and the physical location of objects, and hence to navigate to the source of the signal. In this work [103], artificial agents pick up multiple sensory observations, such as vision and the sound signal of the target objects, and figure out the shortest trajectory to navigate from their starting location to the source of the sound. This work achieves it through a visual perception mapper, a sound perception module and dynamic path planners.

Vision-and-Language Navigation (VLN) is a task where agents learn to navigate the environment by following natural language instructions. The challenging aspect of this task is to perceive both the visual scene and the language sequentially. VLN remains a challenging task as it requires agents to make predictions of future actions based on past actions and instructions [11]. Furthermore, agents might not be able to align their trajectories seamlessly with natural language instructions. Although vision-and-language navigation and visual question answering (VQA) might seem similar, there are major differences between the two tasks. Both tasks can be formulated as visually grounded, sequence-to-sequence transcoding problems. However, VLN sequences are much longer, and VLN requires vision data to be constantly fed as input together with the ability to manipulate camera viewpoints, as compared to VQA, where a single input question is fed in and an answer is generated. We are now able to give a natural language instruction to a robot and expect it to perform the task [2], [3], [58]. This is achieved with the advancement of recurrent neural network methods [58] for joint interpretation of both visual and natural language inputs, and of datasets that are designed to simplify the processes of task-based instruction in navigation and the performing of tasks in the 3D environment.

One approach for VLN is the Auxiliary Reasoning Navigation framework [59]. It tackles four auxiliary reasoning tasks: trajectory retelling, progress estimation, angle prediction and cross-modal matching. The agent learns to reason about its previous actions and to predict future information through these auxiliary tasks.

Vision-dialog navigation is the latest extension of VLN, as it aims to train an agent to develop the ability to engage in a constant natural language conversation with humans to aid its navigation. The current work [60] in this area uses a Cross-modal Memory Network (CMN) that remembers and understands useful information related to past navigation actions through separate language memory and visual memory modules, and further uses it to make decisions for navigation.

2) Evaluation Metrics: Visual navigation uses (1) success weighted by (normalized inverse) path length (SPL) and (2) success rate as the main evaluation metrics [11]. Success weighted by path length can be defined as SPL = (1/N) * Σ_{i=1}^{N} S_i * l_i / max(p_i, l_i), where S_i is a success indicator for episode i, p_i is the agent's path length, l_i is the shortest path length and N is the number of episodes. It is noteworthy that there are some known issues with success weighted by path length [38].
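For concreteness, a direct translation of the SPL definition into code (assumed per-episode inputs; illustrative only):

```python
from typing import Sequence

def spl(successes: Sequence[bool],
        agent_path_lengths: Sequence[float],
        shortest_path_lengths: Sequence[float]) -> float:
    """Success weighted by (normalized inverse) path length.

    For each episode i the term is S_i * l_i / max(p_i, l_i), averaged over
    N episodes, where S_i indicates success, p_i is the agent's path length
    and l_i is the shortest path length from start to goal."""
    n = len(successes)
    total = 0.0
    for s_i, p_i, l_i in zip(successes, agent_path_lengths, shortest_path_lengths):
        if s_i:
            total += l_i / max(p_i, l_i)
    return total / n if n > 0 else 0.0
```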

Success rate is the fraction of the episodes in which the agent reaches the goal within the time budget [74]. There are also other, less common evaluation metrics [11], [48], [52], [54], [74] in addition to the two mentioned, namely: (3) path length ratio, which is the ratio between the predicted path and the shortest path length and is calculated only for successful episodes; (4) distance to success / navigation error, which measures the distance between the agent's final position and the success threshold boundary around the nearest object or the goal location, respectively.

Besides the above four metrics, there are another two metrics used to evaluate VLN agents. They are: (1) oracle success rate, the rate at which the agent stops at the closest point to the goal along its trajectory; (2) trajectory length. In general, for VLN tasks, the best metric is still SPL, as it takes into account the path taken and not just the goal. For vision-dialog navigation, in addition to success rate and oracle success rate, there are another two metrics used: (1) goal progress, the average agent progress towards the goal location; (2) oracle path success rate, the success rate of the agent stopping at the closest point to the goal along the shortest path.

3) Datasets: As in visual exploration, Matterport3D and Gibson V1 are the most popular datasets. It is noteworthy that the scenes in Gibson V1 are smaller and usually have shorter episodes (lower GDSP, i.e. geodesic distance from start position to goal position). The AI2-THOR simulator/dataset is also used. Unlike the rest of the visual navigation tasks, VLN requires a different kind of dataset. Most of the VLN works use the Room-to-Room (R2R) dataset with the Matterport3D Simulator [104]. It consists of 21,567 navigation instructions with an average length of 29 words. In vision-dialog navigation [59], the Cooperative Vision-and-Dialog Navigation (CVDN) [105] dataset is used. It comprises 2,050 human-to-human dialogs and over 7,000 trajectories within the Matterport3D Simulator.

C. Embodied Question Answering

The task of embodied question answering (QA) in recent embodied AI simulators has been a significant advancement in the field of general-purpose intelligence systems. To perform QA in a state of physical embodiment, an AI agent would need to possess a wide range of AI capabilities such as visual recognition, language understanding, question answering, commonsense reasoning, task planning, and goal-driven navigation. Hence, embodied QA can be considered the most onerous and complicated task in embodied AI research currently.

1) Categories: For embodied QA (EQA), a common framework divides the task into two sub-tasks: a navigation task and a QA task. The navigation module is essential since the agent needs to explore the environment to see the objects before answering questions about them. For example, [61] proposed the Planner-Controller Navigation Module (PACMAN), which comprises a hierarchical structure for the navigation module, with a planner that selects actions (directions) and a controller that decides how far to move following each action. Once the agent decides to stop, the QA module is executed using the sequence of frames along its path. The navigation module and the visual question answering module are first trained individually and then jointly trained with REINFORCE [106]. [62] and [63] further improved the PACMAN model with Neural Modular Control (NMC), where a higher-level master policy proposes semantic sub-goals to be executed by sub-policies.

Multi-target embodied QA (MT-EQA) [63] is a more complex embodied QA task, which studies questions that have multiple targets in them, e.g. "Is the apple in the bedroom bigger than the orange in the living room?", such that the agent has to navigate to the "bedroom" and the "living room" to localize the "apple" and the "orange" and then perform comparisons to answer the questions.

Interactive Question Answering (IQA) [64] is another work tackling the task of embodied QA, in the AI2-THOR environment. IQA is an extension of EQA because it is essential for the agent to interact with the objects to answer certain questions successfully (e.g. the agent needs to open the refrigerator to answer the existence question "Is there an egg in the fridge?"). [64] proposed using a Hierarchical Interactive Memory Network (HIMN), which is a hierarchy of controllers that helps the system operate, learn and reason across multiple time scales, while simultaneously reducing the complexity of each sub-task. An Egocentric Spatial Gated Recurrent Unit (GRU) acts as a memory unit for retaining spatial and semantic information about the environment. The planner module has control over the other modules, such as a navigator, which runs an A* search to find the shortest path to the goal; a scanner, which performs rotation to detect new images; a manipulator, which is invoked to carry out actions that change the state of the environment; and lastly an answerer, which answers the question posed to the AI agent. [65] studied IQA from a multi-agent perspective, where several agents explore an interactive scene jointly to answer a question. [65] proposed multi-layer structural and semantic memories as scene memories to be shared by multiple agents to first reconstruct the 3D scenes and then perform QA.
2) Evaluation Metrics: Embodied QA and IQA involve two sub-tasks, 1) navigation and 2) question answering, and these two sub-tasks are evaluated with different metrics.

Navigation performance is evaluated by: (1) distance to target at navigation termination, i.e. navigation error (d_T); (2) change in distance to target from the initial to the final position, i.e. goal progress (d_Δ); (3) smallest distance to target at any point in the episode (d_min); (4) percentage of episodes in which the agent terminates navigation for answering before reaching the maximum episode length (%stop); (5) percentage of questions for which the agent terminates in the room containing the target object (%r_T); (6) percentage of questions for which the agent enters the room containing the target object at least once (%r_e); (7) Intersection over Union for the target object (IoU); (8) hit accuracy based on IoU (h_T); (9) episode length, i.e. trajectory length. Metrics (1), (2) and (9) are also used as evaluation metrics for the visual navigation task.

QA performance is evaluated by: (1) mean rank (MR) of the ground-truth answer among the predictions; (2) accuracy.
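As a concrete reference, the minimal sketch below computes the distance- and room-based navigation metrics for a single episode, plus the two QA metrics. The argument names and data layout are assumptions, and the IoU-based metrics (7)-(8) are omitted.

```python
import numpy as np

def navigation_metrics(dists_to_target, stopped, target_room_flags, max_len):
    """Per-episode navigation metrics; `dists_to_target` holds geodesic distances to the
    target at each step (index 0 = start), `target_room_flags` marks whether the agent
    is in the target room at each step. Both inputs are illustrative assumptions."""
    d_T    = dists_to_target[-1]                        # navigation error at termination
    d_del  = dists_to_target[0] - dists_to_target[-1]   # goal progress
    d_min  = min(dists_to_target)                       # closest approach in the episode
    stop   = stopped and len(dists_to_target) - 1 < max_len  # terminated before the limit
    r_T    = target_room_flags[-1]                      # ends in the room with the target
    r_e    = any(target_room_flags)                     # ever enters that room
    ep_len = len(dists_to_target) - 1                   # trajectory length
    return dict(d_T=d_T, d_delta=d_del, d_min=d_min, stop=stop,
                r_T=r_T, r_e=r_e, episode_length=ep_len)

def qa_metrics(answer_scores, gt_index):
    """Mean rank of the ground-truth answer and top-1 accuracy for one question."""
    order = np.argsort(-np.asarray(answer_scores))      # best-scoring answer first
    rank = int(np.where(order == gt_index)[0][0]) + 1
    return dict(mean_rank=rank, accuracy=float(order[0] == gt_index))

print(navigation_metrics([5.0, 3.2, 1.1, 0.4], stopped=True,
                         target_room_flags=[False, False, True, True], max_len=100))
print(qa_metrics([0.1, 0.7, 0.2], gt_index=1))
```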

3) Datasets: The EQA [61] dataset is based on House3D, a subset of the popular SUNCG [33] dataset with synthesized rooms and layouts, similar to the Replica dataset [107]. House3D converts SUNCG's static environment into a virtual environment in which the agent can navigate under physical constraints (e.g. it cannot pass through walls or objects). To test the agent's capabilities in language grounding, commonsense reasoning and navigation, [61] uses a series of functional programs in CLEVR [108] to synthesize questions and answers regarding objects and their properties (e.g. color, existence, location and relative preposition). In total, there are 5,000 questions in 750 environments, referring to 45 unique objects in 7 unique room types.

For MT-EQA [63], the authors introduce the MT-EQA dataset, which contains 6 types of compositional questions that compare object attribute properties (color, size, distance) between multiple targets (objects/rooms).

For IQA [64], the authors annotated a large-scale dataset, IQUAD V1, which consists of 75,000 multiple-choice questions. Similar to the EQA dataset, IQUAD V1 has questions regarding object existence, counting and spatial relationships.
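To illustrate the flavour of this functional-program question synthesis, here is a toy sketch. The scene annotation, templates, and program names are invented for illustration and are not the CLEVR or EQA generation code.

```python
# Hypothetical scene annotation: object -> (room, color)
scene = {"apple": ("kitchen", "red"), "mug": ("living room", "blue")}

def exist_in_room(obj, room):
    """query_exist(object, room): functional program behind an existence question."""
    return "yes" if obj in scene and scene[obj][0] == room else "no"

def color_of(obj):
    """query_color(object): functional program behind a color question."""
    return scene[obj][1]

templates = [
    ("Is there a {obj} in the {room}?", lambda obj, room: exist_in_room(obj, room)),
    ("What color is the {obj}?",        lambda obj, room: color_of(obj)),
]

# Instantiate each template for every annotated object to obtain (question, answer) pairs.
for obj, (room, _) in scene.items():
    for text, program in templates:
        question = text.format(obj=obj, room=room)
        print(question, "->", program(obj, room))
```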

IV. INSIGHTS AND CHALLENGES

A. Insights Into Embodied AI

The interconnections in Fig. 6 reflect the suitability of simulators for research tasks. Based on Fig. 6, both Habitat-Sim and iGibson support research tasks in visual exploration and a range of visual navigation tasks, indicating the importance of high fidelity, which comes from world-based scene simulators. However, because their distinct features make them preferable for standalone, non-embodied-AI tasks such as deep reinforcement learning, some simulators do not presently connect to any of the embodied research tasks. Nonetheless, they still meet the criteria for being classified as embodied AI simulators. On the contrary, research tasks such as embodied question answering and visual navigation with priors require the embodied AI simulators to provide the multiple-state object property, due to the interactive nature of these tasks; hence AI2-THOR is undoubtedly the simulator of choice. Lastly, VLN is the only research task that currently does not utilize any of the nine embodied AI simulators, relying instead on the Matterport3D Simulator [104]. This is because previous works in VLN do not require interactivity from the simulator, so the Matterport3D Simulator suffices. However, as VLN tasks advance, we can expect a growing need for interactions in VLN and hence a need for embodied AI simulators. Furthermore, unlike traditional reinforcement learning simulation environments [41], [109], which focus on task-specific training, embodied AI simulators provide a training environment for a wide range of different tasks akin to those undertaken in the physical world.

Furthermore, based on the survey of embodied AI research tasks in Section III, we propose a pyramid structure in which each embodied AI research task contributes to the next. Visual exploration, for example, aids in the development of visual navigation, and visual navigation contributes to the creation of embodied QA. This build-up also correlates with the increasing complexity of the tasks. Based on foreseeable trends in embodied AI research, we hypothesize that a next advancement in the pyramid of embodied AI research is Task-based Interactive Question Answering (TIQA), which aims to integrate tasks with answering specific questions. For example, such questions could be "How long would it take for an egg to boil?" or "Is there an apple in the cabinet?". These are questions that cannot be answered through the conventional approaches [61], [64]; they require the embodied agent to perform specific tasks related to the questions in order to unlock new insights that are essential for answering them. The TIQA agents that we hypothesize can perform an array of general household tasks, which allows them to extract useful environmental information that is crucial for deriving the answers to the QA questions. TIQA may hold the key to generalizing task planning and developing general-purpose AI in simulation, which can later be deployed into the real world.

B. Challenges in Embodied AI Simulators

Current embodied AI simulators have reached a level of functionality and fidelity that sets them apart from the conventional simulators used for reinforcement learning. Even with this growing variety of embodied AI simulators, several challenges remain, in areas ranging from realism and scalability to interactivity.

1) Realism: Realism concerns the fidelity and physics features of the simulators. Simulators with both high visual fidelity and realistic physics are highly sought after by the robotics community, as they provide an ideal test-bed for various robotic tasks such as navigation and interaction [110], [111]. However, there is a lack of embodied AI simulators that possess both world-based scenes and advanced physics.

For fidelity, simulators built on world-based scenes will undoubtedly outperform game-based scene simulators in simulation-to-real tasks [27], [112]. Despite this observation, only Habitat-Sim [17] and iGibson [18] are world-based scene simulators. This paucity of world-based scene simulators is the bottleneck for simulation-to-real tasks for embodied AI agents, which further hinders the transfer of embodied AI research into real-world deployment. For physics, progress on physics-based predictive models [113]-[115] has accentuated the importance of embodied AI simulators with advanced physics features, since they serve as an ideal testbed for training embodied AI agents to perform tasks with sophisticated physical interactions [2], [3], [116]. Despite the need for an advanced physics-based embodied AI simulator, there is currently only one simulator, ThreeDWorld [20], that fits this criterion. Hence, there is a severe lack of embodied AI simulators with advanced physics features such as cloth, fluid and soft-body physics. We believe that advances in 3D reconstruction techniques and physics engines [117]-[119] will improve the realism of embodied AI.

2) Scalability: Unlike image-based datasets [7], [120], which can easily be obtained from crowd-sourcing or the internet, the methodologies and tools for collecting large-scale world-based 3D scene datasets and 3D object assets [107], [121], [122] are scarce. These 3D scene datasets are crucial for constructing a diverse range of embodied AI simulators. Current approaches to collecting realistic 3D scene datasets require scanning the physical room through photogrammetry [123], using tools such as the Matterport 3D scanner, Meshroom [124], or even mobile 3D scanning applications. However, these are not commercially viable for collecting large-scale 3D object and scene scans, largely because the 3D scanners used for photogrammetry are costly and not widely accessible. As such, the bottleneck to scalability lies in developing tools for the large-scale collection of high-fidelity 3D object or scene scans. Hopefully, with the further advancement of 3D learning-based approaches [125], [126] that aim to render 3D object meshes from a single image or a few images, or even through scene generation approaches [127], we will be able to scale up the collection of large-scale 3D datasets.

3) Interactivity: The ability to have fine-grained manipulative interactions with functional objects in embodied AI simulators is crucial for replicating human-level interactions with real-world objects [128]. Most game-based scene simulators [13], [16], [19], [20] provide both fine-grained object manipulation capabilities and symbolic interaction capabilities (e.g. discrete actions or a simple point-and-select interface). However, due to the nature of game-based scene simulators, many research tasks performed in these environments opt for the symbolic interaction capabilities rather than fine-grained object manipulation [3], except for a few that utilize both [2], [129]. On the other hand, agents in world-based scene simulators [17], [18] possess gross motor control instead of symbolic interaction capabilities. However, the objects within these simulators are largely only interactable at the surface, which allows for gross motor control but lacks multi-state object classes, i.e. the number of state changes an object can undergo. Hence, there is a need to strike a balance between the functionality captured in the object properties and the complexity of actions that the embodied AI agent can perform in the environment.

Undoubtedly, mainstream simulators such as AI2-THOR [13], iGibson [18], and Habitat-Sim [17] provide excellent environments for advancing the respective embodied AI research. However, they have their own strengths and limitations to be overcome. With developments in computer graphics and computer vision, and the introduction of innovative real-world datasets, real-to-sim domain adaptation is one of the clear routes for improving embodied AI simulators. The concept of real-to-sim revolves around capturing real-world information such as tactile perception [130], human-level motor control [131] and audio inputs [132] in addition to visual sensory inputs, and integrating them to develop more realistic embodied AI simulators that can effectively bridge the physical and virtual worlds.

C. Challenges in Embodied AI Research

Embodied AI research tasks mark an increase in complexity from "internet AI" to autonomous embodied learning agents in 3D simulated environments with multiple sensor modalities and potentially long trajectories [22], [34]. This has made the memory and internal representations of the agent extremely important [11], [22], [56]. Long trajectories and multiple input types also signify the importance of a robust memory architecture that allows the agent to focus on the important parts of its environment. In recent years, many different types of memory have been used, such as recurrent neural networks [47], [49], [51], [56], [58], [61]-[63], attention-based memory architectures [52], [60], [72], anticipated occupancy maps [45], occupancy maps [22] and semantic maps [43], [46], [48], [64], [65], with some papers placing overwhelming emphasis on the novelty of their memory architectures [22], [45], [60], [72]. However, while recurrent neural networks are known to be limited in capturing long-term dependencies in embodied AI [56], [72], it is currently still hard to agree on which memory type(s) are better [11], due to the lack of work focusing on memory architectures.
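For context, attention-based memories typically store per-step observation embeddings and read them out with scaled dot-product attention. The toy numpy sketch below shows such a readout under assumed shapes; it is not drawn from any of the cited architectures.

```python
import numpy as np

def attention_readout(memory, query):
    """Single-head scaled dot-product attention over a bank of stored observation
    embeddings (rows of `memory`); a minimal stand-in for attention-based memories."""
    scores = memory @ query / np.sqrt(query.shape[0])  # similarity to each memory slot
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # softmax over memory slots
    return weights @ memory                            # weighted summary used by the policy

rng = np.random.default_rng(0)
memory = rng.normal(size=(8, 16))   # 8 past observations, 16-dim embeddings
query = rng.normal(size=16)         # current observation embedding
print(attention_readout(memory, query).shape)
```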

Among embodied AI research tasks there has also been an increase in complexity, as seen in the progression from visual exploration to VLN and embodied QA, where new components like language understanding and QA are added respectively. Each new component leads to exponentially harder and longer training of AI agents, especially since current approaches are often fully learning-based. This has led to two promising directions for reducing the search space and sample complexity while improving robustness: hybrid approaches combining classic and learning-based algorithms [44], [74], and the incorporation of prior knowledge [23], [57]. Furthermore, ablation studies are much harder to manage [31] for more complex tasks, since each new component in embodied AI makes it much harder to test for its contribution to the agent's performance, as it is added on top of an existing set of components, and embodied AI simulators vary significantly in features and issues. This is compounded by the fact that research tasks have also increased rapidly in number. As a result, while some fundamental tasks like visual exploration have received more attention and thus have more approaches tackling them, newer and more niche tasks like MT-EQA are much less addressed. New tasks usually introduce new considerations in important aspects like methods, evaluation metrics [22], input types and model components, as shown in Table III, thus requiring even more evaluation than simpler tasks like visual exploration.

Lastly, there is a lack of focus on multi-agent set-ups, which contribute useful new tasks [65]. This lack of focus can be attributed to the absence of simulators with multi-agent features until recently. Multi-agent systems for collaboration and communication are prevalent in the real world [133], [134] but currently receive relatively little attention [31]. With a recent increase in simulators with multi-agent features [13], [20], [55], it remains to be seen whether the multi-agent support (e.g. support for multi-agent algorithms) is sufficient.

V. CONCLUSION

Recent advances in embodied AI simulators have been a key driver of progress in embodied AI research. Aiming to understand the trends and gaps in embodied AI simulators and research, this paper provides a contemporary and comprehensive overview of embodied AI simulators and research. The paper surveys nine embodied AI simulators and their connections in serving and driving recent innovations in research tasks for embodied AI. By benchmarking the nine embodied AI simulators in terms of seven features, we seek to understand their provision of realism, scalability and interactivity, and hence their use in embodied AI research. The three main tasks supporting the pyramid of embodied AI research, namely visual exploration, visual navigation and embodied QA, are examined in terms of their approaches, evaluation metrics and datasets, in order to review and benchmark the existing approaches to these categories of embodied AI research tasks in the various embodied AI simulators. Furthermore, this paper unveils insightful relations between the simulators, datasets and research tasks. With the aid of this paper, AI researchers new to this field will be able to select the most suitable embodied AI simulators for their research tasks and contribute to advancing the field of embodied AI.

REFERENCES
[1] L. Smith and M. Gasser, "The development of embodied cognition: Six lessons from babies," Artif. Life, vol. 11, no. 1/2, pp. 13-29, 2005.
[2] J. Duan, S. Yu, H. L. Tan, and C. Tan, "Actionet: An interactive end-to-end platform for task-based data collection and augmentation in 3D environment," in Proc. IEEE Int. Conf. Image Process., 2020, pp. 1566-1570.
[3] M. Shridhar et al., "ALFRED: A benchmark for interpreting grounded instructions for everyday tasks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10737-10746.
[4] R. Pfeifer and F. Iida, "Embodied artificial intelligence: Trends and challenges," in Embodied Artificial Intelligence. Berlin, Heidelberg: Springer, 2004, pp. 1-26.
[5] J. Haugeland, Artificial Intelligence: The Very Idea. Cambridge, MA, USA: MIT Press, 1985.
[6] R. Pfeifer and J. C. Bongard, How the Body Shapes the Way We Think - A New View on Intelligence. MIT Press, 2006.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248-255.
[8] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.
[9] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770-778.
[10] D. Silver et al., "Mastering the game of go without human knowledge," Nature, vol. 550, no. 7676, pp. 354-359, 2017.
[11] P. Anderson et al., "On evaluation of embodied navigation agents," 2018, arXiv:1807.06757.
[12] C. Beattie et al., "Deepmind lab," 2016, arXiv:1612.03801.
[13] E. Kolve et al., "Ai2-thor: An interactive 3D environment for visual AI," 2017, arXiv:1712.05474.
[14] C. Yan, D. Misra, A. Bennett, A. Walsman, Y. Bisk, and Y. Artzi, "Chalet: Cornell house agent learning environment," 2018, arXiv:1801.07357.
[15] X. Puig et al., "Virtualhome: Simulating household activities via programs," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8494-8502.
[16] X. Gao, R. Gong, T. Shu, X. Xie, S. Wang, and S.-C. Zhu, "Vrkitchen: An interactive 3D virtual environment for task-oriented learning," 2019, arXiv:1903.05757.
[17] M. Savva et al., "Habitat: A platform for embodied AI research," in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 9339-9347.
[18] F. Xia et al., "Interactive Gibson benchmark: A benchmark for interactive navigation in cluttered environments," IEEE Robot. Automat. Lett., vol. 5, no. 2, pp. 713-720, Apr. 2020.
[19] F. Xiang et al., "SAPIEN: A simulated part-based interactive environment," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 11097-11107.
[20] C. Gan et al., "Threedworld: A platform for interactive multi-modal physical simulation," 2020, arXiv:2007.04954.
[21] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, "The arcade learning environment: An evaluation platform for general agents," J. Artif. Intell. Res., vol. 47, pp. 253-279, 2013.

[22] S. K. Ramakrishnan, D. Jayaraman, and K. Grauman, "An exploration of embodied visual exploration," Int. J. Comput. Vis., vol. 129, no. 5, pp. 1616-1649, 2021. [Online]. Available: https://doi.org/10.1007/s11263-021-01437-z
[23] X. Ye and Y. Yang, "From seeing to moving: A survey on learning for visual indoor navigation (VIN)," 2020, arXiv:2002.11310.
[24] T. Chen, S. Gupta, and A. Gupta, "Learning exploration policies for navigation," 2019, arXiv:1903.01959.
[25] N. Savinov, A. Dosovitskiy, and V. Koltun, "Semi-parametric topological memory for navigation," in Proc. Int. Conf. Learn. Representations, 2018, arXiv:1803.00653.
[26] E. Beeching, J. Dibangoye, O. Simonin, and C. Wolf, "Learning to plan with uncertain topological maps," 2020, arXiv:2007.05270.
[27] A. Kadian et al., "Sim2real predictivity: Does evaluation in simulation predict real-world performance," IEEE Robot. Automat. Lett., vol. 5, no. 4, pp. 6670-6677, Oct. 2020.
[28] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, "Sim-to-real transfer of robotic control with dynamics randomization," in Proc. IEEE Int. Conf. Robot. Automat., 2018, pp. 1-8.
[29] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, "Domain randomization for transferring deep neural networks from simulation to the real world," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2017, pp. 23-30.
[30] CVPR, "Embodied AI workshop," Jun. 2021, Accessed: 13 Jan. 2022. [Online]. Available: https://embodied-ai.org/
[31] L. Weihs et al., "Allenact: A framework for embodied AI research," 2020, arXiv:2008.12760.
[32] K. Mo et al., "Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 909-918.
[33] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser, "Semantic scene completion from a single depth image," in Proc. 30th IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 190-198.
[34] A. Chang et al., "Matterport3D: Learning from RGB-D data in indoor environments," in Proc. Int. Conf. 3D Vis., 2017, pp. 667-676.
[35] F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese, "Gibson env: Real-world perception for embodied agents," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 9068-9079.
[36] A. Kadian et al., "Are we making real progress in simulated environments? Measuring the sim2real gap in embodied visual navigation," 2019, arXiv:1912.06321.
[37] M. Deitke et al., "Robothor: An open simulation-to-real embodied AI platform," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 3164-3174.
[38] D. Batra et al., "Objectnav revisited: On evaluation of embodied agents navigating to objects," 2020, arXiv:2006.13171.
[39] U. Jain et al., "A cordial sync: Going beyond marginal policies for multi-agent embodied tasks," in European Conference on Computer Vision, Cham, Switzerland: Springer, 2020, pp. 471-490.

[40] U. Jain et al., "Two body problem: Collaborative visual task completion," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 6682-6692.
[41] G. Brockman et al., "OpenAI gym," 2016, arXiv:1606.01540.
[42] U. Jain et al., "Two body problem: Collaborative visual task completion," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 6682-6692.
[43] D. S. Chaplot, H. Jiang, S. Gupta, and A. Gupta, "Semantic curiosity for active visual learning," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 309-326.
[44] D. S. Chaplot, D. Gandhi, S. Gupta, A. Gupta, and R. Salakhutdinov, "Learning to explore using active neural SLAM," in Proc. Int. Conf. Learn. Representations, 2020.
[45] S. K. Ramakrishnan, Z. Al-Halah, and K. Grauman, "Occupancy anticipation for efficient exploration and navigation," 2020, arXiv:2008.09285.
[46] M. Narasimhan et al., "Seeing the un-scene: Learning amodal semantic maps for room navigation," in Proc. Eur. Conf. Comput. Vis., Cham, Switzerland, Springer, 2020, pp. 513-529.
[47] E. Wijmans et al., "DD-PPO: Learning near-perfect pointgoal navigators from 2.5 billion frames," 2019, arXiv:1911.00357.
[48] G. Georgakis, Y. Li, and J. Kosecka, "Simultaneous mapping and target driven navigation," 2019, arXiv:1911.07980.
[49] J. Ye, D. Batra, E. Wijmans, and A. Das, "Auxiliary tasks speed up learning pointgoal navigation," 2020, arXiv:2007.04561.
[50] C. Pérez-D'Arpino, C. Liu, P. Goebel, R. Martín-Martín, and S. Savarese, "Robot navigation in constrained pedestrian environments using reinforcement learning," 2020, arXiv:2010.08600.
[51] M. Wortsman, K. Ehsani, M. Rastegari, A. Farhadi, and R. Mottaghi, "Learning to learn how to learn: Self-adaptive visual navigation using meta-learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 6743-6752.
[52] T. Campari, P. Eccher, L. Serafini, and L. Ballan, "Exploiting scene-specific features for object goal navigation," in Proc. Eur. Conf. Comput. Vis. Workshops, 2020, pp. 406-421.

[53] H. Du, X. Yu, and L. Zheng, "Learning object relation graph and tentative policy for visual navigation," in Proc. Eur. Conf. Comput. Vis., Cham, Switzerland, Springer, 2020, pp. 19-34.
[54] D. S. Chaplot, D. Gandhi, A. Gupta, and R. Salakhutdinov, "Object goal navigation using goal-oriented semantic exploration," 2020, arXiv:2007.00643.
[55] B. Shen et al., "iGibson, A simulation environment for interactive tasks in large realistic scenes," Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2021.
[56] A. Wahid, A. Stone, K. Chen, B. Ichter, and A. Toshev, "Learning object-conditioned exploration using distributed soft actor critic," 2020, arXiv:2007.14545.
[57] W. Yang, X. Wang, A. Farhadi, A. Gupta, and R. Mottaghi, "Visual semantic navigation using scene priors," 2018, arXiv:1810.06543.
[58] P. Anderson et al., "Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3674-3683.
[59] F. Zhu, Y. Zhu, X. Chang, and X. Liang, "Vision-language navigation with self-supervised auxiliary reasoning tasks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10009-10019.
[60] Y. Zhu et al., "Vision-dialog navigation by exploring cross-modal memory," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10727-10736.
[61] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra, "Embodied question answering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2018, pp. 2054-2063.
[62] A. Das, G. Gkioxari, S. Lee, D. Parikh, and D. Batra, "Neural modular control for embodied question answering," in Proc. Conf. Robot Learn., 2018, pp. 53-62.
[63] L. Yu, X. Chen, G. Gkioxari, M. Bansal, T. L. Berg, and D. Batra, "Multi-target embodied question answering," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Long Beach, CA, USA, 2019, pp. 6302-6311.
[64] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi, "IQA: Visual question answering in interactive environments," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4089-4098.
[65] S. Tan, W. Xiang, H. Liu, D. Guo, and F. Sun, "Multi-agent embodied question answering in interactive environments," in Proc. Comput. Vis. ECCV: 16th Eur. Conf., Glasgow, UK, 2020, pp. 663-678.
[66] P. D. Nguyen, Y. K. Georgie, E. Kayhan, M. Eppe, V. V. Hafner, and S. Wermter, "Sensorimotor representation learning for an "active self" in robots: A model survey," KI - Künstliche Intelligenz, vol. 35, no. 1, pp. 9-35, 2021.
[67] C. Chen et al., "Soundspaces: Audio-visual navigation in 3D environments," in ECCV, Glasgow, U.K., Springer, 2020, pp. 17-36.

[68] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, "Curiosity-driven exploration by self-supervised prediction," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2017, pp. 16-17.
[69] S. Gupta, D. Fouhey, S. Levine, and J. Malik, "Unifying map and landmark based representations for visual navigation," 2017, arXiv:1712.08125.
[70] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik, "Cognitive mapping and planning for visual navigation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 7272-7281.
[71] J. F. Henriques and A. Vedaldi, "MapNet: An allocentric spatial memory for mapping environments," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8476-8484.
[72] K. Fang, A. Toshev, L. Fei-Fei, and S. Savarese, "Scene memory transformer for embodied agents in long-horizon tasks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 538-547.
[73] L. Mezghani, S. Sukhbaatar, A. Szlam, A. Joulin, and P. Bojanowski, "Learning to visually navigate in photorealistic environments without any supervision," 2020, arXiv:2004.04954.
[74] D. Mishkin, A. Dosovitskiy, and V. Koltun, "Benchmarking classic and learned navigation in complex 3D environments," 2019, arXiv:1901.10915.
[75] C. Cadena et al., "Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age," IEEE Trans. Robot., vol. 32, no. 6, pp. 1309-1332, Dec. 2016.
[76] S. K. Ramakrishnan, D. Jayaraman, and K. Grauman, "Emergence of exploratory look-around behaviors through active observation completion," Sci. Robot., vol. 4, pp. 30, 2019.
[77] W. S. Lovejoy, "A survey of algorithmic methods for partially observed Markov decision processes," Ann. Oper. Res., vol. 28, no. 1, pp. 47-65, 1991.
[78] B. Yamauchi, "A frontier-based approach for autonomous exploration," in Proc. IEEE Int. Symp. Comput. Intell. Robot. Automat., 1997, pp. 146-151.
[79] Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A. Efros, "Large-scale study of curiosity-driven learning," 2018, arXiv:1808.04355.
[80] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel, "VIME: Variational information maximizing exploration," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 1109-1117.
[81] D. Pathak, D. Gandhi, and A. Gupta, "Self-supervised exploration via disagreement," in Proc. Int. Conf. Mach. Learn., 2019, pp. 5062-5071.
[82] Y. Burda, H. Edwards, A. Storkey, and O. Klimov, "Exploration by random network distillation," 2018, arXiv:1810.12894.
[83] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998-6008.

[84] S. K. Ramakrishnan and K. Grauman, "Sidekick policy learning for active visual exploration," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 413-430.
[85] S. Song, A. Zeng, A. X. Chang, M. Savva, S. Savarese, and T. Funkhouser, "Im2Pano3D: Extrapolating 360° structure and semantics beyond the field of view," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3847-3856.
[86] N. Savinov et al., "Episodic curiosity through reachability," 2018, arXiv:1810.02274.
[87] N. Haber, D. Mrowca, S. Wang, L. F. Fei-Fei, and D. L. Yamins, "Learning to play with intrinsically-motivated, self-aware agents," in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 8388-8399.
[88] Y. Zhu et al., "Target-driven visual navigation in indoor scenes using deep reinforcement learning," in Proc. IEEE Int. Conf. Robot. Automat., 2017, pp. 3357-3364.
[89] D. S. Chaplot, R. Salakhutdinov, A. Gupta, and S. Gupta, "Neural topological SLAM for visual navigation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 12872-12881.
[90] F. Bonin-Font, A. Ortiz, and G. Oliver, "Visual navigation for mobile robots: A survey," J. Intell. Robot. Syst., vol. 53, no. 3, pp. 263-296, 2008.
[91] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendón-Mancha, "Visual simultaneous localization and mapping: A survey," Artif. Intell. Rev., vol. 43, no. 1, pp. 55-81, 2015.
[92] L. E. Kavraki, P. Svestka, J.-C. Latombe, and M. H. Overmars, "Probabilistic roadmaps for path planning in high-dimensional configuration spaces," IEEE Trans. Robot. Automat., vol. 12, no. 4, pp. 566-580, Aug. 1996.
[93] S. M. LaValle and J. J. Kuffner, "Rapidly-exploring random trees: Progress and prospects," Algorithmic Comput. Robot., New Directions, vol. 5, pp. 293-308, 2001.
[94] A. Dosovitskiy and V. Koltun, "Learning to act by predicting the future," 2016, arXiv:1611.01779.
[95] P. Dayan, "Improving generalization for temporal difference learning: The successor representation," Neural Comput., vol. 5, no. 4, pp. 613-624, 1993.
[96] Y. Zhu et al., "Visual semantic planning using deep successor representations," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 483-492.
[97] A. Barreto et al., "Successor features for transfer in reinforcement learning," Adv. Neural Inf. Process. Syst., vol. 30, pp. 4055-4065, 2017.
[98] D. Gordon, A. Kadian, D. Parikh, J. Hoffman, and D. Batra, "SplitNet: Sim2Sim and task2Task transfer for embodied visual navigation," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 1022-1031.
[99] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, arXiv:1707.06347.
[100] A. Mousavian, A. Toshev, M. Fišer, J. Košecká, A. Wahid, and J. Davidson, "Visual representations for semantic target driven navigation," in Proc. Int. Conf. Robot. Automat., 2019, pp. 8846-8852.
[101] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," 2015, arXiv:1506.02438.
[102] Z. D. Guo, M. G. Azar, B. Piot, B. A. Pires, and R. Munos, "Neural predictive belief representations," 2018, arXiv:1811.06407.
[103] C. Gan, Y. Zhang, J. Wu, B. Gong, and J. B. Tenenbaum, "Look, listen, and act: Towards audio-visual embodied navigation," in Proc. IEEE Int. Conf. Robot. Automat., 2020, pp. 9701-9707.

[104] P. Anderson et al., "Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3674-3683.
[105] J. Thomason, M. Murray, M. Cakmak, and L. Zettlemoyer, "Vision-and-dialog navigation," in Proc. Conf. Robot Learn., 2019, pp. 394-406.
[106] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Mach. Learn., vol. 8, no. 3/4, pp. 229-256, 1992.
[107] J. Straub et al., "The replica dataset: A digital replica of indoor spaces," 2019, arXiv:1906.05797.
[108] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick, "CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2901-2910.
[109] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2012, pp. 5026-5033.
[110] C. Gao, J. Chen, S. Liu, L. Wang, Q. Zhang, and Q. Wu, "Room-and-object aware knowledge reasoning for remote embodied referring expression," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 3064-3073.
[111] J. Sun, D.-A. Huang, B. Lu, Y.-H. Liu, B. Zhou, and A. Garg, "PlaTe: Visually-grounded planning with transformers in procedural tasks," 2021, arXiv:2109.04869.
[112] F. Sadeghi and S. Levine, "CAD2RL: Real single-image flight without a single real image," Robot., Sci. Syst. XIII, Massachusetts Inst. Technol., Cambridge, MA, USA, Jul. 2017.
[113] D. M. Bear et al., "Physion: Evaluating physical prediction from vision in humans and machines," 2021, arXiv:2106.08261.
[114] J. Duan, S. Y. B. Jian, and C. Tan, "SPACE: A simulator for physical interactions and causal learning in 3D environments," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 2058-2063.
[115] J. Duan, S. Yu, S. Poria, B. Wen, and C. Tan, "PIP: Physical interaction prediction via mental imagery with span selection," 2021, arXiv:2109.04683.
[116] T. Nagarajan and K. Grauman, "Learning affordance landscapes for interaction exploration in 3D environments," Adv. Neural Inf. Process. Syst., vol. 33, 2020.
[117] M. Wang, Y. Deng, X. Kong, A. H. Prasad, S. Xiong, and B. Zhu, "Thin-film smoothed particle hydrodynamics fluid," 2021, arXiv:2105.07656.
[118] A. Kuznetsov, K. Mullia, Z. Xu, M. Hašan, and R. Ramamoorthi, "NeuMIP: Multi-resolution neural materials," ACM Trans. Graph., vol. 40, no. 4, pp. 1-13, 2021.
[119] S. R. Richter, H. A. AlHaija, and V. Koltun, "Enhancing photorealism enhancement," 2021, arXiv:2105.04619.
[120] T.-Y. Lin et al., "Microsoft coco: Common objects in context," in Proc. Eur. Conf. Comput. Vis., Cham, Switzerland, Springer, 2014, pp. 740-755.
[121] S. K. Ramakrishnan et al., "Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI," 2021, arXiv:2109.08238.
[122] I. Armeni, S. Sax, A. R. Zamir, and S. Savarese, "Joint 2D-3D-semantic data for indoor scene understanding," 2017, arXiv:1702.01105.

[123] K. Atkinson, "Introduction to modern photogrammetry," Wiley Online Library, New York, NY, USA, vol. 19, 2003.
[124] Alicevision, "Blender - A 3D modelling and rendering package, Alicevision," 2018. [Online]. Available: https://
[125] A. Yu, V. Ye, M. Tancik, and A. Kanazawa, "PixelNeRF: Neural radiance fields from one or few images," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 4578-4587.
[126] R. Martin-Brualla, N. Radwan, M. S. Sajjadi, J. T. Barron, A. Dosovitskiy, and D. Duckworth, "NeRF in the wild: Neural radiance fields for unconstrained photo collections," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 7210-7219.
[127] Z. Hao, A. Mallya, S. Belongie, and M.-Y. Liu, "GANcraft: Unsupervised 3D neural rendering of minecraft worlds," 2021, arXiv:2104.07659.
[128] Y. Zhu et al., "Dark, beyond deep: A paradigm shift to cognitive AI with humanlike commonsense," Engineering, vol. 6, no. 3, pp. 310-345, 2020.
[129] M. Lohmann, J. Salvador, A. Kembhavi, and R. Mottaghi, "Learning about objects by learning to interact with them," Adv. Neural Inf. Process. Syst., vol. 33, 2020.
[130] R. Bhirangi, T. Hellebrekers, C. Majidi, and A. Gupta, "Reskin: Versatile, replaceable, lasting tactile skins," in CoRL, 2021.
[131] B. Smith et al., "Constraining dense hand surface tracking with elasticity," ACM Trans. Graph., vol. 39, no. 6, pp. 1-14, 2020.
[132] C. Chen et al., "Audio-visual embodied navigation," Environment, vol. 97, p. 103, 2019.
[133] L. Panait and S. Luke, "Cooperative multi-agent learning: The state of the art," Auton. Agents Multi-Agent Syst., vol. 11, no. 3, pp. 387-434, 2005.
[134] Y.-C. Liu, J. Tian, C.-Y. Ma, N. Glaser, C.-W. Kuo, and Z. Kira, "Who2com: Collaborative perception via learnable handshake communication," in Proc. IEEE Int. Conf. Robot. Automat., 2020, pp. 6876-6883.

Jiafei Duan received the B.Eng. (Highest Distinction) degree from the School of Electrical and Electronics Engineering, Nanyang Technological University, Singapore, in 2021. He is currently a Research Engineer with the Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore. His current research interests include embodied AI and computational cognitive science.

Samson Yu received the B.Eng. degree in information systems technology and design from the Singapore University of Technology and Design, Singapore, in 2020. He is currently a Research Engineer with the Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore. His research interests include fundamental AI research and embodied AI.

Hui Li Tan received the B.Sc. degree in applied mathematics and the Ph.D. degree in electrical and computer engineering from the National University of Singapore, Singapore, in 2007 and 2017, respectively. Since 2007, she has been with the Institute for Infocomm Research, Singapore. Her current research interests include computer vision, multimodal deep learning, and incremental and federated learning.

Hongyuan Zhu received the Ph.D. degree in computer engineering from Nanyang Technological University, Singapore, in 2014. He is currently a Research Scientist with the Institute for Infocomm Research, A*STAR, Singapore. His research interests include multimedia content analysis and segmentation.

Cheston Tan received the B.Sc. (Highest Hons.) degree from the Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA, USA, and the Ph.D. degree from the Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA. He is currently a Senior Scientist with the Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore.
