A Survey of Embodied AI: From Simulators to Research Tasks

Jiafei Duan, Samson Yu, Hui Li Tan, Hongyuan Zhu, and Cheston Tan

IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE

Abstract: There has been an emerging paradigm shift from the era of "internet AI" to "embodied AI," where AI algorithms and agents no longer learn from datasets of images, videos or text curated primarily from the internet. Instead, they learn through interactions with their environments from an egocentric perception similar to humans. Consequently, there has been substantial growth in the demand for embodied AI simulators to support various embodied AI research tasks. This growing interest in embodied AI is beneficial to the greater pursuit of Artificial General Intelligence (AGI), but there has not been a contemporary and comprehensive survey of this field. This paper aims to provide an encyclopedic survey of the field of embodied AI, from its simulators to its research. By evaluating nine current embodied AI simulators with our proposed seven features, this paper aims to understand the simulators in their provision for use in embodied AI research and their limitations. Lastly, this paper surveys the three main research tasks in embodied AI, namely visual exploration, visual navigation and embodied question answering (QA), covering the state-of-the-art approaches, evaluation metrics and datasets. Finally, with the new insights revealed through surveying the field, the paper provides suggestions for simulator-for-task selections and recommendations for the future directions of the field.

Index Terms: Embodied AI, computer vision, 3D simulators.

Manuscript received May 9, 2021; revised September 30, 2021 and November 8, 2021; accepted December 4, 2021. This work was supported in part by the Agency for Science, Technology and Research (A*STAR), Singapore, under its AME Programmatic Funding Scheme under Award No. #A18A2b0046, and in part by the National Research Foundation, Singapore, under its NRF-ISF Joint Call under Award No. NRF2015-NRF-ISF001-2541. (Corresponding author: Jiafei Duan.) Jiafei Duan is with the School of Electrical and Electronics Engineering, Nanyang Technological University, Singapore 639798 (e-mail: duanjiafei@hotmail.sg). Samson Yu is with the Singapore University of Technology and Design, Singapore 487372 (e-mail: samsonyu@mymail.sutd.edu.sg). Hui Li Tan, Hongyuan Zhu, and Cheston Tan are with the Institute for Infocomm Research, A*STAR, Singapore 138632 (e-mail: hltan@i2r.a-star.edu.sg; zhuh@i2r.a-star.edu.sg; cheston-tan@i2r.a-star.edu.sg). Digital Object Identifier 10.1109/TETCI.2022.3141105

I. INTRODUCTION

Recent advances in deep learning, reinforcement learning, computer graphics and robotics have garnered growing interest in developing general-purpose AI systems. As a result, there has been a shift from "internet AI," which focuses on learning from datasets of images, videos and text curated from the internet, towards "embodied AI," which enables artificial agents to learn through interactions with their surrounding environments. Embodied AI is the belief that true intelligence can emerge from the interactions of an agent with its environment [1]. But for now, embodied AI is about incorporating traditional intelligence concepts from vision,
language, and reasoning into an artificial embodiment to help solve AI problems in a virtual environment.

The growing interest in embodied AI has led to significant progress in embodied AI simulators that aim to faithfully replicate the physical world. These simulated worlds serve as virtual testbeds to train and test embodied AI frameworks before deploying them into the real world. These embodied AI simulators also facilitate the collection of task-based datasets [2], [3], which are tedious to collect in the real world as it requires an extensive amount of manual labor to replicate the same setting as in the virtual world. While there have been several survey papers in the field of embodied AI [4]-[6], they are mostly outdated as they were published before the modern deep learning era, which started around 2009 [7]-[10]. To the best of our knowledge, there is only one survey paper on evaluating embodied navigation [11].

To address the scarcity of contemporary comprehensive survey papers on this emerging field, we propose this survey paper on the field of embodied AI, from its simulators to its research tasks. This paper covers the following nine embodied AI simulators that were developed over the past four years: DeepMind Lab [12], AI2-THOR [13], CHALET [14], VirtualHome [15], VRKitchen [16], Habitat-Sim [17], iGibson [18], SAPIEN [19], and ThreeDWorld [20]. The chosen simulators are designed for general-purpose intelligence tasks, unlike game simulators [21], which are only used for training reinforcement learning agents. These embodied AI simulators provide realistic representations of the real world in computer simulations, mainly taking the configurations of rooms or apartments that provide some form of constraint to the environment. The majority of these simulators minimally comprise a physics engine, a Python API, and an artificial agent that can be controlled or manipulated within the environment.

Embodied AI simulators have given rise to a series of potential embodied AI research tasks, such as visual exploration, visual navigation and embodied QA. We will focus on these three tasks since most existing papers [11], [22], [23] in embodied AI either focus on these tasks or make use of modules introduced for these tasks to build models for more complex tasks like audio-visual navigation. These three tasks are also connected in increasing complexity. Visual exploration is a very useful component in visual navigation [22], [24] and is used in realistic situations [25], [26], while embodied QA further involves complex QA capabilities that build on top of vision-and-language navigation.
TABLE I
Summary of embodied AI simulators. Environment: game-based scene construction (G) and world-based scene construction (W). Physics: basic physics features (B) and advanced physics features (A). Object Type: dataset driven environments (D) and object assets driven environments (O). Object Property: interact-able objects (I) and multi-state objects (M). Controller: direct Python API controller (P), virtual robot controller (R) and virtual reality controller (V). Action: navigation (N), atomic action (A) and human-computer interaction (H). Multi-Agent: avatar-based (AT) and user-based (U). The seven features can be further grouped under three secondary evaluation features: realism, scalability
and interactivity.

Since language is a common modality and visual QA is a popular task in AI, embodied QA is a natural direction for embodied AI. The three tasks discussed in this paper have been implemented in at least one of the nine proposed embodied AI simulators. However, Sim2Real [27]-[29] and robotics in the physical world will not be covered in this paper.

These simulators are selected based on the embodied AI simulators from the Embodied AI Challenge in the annual Embodied AI workshop [30] at the Conference on Computer Vision and Pattern Recognition (CVPR). The research tasks are then sourced from direct citations of these simulators.

To this end, we will provide a contemporary and comprehensive survey of embodied AI simulators and research through reviewing the development of the field from its simulators to its research. In Section I, this paper outlines the overall structure of this survey. In Section II, this paper benchmarks nine embodied AI simulators to understand their provision for realism, scalability and interactivity, and hence their use in embodied AI research. Next, based upon the simulators, in Section III this paper surveys three main research tasks in embodied AI, namely visual exploration, visual navigation and embodied question answering (QA), covering the state-of-the-art approaches, evaluation and datasets. Lastly, this paper establishes interconnections between the simulators, datasets and research tasks, and existing challenges in embodied AI simulators and research, in Section IV. This survey paper provides a comprehensive look into the emerging field of embodied AI and further unveils new insights and challenges of the field. Furthermore, through this paper, we seek to help AI researchers in selecting the ideal embodied AI simulators for their research tasks of interest.

II. SIMULATORS FOR EMBODIED AI

In this section, the backgrounds of the embodied AI
simulators will be presented in the supplementary material, and the features of the embodied AI simulators will be compared and discussed in Section II-A.

Fig. 1. Comparison between game-based scenes (G) and world-based scenes (W). Game-based scenes (G) are constructed from 3D object assets, while world-based scenes (W) are constructed from real-world scans of the environment.

A. Embodied AI Simulators

This section presents the backgrounds of the nine embodied AI simulators: DeepMind Lab, AI2-THOR, SAPIEN, VirtualHome, VRKitchen, ThreeDWorld, CHALET, iGibson, and Habitat-Sim. Readers can refer to the supplementary material for more details on the respective simulators. In this section, the paper comprehensively compares the nine embodied AI simulators based on seven technical features. Referencing [13], [20], [31], these seven technical features are selected as the primary features to evaluate the embodied AI simulators, as they cover the essential aspects required to accurately replicate the environment, interactions and state of the physical world, hence providing suitable testbeds for testing intelligence with embodiment. Referring to Table I, the seven features are: Environment, Physics, Object Type, Object Property, Controller, Action, and Multi-Agent.

1) Environment: There are two main methods of constructing the embodied AI simulator environment: game-based scene construction (G) and world-based scene construction (W). Referring to Fig. 1, game-based scenes are constructed from 3D assets, while world-based scenes are constructed from real-world scans of the objects and the environment. A 3D environment constructed entirely out of 3D assets often has built-in physics features and object classes that are well-segmented when compared to a 3D mesh of an environment made from real-world
scanning.

Fig. 2. Comparison between basic physics features (B), such as rigid-body dynamics and collision, and advanced physics features (A), which include cloth, soft-body, and fluid physics.

The clear object segmentation of the 3D assets makes it easy to model them as articulated objects with movable joints, such as the 3D models provided in PartNet [32]. In contrast, real-world scans of environments and objects provide higher fidelity and a more accurate representation of the real world, facilitating better transfer of agent performance from simulation to the real world. As observed in Table I, most simulators other than Habitat-Sim and iGibson have game-based scenes, since significantly more resources are required for world-based scene construction.

2) Physics: A simulator has to construct not only realistic environments but also realistic interactions between agents and objects, or between objects and objects, that model real-world physics properties. We study the simulators' physics features, which we broadly classify into basic physics features (B) and advanced physics features (A). Referring to Fig. 2, basic physics features include collision, rigid-body dynamics, and gravity modelling, while advanced physics features include cloth, fluid, and soft-body physics. As most embodied AI simulators construct game-based scenes with in-built physics engines, they are equipped with the basic physics features. On the other hand, simulators like ThreeDWorld, where the goal is to understand how a complex physics environment can shape the decisions of the artificial agent in the environment, are equipped with more advanced physics capabilities. For simulators that focus on interactive navigation-based tasks, basic physics features are generally sufficient.

3) Object Type: As shown in Fig. 3, there are two main sources for objects that are used to create the simulators.
43、peis the dataset driven environment,where the objects are mainlyfrom existing object datasets such as the SUNCG 33 dataset,the Matterport3D dataset 34 and the Gibson dataset 35.Thesecond type is the asset driven environment,where the objectsare from the net such as the Unity 3D game asset store.Adif
44、ference between the two sources is the sustainability of theobject dataset.The dataset driven objects are more costly tocollect than the asset driven objects,as anyone can contributeto the 3D object models online.However,it is harder to ensureFig.3.Comparison between dataset driven environment(D)whi
45、ch are con-structed from 3D objects datasets and object assets driven environment(O)areconstructed based 3D objects obtain from the assets market.Fig.4.Comparison between direct Python API controller(P),robotics em-bodiment(R)which refers to real-world robots with a virtual replica and lastlythe vir
the quality of the 3D object models in the asset driven objects than in the dataset driven objects. Based on our review, the game-based embodied AI simulators are more likely to obtain their object datasets from asset stores, whereas the world-based simulators tend to import their object datasets from existing 3D object datasets.

4) Object Property: Some simulators only enable objects with basic interactivity, such as collision. Advanced simulators enable objects with more fine-grained interactivity, such as multiple-state changes. For instance, when an apple is sliced, it will undergo a state change into apple slices. Hence, we categorize these different levels of object interaction into simulators with interact-able objects (I) and multiple-state objects (M). Referring to Table I, a few simulators, such as AI2-THOR and VRKitchen, enable multiple state changes, providing a platform for understanding how objects will react and change their states when acted upon in the real world.

5) Controller: Referring to Fig. 4, there are different types of controller interfaces between the user and the simulator, from the direct Python API controller (P) and the virtual robot controller (R) to the virtual reality controller (V). Robotics embodiment allows for virtual interaction with existing real-world robots such as Universal Robot 5 (UR5) and TurtleBot V2, which can be controlled directly using a ROS interface. The virtual reality controller interfaces provide more immersive human-computer interaction and facilitate deployment using their real-world counterparts. For instance, simulators such as iGibson and AI2-THOR, which are primarily designed for visual navigation, are also equipped with virtual robot controllers for ease of deployment on their real-world counterparts, such as iGibson's Castro [36] and RoboTHOR [37] respectively.
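Most of the nine simulators expose the direct Python API controller described above. As a rough illustration of what that interface looks like in practice, the sketch below uses an AI2-THOR-style control loop; the scene name, action strings and metadata fields follow the ai2thor Python package as we understand it, but exact names and signatures vary across versions, so this should be read as an assumption-laden sketch rather than canonical usage.

```python
# Minimal sketch of a direct Python API controller loop, in the style of AI2-THOR.
# Assumes the `ai2thor` package is installed; action names and metadata keys are
# version-dependent and given here for illustration only.
from ai2thor.controller import Controller

controller = Controller(scene="FloorPlan1")  # a kitchen scene in AI2-THOR

for action in ["MoveAhead", "RotateRight", "MoveAhead", "LookUp"]:
    event = controller.step(action=action)
    rgb = event.frame                                  # egocentric RGB observation (H x W x 3)
    success = event.metadata["lastActionSuccess"]      # whether the action was executed
    agent_pose = event.metadata["agent"]["position"]   # agent position in the scene
    print(action, success, agent_pose)

controller.stop()
```

A virtual robot controller would replace the step() calls with ROS messages to a simulated robot, while a virtual reality controller replaces them with tracked human input, but the observe-act loop remains the same.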
6) Action: There are differences in the complexity of an artificial agent's action capabilities in the embodied AI simulators, ranging from only being able to perform primary navigation manoeuvres to higher-level human-computer actions via virtual reality interfaces. This paper classifies them into three tiers of robotics manipulation: navigation (N), atomic action (A) and human-computer interaction (H). Navigation is the lowest tier and is a common feature in all embodied AI simulators [38]. It is defined by the agent's capability of navigating around its virtual environment. Atomic action provides the artificial agent with a means of performing basic discrete manipulation of an object of interest and is found in most embodied AI simulators. Human-computer interaction is the result of the virtual reality controller, as it enables humans to control virtual agents to learn and interact with the simulated world in real time [16]. Most of the larger-scale navigation-based simulators, such as AI2-THOR, iGibson and Habitat-Sim, tend to have navigation, atomic action and ROS support [13], [17], [35], which enable them to provide better control and manipulation of objects in the environment while performing tasks such as Point Navigation or Object Navigation. On the other hand, simulators such as ThreeDWorld and VRKitchen [16], [20] fall under the human-computer interaction category, as they are constructed to provide a highly realistic physics-based simulation and multiple state changes. This is only possible with human-computer interaction, as human-level dexterity is needed when interacting with these virtual objects.

7) Multi-Agent: Referring to Table I, only a few simulators, such as AI2-THOR, iGibson and ThreeDWorld, are equipped with a multi-agent setup, as current research involving multi-agent reinforcement learning is scarce. In general, the simulators need to be rich in object content before there is any practical value in constructing such multi-agent features, which are used for both adversarial and collaborative training [39], [40] of artificial agents. As a result of this lack of multi-agent supported simulators, there have been fewer research tasks that utilize the multi-agent feature in these embodied AI simulators.

Multi-agent reinforcement learning based training is still currently mostly done in OpenAI Gym environments [41]. There are two distinct multi-agent settings. The first is the avatar-based (AT) multi-agent setting in ThreeDWorld [20], which allows for interaction between artificial agents and simulation avatars. The second is
the user-based (U) multi-agent setting in AI2-THOR [13], in which agents can take on the role of a dual learning network and learn from interacting with other artificial agents in the simulation to achieve a common task [42].

B. Comparison of Embodied AI Simulators

Building on the seven features and a study on embodied AI from the Allen Institute for Artificial Intelligence [31], we propose a secondary set of evaluation features for the simulators. It comprises three key features, realism, scalability and interactivity, as shown in Table I. The realism of the 3D environments can be attributed to the environment and physics of the simulators. The environment models the real world's physical appearance, while the physics models the complex physical properties within the real world. Scalability of the 3D environments can be attributed to the object type. Expansion can be done by collecting more 3D scans of the real world for dataset driven objects, or by purchasing more 3D assets for asset driven objects. Interactivity is attributed to object property, controller, action and multi-agent.

Fig. 5. A pyramid hierarchical structure of the various embodied AI research tasks with increasing complexity of tasks.

Based on the secondary evaluation features of embodied AI simulators, the seven primary features from Table I and Fig. 6, simulators which possess all three secondary features (e.g. AI2-THOR, iGibson and Habitat-Sim) are more well-received and widely used for a diverse range of embodied AI research tasks. Furthermore, a comprehensive quantitative comparison is made for all the embodied AI simulators to compare the environment configuration and the technical performance of each simulator. The environment configuration feature is very much dependent on the applications suggested by the creators of the simulators, while other features like technical specification and rendering performance are largely due to the simulation engine used for their creation. AI2-THOR has the largest environment configurations compared to the other simulators, while Habitat-Sim and iGibson are the top two performers in graphic rendering performance. This benchmark of quantitative performance, shown in Table II, further demonstrates the superiority and complexity of these three embodied AI simulators. These comparisons of the embodied AI simulators further reinforce the importance of the seven primary evaluation metrics and the three secondary evaluations that the paper has established
to help select the ideal simulator for the research task.

Fig. 6. Connections between embodied AI simulators and research. (Top) Nine up-to-date embodied AI simulators. (Middle) The various embodied AI research tasks that result from the nine embodied AI simulators. The research tasks colored red are grouped under the visual navigation category, while the rest of the yellow colored tasks are the other research categories. (Bottom) The evaluation datasets used in the evaluation of the research tasks in one of the nine embodied AI simulators.

TABLE II
Comparison of embodied AI simulators in terms of environment configuration, simulation engine, technical specification, and rendering performance.

III. RESEARCH IN EMBODIED AI

In this section, we discuss the various embodied AI research tasks that depend on the nine embodied AI simulators surveyed in the previous section. There are multiple motivations for the recent increase in embodied AI research. From a cognitive science and psychology perspective, the embodiment hypothesis [1] suggests that intelligence arises from interactions with an environment and as a result of sensorimotor activity [66]. Intuitively, humans do not learn solely through the "internet AI" paradigm, where most experiences are randomized and passive (i.e. externally curated). Humans also learn through active perception, movement, interaction and communication. From an AI perspective, current research tasks in embodied AI allow for greater generalization to unseen environments [44] for robotic
functions like mapping and navigation, and greater robustness to sensor noise as compared to classical methods, due to the learning involved. Embodied AI also enables flexibility and possibly greater performance, since various modalities like depth, language [59] and audio [67] can be easily integrated through learning-based approaches.

The three main types of embodied AI research tasks are visual exploration, visual navigation and embodied QA. We will focus on these three tasks since most existing papers in embodied AI either focus on these tasks or make use of modules introduced for these tasks to build models for more complex tasks like audio-visual navigation. The tasks increase in complexity as they advance from exploration to QA. We will start with visual exploration before moving to visual navigation and finally embodied QA. Each of these tasks makes up the foundation for the next task(s), forming a pyramid structure of embodied AI research tasks as shown in Fig. 5, further suggesting a natural direction for embodied AI. We will highlight important aspects for each task, from the summary and the methodologies to the evaluation metrics and the datasets. These task details are found in Table III.
A. Visual Exploration

In visual exploration [24], [68], an agent gathers information about a 3D environment, typically through motion and perception, to update its internal model of the environment [11], [22], which might be useful for downstream tasks like visual navigation [24], [25], [69]. The aim is to do this as efficiently as possible (e.g. with as few steps as possible). The internal model can take forms like a topological graph map [26], semantic map [46], occupancy map [45] or spatial memory [70], [71]. These map-based architectures can capture geometry and semantics, allowing for more efficient policy learning and planning [45] as compared to reactive and recurrent neural network policies [72]. Visual exploration is usually done either before or concurrently with navigation tasks. In the first case, visual exploration builds the internal memory as priors that are useful for path-planning in downstream navigation tasks. The agent is free to explore the environment within a certain budget (e.g. a limited number of steps) before the start of navigation [11]. In the latter case, the agent builds the map as it navigates an unseen test environment [48], [73], [74], which makes it more tightly integrated with the downstream task. In this section, we build upon existing visual exploration survey papers [22], [24] to include more recent works and directions.

In classical robotics, exploration is done through passive or active simultaneous localisation and mapping (SLAM) [24], [45] to build a map of the environment. This map is then used with localization and path-planning for navigation tasks. SLAM is very well-studied [75], but the purely geometric approach has room for improvement. Since such methods rely on sensors, they are susceptible to measurement noise [24] and would need extensive fine-tuning. On the other hand, learning-based approaches that typically use RGB and/or depth sensors are more robust to noise [24], [44].
TABLE III
Summary of embodied AI research tasks. Evaluation metrics: amount of targets visited (ATV), downstream tasks (D), success weighted by path length (SPL), success rate (SR), path length ratio (PLR), oracle success rate (OSR), trajectory/episode length (TL/EL), distance to success/navigation error (DTS/NE/dT), goal progress (GP/dΔ), oracle path success rate (OPSR), smallest distance to target at any point in an episode (dmin), percentage of episodes where the agent ends navigation for answering before the maximum episode length (%stop), percentage of questions where the agent terminates in the room containing the target object (%rT), percentage of questions where the agent enters the room containing the target object at least once (%re), intersection over union for the target object (IoU), hit accuracy based on IoU (hT), mean rank of the ground-truth answer in QA predictions (MR) and QA accuracy (Acc).
Furthermore, learning-based approaches in visual exploration allow an artificial agent to incorporate semantic understanding (e.g. object types in the environment) [45] and to generalize its knowledge of previously seen environments to help with understanding novel environments in an unsupervised manner. This reduces reliance on humans and thus improves efficiency.

Learning to create useful internal models of the environment in the form of maps can improve the agent's performance [45], whether it is done before (i.e. with unspecified downstream tasks) or concurrently with downstream tasks. Intelligent exploration would also be especially useful in cases where the agent has to explore novel environments that dynamically unfold over time [76], such as for rescue robots and deep-sea exploration robots.

1) Approaches: In this section, the non-baseline approaches in visual exploration are typically formalized as partially observed Markov decision processes (POMDPs) [77]. A POMDP can be represented by a 7-tuple (S, A, T, R, Ω, O, γ) with state space S, action space A, transition distribution T, reward function R, observation space Ω, observation distribution O and discount factor γ ∈ [0, 1]. In general, these approaches can be viewed as specifying a particular reward function in the POMDP [22].

a) Baselines: Visual exploration has a few common baselines [22]. For random-actions [17], the agent samples from a uniform distribution over all actions. For forward-action, it always chooses the forward action. For forward-action+, the agent chooses the forward action, but turns left if it collides. For frontier-exploration, the agent iteratively visits the edges between free and unexplored space using a map [24], [78].
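To make the frontier-exploration baseline concrete, the sketch below (our own illustration, not code from the cited works) extracts frontier cells, i.e. free cells bordering unexplored space, from a 2D occupancy grid; a planner would then repeatedly navigate to the nearest frontier until none remain.

```python
import numpy as np

# Occupancy grid convention assumed here: 0 = free, 1 = occupied, -1 = unexplored.
def frontier_cells(grid: np.ndarray) -> list[tuple[int, int]]:
    """Return free cells that border at least one unexplored cell."""
    frontiers = []
    rows, cols = grid.shape
    for r in range(rows):
        for c in range(cols):
            if grid[r, c] != 0:          # only free cells can be frontiers
                continue
            neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            if any(0 <= nr < rows and 0 <= nc < cols and grid[nr, nc] == -1
                   for nr, nc in neighbours):
                frontiers.append((r, c))
    return frontiers

def nearest_frontier(grid: np.ndarray, agent: tuple[int, int]):
    """Pick the frontier closest to the agent (Euclidean distance for brevity;
    a real planner would use geodesic distance over the free space)."""
    frontiers = frontier_cells(grid)
    if not frontiers:
        return None                      # exploration is complete
    return min(frontiers,
               key=lambda f: (f[0] - agent[0]) ** 2 + (f[1] - agent[1]) ** 2)
```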
b) Curiosity: In the curiosity approach, the agent seeks states that are difficult to predict. The prediction error is used as the reward signal for reinforcement learning [79], [80]. This focuses on intrinsic rewards and motivation rather than external rewards from the environment, which is beneficial in cases where external rewards are sparse [81]. There is usually a forward-dynamics model that minimises the loss L(ŝ_{t+1}, s_{t+1}), where ŝ_{t+1} is the predicted next state if the agent takes action a_t when it is in state s_t, while s_{t+1} is the actual next state that the agent ends up in. Practical considerations for curiosity have been listed in recent work [79], such as using Proximal Policy Optimization (PPO) for policy optimisation. Curiosity has also been used to generate more advanced maps like semantic maps in recent work [43].

Stochasticity poses a serious challenge in the curiosity approach, since the forward-dynamics model can exploit stochasticity [79] for high prediction errors (i.e. high rewards). This can arise due to factors like the "noisy-TV" problem or noise in the execution of the agent's actions [81]. One proposed solution is the use of an inverse-dynamics model [68] that estimates the action a_{t-1} taken by the agent to move from its previous state s_{t-1} to its current state s_t, which helps the agent understand what its actions can control in the environment. While this method attempts to address stochasticity due to the environment, it may be insufficient in addressing stochasticity that results from the agent's own actions. One example is the agent's use of a remote controller to randomly change TV channels, allowing it to accumulate rewards without progress. To address this more challenging issue specifically, a few methods have been proposed recently. Random Network Distillation [82] is one method that predicts the output of a randomly initialized neural network, as the answer is a deterministic function of its inputs. Another method is Exploration by Disagreement [81], where the agent is incentivised to explore the part of the action space which has the maximum disagreement, or variance, between the predictions of an ensemble of forward-dynamics models. The models converge to the mean, which reduces the variance of the ensemble and prevents it from getting stuck in stochasticity traps.
c) Coverage: In the coverage approach, the agent tries to maximise the amount of targets it directly observes. Typically, this would be the area seen in an environment [22], [24], [44]. Since the agent uses egocentric observations, it has to navigate around possibly obstructive 3D structures. One recent method combines classic and learning-based methods [44]. It uses analytical path planners with a learned SLAM module that maintains a spatial map, to avoid the high sample complexities involved in training end-to-end policies. This method also includes noise models to improve physical realism for generalisability to real-world robotics. Another recent work is a scene memory transformer, which uses the self-attention mechanism adapted from the Transformer model [83] over a scene memory in its policy network [72]. The scene memory embeds and stores all encountered observations, allowing for greater flexibility and scalability as compared to a map-like memory that requires inductive biases.

d) Reconstruction: In the reconstruction approach, the agent tries to recreate other views from an observed view. Past work focuses on pixel-wise reconstruction of 360° panoramas and CAD models [84], [85], which usually involves curated datasets of human-taken photos [45]. Recent work has adapted this approach for embodied AI, which is more complex because the model has to perform scene reconstruction from the agent's egocentric observations and control its own sensors (i.e. active perception). In one recent work, the agent uses its egocentric RGB-D observations to reconstruct the occupancy state beyond visible regions and aggregates its predictions over time to form an accurate occupancy map [45]. The occupancy anticipation is a pixel-wise classification task, where each cell in a local area of V × V cells in front of the camera is assigned probabilities of being explored and occupied. As compared to the coverage approach, anticipating the occupancy state allows the agent to deal with regions that are not directly observable. Another recent work focuses on semantic reconstruction rather than pixel-wise reconstruction [22]. The agent is designed to predict whether semantic concepts like "door" are present at sampled query locations. Using a K-means approach, the true reconstruction concepts for a query location are the J nearest cluster centroids to its feature representation.
The agent is rewarded if it obtains views that help it predict the true reconstruction concepts for sampled query views.

2) Evaluation Metrics: The main metric is the amount of targets visited. Different types of targets are considered, such as area [44], [86] and interesting objects [72], [87]. The area visited metric has a few variants, such as the absolute coverage area in m² and the percentage of the area explored in the scene.

a) Impact on downstream tasks: Visual exploration performance can also be measured by its impact on downstream tasks like visual navigation. This evaluation metric category is more commonly seen in recent works. Examples of downstream tasks that make use of visual exploration outputs (i.e. maps) include Image Navigation [26], [73], Point Navigation [11], [44] and Object Navigation [53], [54], [56]. More details about these navigation tasks can be found in Section III-B.

3) Datasets: For visual exploration, some popular datasets include Matterport3D and Gibson V1. Matterport3D and Gibson V1 are both photorealistic RGB datasets with useful information for embodied AI like depth and semantic segmentations. The Habitat-Sim simulator allows these datasets to be used with extra functionalities like configurable agents and multiple sensors. Gibson V1 has also been enhanced with features like interactions and realistic robot control to form iGibson. However, more recent 3D simulators like those mentioned in Section II can all be used for visual exploration, since they all offer RGB observations at the very least.
B. Visual Navigation

In visual navigation, an agent navigates a 3D environment to a goal, with or without external priors or natural language instruction. Many types of goals have been used for this task, such as points, objects, images [88], [89] and areas [11]. We will focus on points and objects as goals for visual navigation in this paper, as they are the most common and fundamental goals. They can be further combined with specifications like perceptual inputs and language to build towards more complex visual navigation tasks, such as Navigation with Priors, Vision-and-Language Navigation and even Embodied QA. In point navigation [49], the agent is tasked to navigate to a specific point, while in object navigation [38], [52], the agent is tasked to navigate to an object of a specific class.

Classic navigation approaches [90] are usually composed of hand-engineered sub-components like localization, mapping [91], path planning [92], [93] and locomotion. In contrast, visual navigation in embodied AI aims to learn these navigation systems from data, so as to reduce case-specific hand-engineering and to ease integration with downstream tasks such as question answering [23], which attain superior performance with data-driven learning methods. There are also hybrid approaches [44] that aim to combine the best of both worlds. As previously mentioned in Section II, learning-based approaches are more robust to sensor measurement noise, as they use RGB and/or depth sensors and are able to incorporate semantic understanding of an environment. Furthermore, they enable an agent to generalize its knowledge of previously seen environments to help understand novel environments in an unsupervised manner, reducing human effort.

Along with the increase in research in recent years, challenges have also been organised for visual navigation in the fundamental point navigation and object navigation tasks, to benchmark and accelerate progress in embodied AI [38]. The most notable challenges are the iGibson Sim2Real Challenge, the Habitat Challenge [36] and the RoboTHOR Challenge. For each challenge, we describe the 2020 version, which is the latest as of this paper. In all three challenges, the agent is limited to egocentric RGB-D observations. For the iGibson Sim2Real Challenge 2020, the specific task is point navigation. 73 high-quality Gibson 3D scenes are used for training, while the Castro scene, the reconstruction of a real-world apartment, is used for training, development and testing. There are three scenarios: the environment is free of obstacles, contains obstacles that the agent can interact with, and/or is populated with other moving agents. For the Habitat Challenge 2020, there are both point navigation and object navigation tasks. Gibson 3D scenes with Gibson dataset splits
are used for the point navigation task, while 90 Matterport3D scenes with the 61/11/18 training/validation/test house splits specified by the original dataset [11], [34] are used for the object navigation task. For the RoboTHOR Challenge 2020, there is only the object navigation task. The training and evaluation are split into three phases. In the first phase, the agent is trained on 60 simulated apartments and its performance is validated on 15 other simulated apartments. In the second phase, the agent is evaluated on four simulated apartments and their real-world counterparts, to test its generalisation to the real world. In the last phase, the agent is evaluated on 10 real-world apartments. In this section, we build upon existing visual navigation survey papers [11], [23], [74] to include more recent works.

1) Categories: Point Navigation has been one of the foundational and more popular tasks [44] in recent visual navigation literature. In point navigation, an agent is tasked to navigate to any position within a certain fixed distance from a specific point [11]. Generally, the agent is initialized at the origin (0, 0, 0) in an environment, and the fixed goal point is specified by 3D coordinates (x, y, z) relative to the origin/initial location [11]. For the task to be completed successfully, the artificial agent needs to possess a diverse range of skillsets such as visual perception, episodic memory construction, reasoning/planning, and navigation. The agent is usually equipped with a GPS and compass that allow it to access its location coordinates, and implicitly its orientation relative to the goal position [17], [49]. The target's relative goal coordinates can either be static (i.e. given only once, at the beginning of the episode) or dynamic (i.e. given at every time-step) [17]. More recently, given imperfect localization in indoor environments, the Habitat Challenge 2020 has moved on to the more challenging task [47] of RGB-D-based online localization without the GPS and compass.
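As a concrete reading of the GPS-plus-compass goal specification above, the sketch below (our own simplification, not code from any of the cited challenges) converts a goal given in world coordinates into the agent-relative distance and heading that a dynamic point-goal sensor would report at every time-step.

```python
import math

def relative_goal(agent_xy, agent_heading, goal_xy):
    """Return (distance, angle) to the goal in the agent's frame.

    agent_xy, goal_xy: (x, y) positions in the world frame (from the GPS).
    agent_heading: agent yaw in radians in the world frame (from the compass).
    """
    dx = goal_xy[0] - agent_xy[0]
    dy = goal_xy[1] - agent_xy[1]
    distance = math.hypot(dx, dy)
    # Bearing to the goal in the world frame, expressed relative to the agent's
    # current heading and wrapped into [-pi, pi).
    bearing = math.atan2(dy, dx) - agent_heading
    angle = (bearing + math.pi) % (2 * math.pi) - math.pi
    return distance, angle

# Example: agent at (1, 0) facing along +x, goal at (1, 2):
# the goal is 2 units away, 90 degrees to the agent's left.
print(relative_goal((1.0, 0.0), 0.0, (1.0, 2.0)))
```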
There have been many learning-based approaches to point navigation in recent literature. One of the earlier works [74] uses an end-to-end approach to tackle point navigation in a realistic autonomous navigation setting (i.e. an unseen environment with no ground-truth maps and no ground-truth agent poses) with different sensory inputs. The base navigation algorithm is Direct Future Prediction (DFP) [94], where relevant inputs such as color images, depth maps and actions from the four most recent observations are processed by appropriate neural networks (e.g. convolutional networks for sensory inputs) and concatenated to be passed into a two-stream fully connected action-expectation network. The outputs are the future measurement predictions for all actions and future time steps. The authors also introduce the Belief DFP (BDFP), which is intended to make the DFP's black-box policy more interpretable by introducing an intermediate map-like representation in future measurement prediction. This is inspired by the attention mechanism in neural networks, and by successor representations [95], [96] and successor features [97] in reinforcement learning. Experiments show that the BDFP outperforms the DFP in most cases, and that classic navigation approaches generally outperform learning-based ones with RGB-D inputs.

[98] provides a more modular approach. For point navigation, SplitNet's architecture consists of one visual encoder and multiple decoders for different auxiliary tasks (e.g. egomotion prediction) and the policy. These decoders aim to learn meaningful representations. With the same PPO algorithm [99] and behavioral cloning training, SplitNet can outperform comparable end-to-end methods in previously unseen environments.

Another work presents a modular architecture for simultaneous mapping and target-driven navigation in indoor environments [48]. In this work, the authors build upon MapNet [71] to include a 2.5D memory with semantically-informed features and train an LSTM for the navigation policy. They show that this method outperforms a learned LSTM policy without a map [100] in previously unseen environments.

With the introduction of the Habitat Challenge in 2019 and its standardized evaluation, dataset and sensor setups, more recent approaches have been evaluated on the Habitat Challenge 2019. The first work comes from the team behind Habitat, and uses the PPO algorithm, the actor-critic model structure and a CNN for producing embeddings for visual inputs. A follow-up
work provides an "existence proof" that near-perfect results can be achieved for the point navigation task for agents with a GPS, a compass and a huge number of learning steps (2.5 billion steps, as compared to Habitat's first PPO work with 75 million steps) in unseen simulated environments [47]. Specifically, the best agent's performance is within 3-5% of the shortest-path oracle. This work uses a modified PPO with Generalized Advantage Estimation [101] algorithm that is suited for distributed reinforcement learning in resource-intensive simulated environments, namely Decentralized Distributed Proximal Policy Optimization (DD-PPO). At every time-step, the agent receives an egocentric observation (depth or RGB), gets embeddings with a CNN, utilizes its GPS and compass to update the target position to be relative to its current position, and then finally outputs the next action and an estimate of the value function. The experiments show that the agents continue to improve for a long time, and the results nearly match those of a shortest-path oracle.

The next work aims to improve on this resource-intensive work by increasing sample and time efficiency with auxiliary tasks [49]. Using the same DD-PPO baseline architecture from the previous work, this work adds three auxiliary tasks: action-conditional contrastive predictive coding (CPC|A) [102], inverse dynamics [68] and temporal distance estimation. The authors experiment with different ways of combining the representations. At 40 million frames, the best performing agent reaches the previous work's performance 5.5 times faster, and even exceeds it.

The winner of the Habitat Challenge 2019 for both the RGB and the RGB-D tracks [44] provides a hybrid solution that combines both classic and learning-based approaches, as end-to-end learning-based approaches are computationally expensive. This work incorporates learning in a modular fashion into a "classic navigation pipeline," thus implicitly incorporating the knowledge of obstacle avoidance and control in low-level navigation. The architecture consists of a learned Neural SLAM module, a global policy, a local policy and an analytical path planner. The Neural SLAM module predicts a map and agent pose estimate using observations and sensors. The global policy always outputs the target coordinates as the long-term goal, which is converted to a short-term goal using the analytical path planner. Finally, a local policy is trained to
navigate to this short-term goal. The modular design and use of analytical planning help to significantly reduce the search space during training.

Object Navigation is one of the most straightforward, yet one of the most challenging, tasks in embodied AI. Object navigation focuses on the fundamental idea of navigating to an object specified by its label in an unexplored environment [38]. The agent is initialized at a random position and is tasked to find an instance of an object category within that environment. Object navigation is generally more complex than point navigation, since it not only requires many of the same skillsets, such as visual perception and episodic memory construction, but also semantic understanding. These are what make the object navigation task much more challenging, but also rewarding, to solve.

The task of object navigation can be demonstrated or learnt through adaptation, which helps to generalize navigation in an environment without any direct supervision. One work [51] achieves this through a meta-reinforcement learning approach, in which the agent learns a self-supervised interaction loss that helps to encourage effective navigation. Unlike conventional navigation approaches, in which the agent freezes the learning model during inference, this work allows the agent to adapt itself in a self-supervised manner and to adjust or correct its mistakes afterwards. This approach prevents the agent from making too many mistakes before realizing them and making the necessary corrections. Another method is to learn the relationships
153、theobjectrelationshipbe-tween objects before executing the planning of navigation.Thiswork 53 implements an object relation graph(ORG)whichis not from external prior knowledge but rather a knowledgegraphthatisbuiltduringthevisualexplorationphase.Thegraphconsists of object relationships such as categ
154、ory closeness andspatial correlations.Navigation with Priors focuses on the idea of injecting se-mantic knowledge or priors in the form of multimodal inputssuch as knowledge graph or audio input or to aid in the trainingof navigation tasks for embodied AI agents in both seen andunseen environments.P
155、ast work 57 that use human priorsof knowledge integrated into a deep reinforcement learningframework has shown that artificial agent can tap onto human-like semantic/functional priors to aid the agent in learning tonavigate and find unseen objects in the unseen environment.Such example taps onto the
156、 understanding that the items ofinterest,such as finding an apple in the kitchen,humans willtend to look at logical locations to begin our search.Theseknowledge are encoded in a graph network and trained uponin a deep reinforcement learning framework.There are other examples of using human priors su
157、ch ashumans ability to perceive and capture correspondences be-tweenanaudiosignalmodalandthephysicallocationofobjectshence to perform navigation to the source of the signal.In thiswork 103,artificial agents pick multiple sensory observationssuch as vision and sound signal of the target objects and f
figure out the shortest trajectory to navigate from their starting location to the source of the sounds. This work achieves it through a visual perception mapper, a sound perception module and dynamic path planners.

Vision-and-Language Navigation (VLN) is a task where agents learn to navigate the environment by following natural language instructions. The challenging aspect of this task is to perceive both the visual scene and the language sequentially. VLN remains a challenging task as it requires agents to make predictions of future actions based on past actions and instructions [11]. Furthermore, agents might not be able to align their trajectories seamlessly with natural language instructions. Although vision-and-language navigation and visual question answering (VQA) might seem similar, there are major differences between the two tasks. Both tasks can be formulated as visually grounded, sequence-to-sequence transcoding problems. However, VLN sequences are much longer, and the task requires vision data to be constantly fed as input along with the ability to manipulate camera viewpoints, as compared to VQA, where a single input question is fed in and an answer is generated. We are now able to give a natural language instruction to a robot and expect it to perform the task [2], [3], [58]. This is achieved with the advancement of recurrent neural network methods [58] for the joint interpretation of both visual and natural language inputs, and of datasets designed to simplify the process of task-based instruction for navigation and for performing tasks in the 3D environment.

One approach for VLN is the Auxiliary Reasoning Navigation framework [59]. It tackles four auxiliary reasoning tasks: trajectory retelling, progress estimation, angle prediction and cross-modal matching. The agent learns to reason about previous actions and to predict future information through these tasks.

Vision-dialog navigation is the latest extension of VLN, as it aims to train an agent to develop the ability to engage in a constant natural language conversation with humans to aid its navigation. The current work [60] in this area uses a Cross-modal Memory Network (CMN) that remembers and
165、erstandsusefulinformationrelatedtopastnavigationactionsthroughseparatelanguagememoryandvisualmemorymodules,and further uses it to make decisions for navigation.2)Evaluation Metrics:Visual navigation uses(1)successweightedby(normalizedinverse)pathlength(SPL)and(2)suc-cess rate as the main evaluation
166、metrics 11.Success weightedby path length can be defined as:1N?Ni=1Silimax(pi,li).Siis a success indicator for episode i,piis the agents pathlength,liis the shortest path length and N is the number ofepisodes.It is noteworthy that there are some known issueswith success weighted by path length 38.Su
167、ccess rate isthe fraction of the episodes in which the agent reaches thegoal within the time budget 74.There are also other lesscommon evaluation metrics 11,48,52,54,74 in ad-dition to the two mentioned,namely:(3)path length ratio,which is the ratio between the predicted path and the short-est path
168、length and is calculated only for successful episodes;(4)distance to success/navigation error,which measures thedistance between the agents final position and the successthresholdboundaryaroundthenearestobjectorthegoallocationrespectively.Besides the above four metrics,there are another two metricsu
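Both headline metrics can be computed directly from per-episode logs. The helper below is a direct transcription of the SPL and success rate definitions above, written as our own sketch rather than code taken from the cited benchmarks.

```python
def success_rate(successes):
    """Fraction of episodes in which the agent reached the goal within the budget."""
    return sum(successes) / len(successes)

def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by (normalized inverse) path length.

    successes: per-episode success indicators S_i (0 or 1).
    shortest_lengths: shortest path lengths l_i from start to goal.
    path_lengths: lengths p_i of the paths the agent actually took.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)
    return total / len(successes)

# Example: two episodes; the second succeeds but takes 1.5x the shortest path.
print(success_rate([0, 1]))                     # 0.5
print(spl([0, 1], [10.0, 10.0], [4.0, 15.0]))   # 0.333...
```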
Besides the above four metrics, there are another two metrics used to evaluate VLN agents. They are: (1) oracle success rate, the rate at which the agent stops at the closest point to the goal along its trajectory; and (2) trajectory length. In general, for VLN tasks, the best metric is still SPL, as it takes into account the path taken and not just the goal. For vision-dialog navigation, in addition to success rate and oracle success rate, there are another two metrics used: (1) goal progress, the average agent progress towards the goal location; and (2) oracle path success rate, the success rate of the agent stopping at the closest point to the goal along the shortest path.

3) Datasets: As in visual exploration, Matterport3D and Gibson V1 are the most popular datasets. It is noteworthy that the scenes in Gibson V1 are smaller and usually have shorter episodes (a lower geodesic distance (GDSP) from start position to goal position). The AI2-THOR simulator/dataset is also used. Unlike the rest of the visual navigation tasks, VLN requires a different kind of dataset. Most of the VLN works use the Room-to-Room (R2R) dataset with the Matterport3D Simulator [104]. It consists of 21,567 navigation instructions with an average length of 29 words. In vision-dialog navigation [59], the Cooperative
Vision-and-Dialog Navigation (CVDN) [105] dataset is used. It comprises 2,050 human-to-human dialogs and over 7,000 trajectories within the Matterport3D Simulator.

C. Embodied Question Answering

The task of embodied question answering (QA) in recent embodied AI simulators has been a significant advancement in the field of general-purpose intelligence systems. To perform QA in a state of physical embodiment, an AI agent would need to possess a wide range of AI capabilities such as visual recognition, language understanding, question answering, commonsense reasoning, task planning, and goal-driven navigation. Hence, embodied QA can be considered the most onerous and complicated task in embodied AI research currently.

1) Categories: For embodied QA (EQA), a common framework divides the task into two sub-tasks: a navigation task and a QA task. The navigation module is essential, since the agent needs to explore the environment to see the objects before answering questions about them. For example, [61] proposed the Planner-Controller Navigation Module (PACMAN), which comprises a hierarchical structure for the navigation module, with a planner that selects actions (directions) and a controller that decides how far to move following each action. Once the agent decides to stop, the QA module is executed using the sequence of frames along its path. The navigation module and the visual question answering module are first trained individually and then jointly trained with REINFORCE [106]. [62] and [63] further improved the PACMAN model with Neural Modular Control (NMC), where a higher-level master policy proposes semantic sub-goals to be executed by sub-policies.

Multi-target embodied QA (MT-EQA) [63] is a more complex embodied QA task, which studies questions that have multiple targets in them, e.g. "Is the apple in the bedroom bigger than the orange in the living
room?", such that the agent has to navigate to the "bedroom" and the "living room" to localize the "apple" and the "orange" and then perform comparisons to answer the questions.

Interactive Question Answering (IQA) [64] is another work tackling the task of embodied QA, in the AI2-THOR environment. IQA is an extension of EQA because it is essential for the agent to interact with the objects to answer certain questions successfully (e.g. the agent needs to open the refrigerator to answer the existence question "Is there an egg in the fridge?"). [64] proposed using a Hierarchical Interactive Memory Network (HIMN), which is a hierarchy of controllers that helps the system operate, learn and reason across multiple time scales, while simultaneously reducing the complexity of each sub-task. An Egocentric Spatial Gated Recurrent Unit (GRU) acts as a memory unit for retaining spatial and semantic information about the environment. The planner module has control over the other modules, such as a navigator, which runs an A* search to find the shortest path to the goal; a scanner, which performs rotation for detecting new images; a manipulator, which is invoked to carry out actions that change the state of the environment; and lastly an answerer, which answers the question posed to the AI agent. [65] studied IQA from a multi-agent perspective, where several agents explore an interactive scene jointly to answer a question. [65] proposed multi-layer structural and semantic memories as scene memories to be shared by multiple agents to first reconstruct the 3D scenes and then
2) Evaluation Metrics: Embodied QA and IQA involve two sub-tasks: 1) navigation, and 2) question answering, and these two sub-tasks are evaluated based on different metrics.

Navigation performance is evaluated by: (1) distance to target at navigation termination, i.e. navigation error (dT); (2) change in distance to target from initial to final position, i.e. goal progress (dΔ); (3) smallest distance to target at any point in the episode (dmin); (4) percentage of episodes in which the agent terminates navigation for answering before reaching the maximum episode length (%stop); (5) percentage of questions where the agent terminates in the room containing the target object (%rT); (6) percentage of questions where the agent enters the room containing the target object at least once (%re); (7) Intersection over Union for the target object (IoU); (8) hit accuracy based on IoU (hT); (9) episode length, i.e. trajectory length. Metrics (1), (2) and (9) are also used as evaluation metrics for the visual navigation task.

QA performance is evaluated by: (1) mean rank (MR) of the ground-truth answer in predictions; (2) accuracy.
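As a concrete point of reference, the sketch below computes several of these metrics from per-step geodesic distances to the target and from predicted answer rankings. The function names and the assumption that per-step distances are already available are ours, not part of any benchmark code.

```python
from typing import List, Sequence

def navigation_metrics(dists_to_target: Sequence[float],
                       stopped_by_agent: bool) -> dict:
    """dists_to_target[i] is the geodesic distance to the target after step i,
    with index 0 being the starting position."""
    return {
        "d_T": dists_to_target[-1],                           # navigation error at termination
        "d_delta": dists_to_target[0] - dists_to_target[-1],  # goal progress
        "d_min": min(dists_to_target),                        # closest approach to the target
        "stop": stopped_by_agent,                             # aggregate over episodes for %stop
        "episode_length": len(dists_to_target) - 1,           # trajectory length in steps
    }

def qa_metrics(ranked_answers: List[List[str]], gt_answers: List[str]) -> dict:
    """ranked_answers[i] is the model's candidate answers for question i, best first."""
    ranks = [preds.index(gt) + 1 for preds, gt in zip(ranked_answers, gt_answers)]
    return {
        "mean_rank": sum(ranks) / len(ranks),
        "accuracy": sum(r == 1 for r in ranks) / len(ranks),
    }

# Example: an episode that starts 5 m away, ends 1 m away after 3 steps, and stops itself.
print(navigation_metrics([5.0, 3.2, 1.8, 1.0], stopped_by_agent=True))
print(qa_metrics([["red", "blue"], ["sofa", "table"]], ["red", "table"]))
```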
3) Datasets: The EQA [61] dataset is based on House3D, a subset of the popular SUNCG [33] dataset with synthesized rooms and layouts, similar to the Replica dataset [107]. House3D converts SUNCG's static environment into a virtual environment, where the agent can navigate with physical constraints (e.g. it cannot pass through walls or objects). To test the agent's capabilities in language grounding, commonsense reasoning and navigation, [61] uses a series of functional programs in CLEVR [108] to synthesize questions and answers regarding objects and their properties (e.g. color, existence, location and relative preposition). In total, there are 5,000 questions in 750 environments with reference to 45 unique objects in 7 unique room types.
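For illustration, a functional-program-style question generator of this kind can be sketched as follows; the templates, program names and the example scene record are invented for exposition and are not drawn from the EQA dataset itself.

```python
# Minimal sketch of CLEVR-style templated question synthesis over a ground-truth scene record.
# Each template pairs a surface form with a tiny "functional program" that computes the answer.

scene = {  # hypothetical ground-truth annotation for one environment
    "objects": [
        {"name": "apple", "room": "kitchen", "color": "red"},
        {"name": "vase", "room": "living room", "color": "blue"},
    ]
}

def query_color(scene, obj):
    matches = [o for o in scene["objects"] if o["name"] == obj]
    return matches[0]["color"] if matches else None

def query_location(scene, obj):
    matches = [o for o in scene["objects"] if o["name"] == obj]
    return matches[0]["room"] if matches else None

def query_exist(scene, obj, room):
    return any(o["name"] == obj and o["room"] == room for o in scene["objects"])

TEMPLATES = [
    ("What color is the {obj}?", lambda s, obj, room: query_color(s, obj)),
    ("What room is the {obj} located in?", lambda s, obj, room: query_location(s, obj)),
    ("Is there a {obj} in the {room}?", lambda s, obj, room: "yes" if query_exist(s, obj, room) else "no"),
]

def synthesize(scene, obj, room):
    """Expand every template for one (object, room) pair, skipping unanswerable ones."""
    qa_pairs = []
    for surface, program in TEMPLATES:
        answer = program(scene, obj, room)
        if answer is not None:
            qa_pairs.append((surface.format(obj=obj, room=room), answer))
    return qa_pairs

print(synthesize(scene, "apple", "kitchen"))
```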
For MT-EQA [63], the authors introduce the MT-EQA dataset, which contains 6 types of compositional questions that compare object attribute properties (color, size, distance) between multiple targets (objects/rooms).

For IQA [64], the authors annotated a large-scale dataset, IQUAD V1, which consists of 75,000 multiple-choice questions. Similar to the EQA dataset, IQUAD V1 has questions regarding object existence, counting and spatial relationships.

IV. INSIGHTS AND CHALLENGES

A. Insights Into Embodied AI

The interconnections in Fig. 6 reflect the suitability of simulators to research tasks.
Based on Fig. 6, both Habitat-Sim and iGibson support research tasks in visual exploration and a range of visual navigation tasks, indicating the importance of high fidelity, which comes from world-based scene simulators. However, some simulators do not presently connect to any of the embodied AI research tasks because of distinct features that make them preferable for standalone non-embodied-AI tasks such as deep reinforcement learning. Nonetheless, they still meet the criteria for being classified as embodied AI simulators.

In contrast, research tasks such as embodied question answering and visual navigation with priors require the embodied AI simulators to have multiple-state object properties, due to the interactive nature of these tasks. Hence, AI2-THOR is undoubtedly the simulator of choice. Lastly, VLN is the only research task that currently does not utilize any of the nine embodied AI simulators, but instead uses the Matterport3D Simulator [104]. This is because previous work in VLN does not require interactivity in its simulator; hence the Matterport3D Simulator suffices. However, as VLN tasks advance, we can expect a need for interaction in VLN, and hence a need to use embodied AI simulators. Furthermore, unlike traditional reinforcement learning simulation environments [41], [109], which focus on task-specific training, embodied AI simulators provide a training environment for a wide range of different tasks akin to those undertaken in the physical world.
In addition, based on the survey of the embodied AI research tasks in Section III, we propose a pyramid structure in which each embodied AI research task contributes to the next. Visual exploration, for example, aids in the development of visual navigation, and visual navigation contributes to the creation of embodied QA. This build-up approach also correlates with the increasing complexity of the tasks. Based on the foreseeable trends in embodied AI research, we hypothesize that the next advancement in the pyramid of embodied AI research is Task-based Interactive Question Answering (TIQA), which aims to integrate tasks with answering specific questions. For example, such questions can be “How long would it take for an egg to boil? Is there an apple in the cabinet?”. These are questions that cannot be answered through the conventional approaches [61], [64]. They require the embodied agent to perform specific tasks related to the questions to unlock new insights that are essential for answering them. The hypothesized TIQA agents can perform an array of general household tasks, which allows them to extract useful environmental information that is crucial in deriving the answers to the QA questions. TIQA may hold the key to generalizing task-planning and developing general-purpose AI in simulations which can later be deployed into the real world.

B. Challenges in Embodied AI Simulators

Current embodied AI simulators have reached a level of functionality and fidelity that sets them apart from the conventional simulations used for reinforcement learning.
Even with this growing variety of embodied AI simulators, there are several existing challenges in areas ranging from realism and scalability to interactivity.

1) Realism: Realism concerns the fidelity and physics features of the simulators. Simulators with both high visual fidelity and realistic physics are highly sought after by the robotics communities as they provide the ideal test-bed for various robotic tasks such as navigation and interaction tasks [110], [111]. However, there is a lack of embodied AI simulators that possess both world-based scenes and advanced physics.

For fidelity, world-based scene simulators will undoubtedly outperform game-based scene simulators in simulation-to-real tasks [27], [112]. Despite this observation, only Habitat-Sim [17] and iGibson [18] are world-based scene simulators. This paucity of world-based scene simulators is the bottleneck to simulation-to-real tasks for embodied AI agents, which further hinders the transferability of embodied AI research into real-world deployment. For physics, the furtherance of physics-based predictive models [113]–[115] has accentuated the importance of embodied AI simulators with advanced physics features, as they serve to provide an ideal testbed for training embodied AI agents to perform tasks with sophisticated physical interactions [2], [3], [116]. Despite the need for an advanced physics-based embodied AI simulator, there is currently only one simulator, ThreeDWorld [20], that fits this criterion. Hence, there is a severe lack of embodied AI simulators with advanced physics features such as cloth, fluid and soft-body physics. We believe that advances in 3D reconstruction techniques and physics engines [117]–[119] will improve the realism of embodied AI.
2) Scalability: Unlike image-based datasets [7], [120], which can be easily obtained through crowd-sourcing or the internet, the methodologies and tools for collecting large-scale world-based 3D scene datasets and 3D object assets [107], [121], [122] are scarce. These 3D scene datasets are crucial for the construction of a diverse range of embodied AI simulators. Current approaches to collecting realistic 3D scene datasets require scanning the physical room through photogrammetry [123], using tools such as the Matterport 3D scanner, Meshroom [124], or even mobile 3D scanning applications. However, these are not commercially viable for collecting large-scale 3D object and scene scans, largely because the 3D scanners used for photogrammetry are costly and not widely accessible. As such, the bottleneck to scalability lies in developing tools for large-scale collection of high-fidelity 3D object or scene scans. Hopefully, with the further advancement of 3D learning-based approaches [125], [126] that aim to render 3D object meshes from a single image or a few images, or even through scene generation approaches [127], we will be able to scale up the collection of large-scale 3D datasets.
3) Interactivity: The ability to have fine-grained manipulative interactions with functional objects in the embodied AI simulators is crucial in replicating human-level interactions with real-world objects [128]. Most game-based scene simulators [13], [16], [19], [20] provide both fine-grained object manipulation capabilities and symbolic interaction capabilities (e.g. actions invoked through a simple point-and-select). However, due to the nature of game-based scene simulators, many research tasks performed in these environments opt for the symbolic interaction capabilities rather than fine-grained object manipulation [3], except for a few that utilize both [2], [129]. On the other end, the agents in world-based scene simulators [17], [18] possess the ability for gross motor control instead of symbolic interaction capabilities. However, the objects within these simulators are largely interactable only at the surface level, which allows for gross motor control but lacks multi-state object classes, i.e. objects that can undergo multiple state changes. Hence, there is a need to strike a balance between the object functionality in the object properties and the complexity of actions that the embodied AI agent can perform in the environment.

Undoubtedly, mainstream simulators such as AI2-THOR [13], iGibson [18], and Habitat-Sim [17] do provide an excellent environment for advancing the respective embodied AI research. However, they have their strengths and limitations to be overcome. With developments in computer graphics and computer vision, and the introduction of innovative real-world datasets, real-to-sim domain adaptation is one of the clear routes for improving embodied AI simulators. The concept of real-to-sim revolves around capturing real-world information such as tactile perception [130], human-level motor control [131] and audio inputs [132], in addition to visual sensory inputs, and integrating them for the development of more realistic embodied AI simulators that can effectively bridge the physical and virtual worlds.

C. Challenges in Embodied AI Research
Embodied AI research tasks mark an increase in complexity from “internet AI” to autonomous embodied learning agents in 3D simulated environments with multiple sensor modalities and potentially long trajectories [22], [34]. This has led to the memory and internal representations of the agent becoming extremely important [11], [22], [56]. Long trajectories and multiple input types also signify the importance of a robust memory architecture which allows the agent to focus on the important parts of its environment. In recent years, many different types of memory have been used, such as recurrent neural networks [47], [49], [51], [56], [58], [61]–[63], attention-based memory architectures [52], [60], [72], anticipated occupancy maps [45], occupancy maps [22] and semantic maps [43], [46], [48], [64], [65], with some papers placing overwhelming emphasis on the novelty of their memory architectures [22], [45], [60], [72]. However, while recurrent neural networks are known to be limited in capturing long-term dependencies in embodied AI [56], [72], it is currently still hard to agree which memory type(s) are better [11] due to the lack of work focusing on memory architectures.
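As a point of reference for the recurrent option, the snippet below is a minimal, generic sketch of a GRU-based memory that accumulates egocentric observation features over a trajectory. The network sizes and the fusion of observation and previous-action embeddings are illustrative choices rather than any specific architecture from the surveyed works.

```python
import torch
import torch.nn as nn

class RecurrentAgentMemory(nn.Module):
    """Keeps a hidden state that summarizes the trajectory observed so far."""
    def __init__(self, obs_dim=512, act_dim=4, hidden_dim=256, num_actions=4):
        super().__init__()
        self.obs_proj = nn.Linear(obs_dim, hidden_dim)        # project visual features
        self.act_embed = nn.Embedding(num_actions, act_dim)   # embed the previous action
        self.gru = nn.GRUCell(hidden_dim + act_dim, hidden_dim)
        self.policy_head = nn.Linear(hidden_dim, num_actions)

    def initial_state(self, batch_size=1):
        return torch.zeros(batch_size, self.gru.hidden_size)

    def forward(self, obs_feat, prev_action, hidden):
        x = torch.cat([torch.relu(self.obs_proj(obs_feat)),
                       self.act_embed(prev_action)], dim=-1)
        hidden = self.gru(x, hidden)           # memory update for this step
        logits = self.policy_head(hidden)      # action scores conditioned on the memory
        return logits, hidden

# Rolling the memory over a short dummy trajectory of pre-extracted features.
memory = RecurrentAgentMemory()
h = memory.initial_state(batch_size=1)
prev_a = torch.zeros(1, dtype=torch.long)
for _ in range(5):
    obs = torch.randn(1, 512)                  # stand-in for a CNN feature of the egocentric view
    logits, h = memory(obs, prev_a, h)
    prev_a = logits.argmax(dim=-1)
```

Attention-based alternatives would replace the single hidden vector with a growing set of stored observation embeddings that the policy attends over at each step.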
Among embodied AI research tasks, there has also been an increase in complexity, as seen in the progression from visual exploration to VLN and embodied QA, where new components like language understanding and QA are added respectively. Each new component leads to exponentially harder and longer training of AI agents, especially since current approaches are often fully learning-based. This phenomenon has led to two promising advancements to reduce the search space and sample complexity while improving robustness: hybrid approaches combining classic and learning-based algorithms [44], [74], and prior knowledge incorporation [23], [57]. Furthermore, ablation studies are much harder to manage [31] for more complex tasks, as each new component in embodied AI makes it much harder to test for its contribution to the agent's performance, since it is added onto an existing set of components, and embodied AI simulators vary significantly in features and issues. This is compounded by the fact that research tasks have also increased in number rapidly. As a result, while some fundamental tasks like visual exploration have received more attention and thus have more approaches tackling them, the newer and more niche tasks like MT-EQA are much less addressed. New tasks usually introduce new considerations in important aspects like methods, evaluation metrics [22], input types and model components, as shown in Table III, thus requiring even more evaluation than simpler tasks like visual exploration.

Lastly, there is a lack of focus on multi-agent set-ups, which contribute useful new tasks [65]. This lack of focus can be attributed to the lack of simulators with multi-agent features until recently. Multi-agent systems for collaboration and communication are prevalent in the real world [133], [134] but currently receive relatively little attention [31]. With a recent increase in simulators with multi-agent features [13], [20], [55], it remains to be seen whether the multi-agent support (e.g. support for multi-agent algorithms) is sufficient.
V. CONCLUSION

Recent advances in embodied AI simulators have been a key driver of progress in embodied AI research. Aiming to understand the trends and gaps in embodied AI simulators and research, this paper provides a contemporary and comprehensive overview of embodied AI simulators and research. The paper surveys nine embodied AI simulators and their connections in serving and driving recent innovations in research tasks for embodied AI. By benchmarking nine embodied AI simulators in terms of seven features, we seek to understand their provision of realism, scalability and interactivity, and hence their use in embodied AI research. The three main tasks supporting the pyramid of embodied AI research, namely visual exploration, visual navigation and embodied QA, are examined in terms of their approaches, evaluation metrics, and datasets. This is to review and benchmark the existing approaches in tackling these categories of embodied AI research tasks in the various embodied AI simulators. Furthermore, this paper allows us to unveil insightful relations between the simulators, datasets, and research tasks. With the aid of this paper, AI researchers new to this field would be able to select the most suitable embodied AI simulators for their research tasks and contribute back to advancing the field of embodied AI.
REFERENCES

[1] L. Smith and M. Gasser, “The development of embodied cognition: Six lessons from babies,” Artif. Life, vol. 11, no. 1/2, pp. 13–29, 2005.
[2] J. Duan, S. Yu, H. L. Tan, and C. Tan, “ActioNet: An interactive end-to-end platform for task-based data collection and augmentation in 3D environment,” in Proc. IEEE Int. Conf. Image Process., 2020, pp. 1566–1570.
[3] M. Shridhar et al., “ALFRED: A benchmark for interpreting grounded instructions for everyday tasks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10737–10746.
[4] R. Pfeifer and F. Iida, “Embodied artificial intelligence: Trends and challenges,” in Embodied Artificial Intelligence. Berlin, Heidelberg: Springer, 2004, pp. 1–26.
[5] J. Haugeland, Artificial Intelligence: The Very Idea. Cambridge, MA, USA: MIT Press, 1985.
[6] R. Pfeifer and J. C. Bongard, How the Body Shapes the Way We Think - A New View on Intelligence. MIT Press, 2006.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255.
[8] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[10] D. Silver et al., “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, 2017.
[11] P. Anderson et al., “On evaluation of embodied navigation agents,” 2018, arXiv:1807.06757.
[12] C. Beattie et al., “DeepMind Lab,” 2016, arXiv:1612.03801.
[13] E. Kolve et al., “AI2-THOR: An interactive 3D environment for visual AI,” 2017, arXiv:1712.05474.
[14] C. Yan, D. Misra, A. Bennett, A. Walsman, Y. Bisk, and Y. Artzi, “CHALET: Cornell house agent learning environment,” 2018, arXiv:1801.07357.
[15] X. Puig et al., “VirtualHome: Simulating household activities via programs,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8494–8502.
[16] X. Gao, R. Gong, T. Shu, X. Xie, S. Wang, and S.-C. Zhu, “VRKitchen: An interactive 3D virtual environment for task-oriented learning,” 2019, arXiv:1903.05757.
[17] M. Savva et al., “Habitat: A platform for embodied AI research,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 9339–9347.
[18] F. Xia et al., “Interactive Gibson benchmark: A benchmark for interactive navigation in cluttered environments,” IEEE Robot. Automat. Lett., vol. 5, no. 2, pp. 713–720, Apr. 2020.
[19] F. Xiang et al., “SAPIEN: A simulated part-based interactive environment,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 11097–11107.
[20] C. Gan et al., “ThreeDWorld: A platform for interactive multi-modal physical simulation,” 2020, arXiv:2007.04954.
[21] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade learning environment: An evaluation platform for general agents,” J. Artif. Intell. Res., vol. 47, pp. 253–279, 2013.
[22] S. K. Ramakrishnan, D. Jayaraman, and K. Grauman, “An exploration of embodied visual exploration,” Int. J. Comput. Vis., vol. 129, no. 5, pp. 1616–1649, 2021. [Online]. Available: https://doi.org/10.1007/s11263-021-01437-z
[23] X. Ye and Y. Yang, “From seeing to moving: A survey on learning for visual indoor navigation (VIN),” 2020, arXiv:2002.11310.
[24] T. Chen, S. Gupta, and A. Gupta, “Learning exploration policies for navigation,” 2019, arXiv:1903.01959.
[25] N. Savinov, A. Dosovitskiy, and V. Koltun, “Semi-parametric topological memory for navigation,” in Proc. Int. Conf. Learn. Representations, 2018, arXiv:1803.00653.
[26] E. Beeching, J. Dibangoye, O. Simonin, and C. Wolf, “Learning to plan with uncertain topological maps,” 2020, arXiv:2007.05270.
[27] A. Kadian et al., “Sim2Real predictivity: Does evaluation in simulation predict real-world performance?,” IEEE Robot. Automat. Lett., vol. 5, no. 4, pp. 6670–6677, Oct. 2020.
[28] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-real transfer of robotic control with dynamics randomization,” in Proc. IEEE Int. Conf. Robot. Automat., 2018, pp. 1–8.
[29] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2017, pp. 23–30.
[30] CVPR, “Embodied AI workshop,” Jun. 2021, Accessed: 13 Jan. 2022. [Online]. Available: https://embodied-ai.org/
[31] L. Weihs et al., “AllenAct: A framework for embodied AI research,” 2020, arXiv:2008.12760.
[32] K. Mo et al., “PartNet: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 909–918.
[33] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser, “Semantic scene completion from a single depth image,” in Proc. 30th IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 190–198.
[34] A. Chang et al., “Matterport3D: Learning from RGB-D data in indoor environments,” in Proc. Int. Conf. 3D Vis., 2017, pp. 667–676.
[35] F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese, “Gibson env: Real-world perception for embodied agents,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 9068–9079.
[36] A. Kadian et al., “Are we making real progress in simulated environments? Measuring the sim2real gap in embodied visual navigation,” 2019, arXiv:1912.06321.
[37] M. Deitke et al., “RoboTHOR: An open simulation-to-real embodied AI platform,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 3164–3174.
[38] D. Batra et al., “ObjectNav revisited: On evaluation of embodied agents navigating to objects,” 2020, arXiv:2006.13171.
[39] U. Jain et al., “A cordial sync: Going beyond marginal policies for multi-agent embodied tasks,” in European Conference on Computer Vision, Cham, Switzerland: Springer, 2020, pp. 471–490.
[40] U. Jain et al., “Two body problem: Collaborative visual task completion,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 6682–6692.
[41] G. Brockman et al., “OpenAI gym,” 2016, arXiv:1606.01540.
[42] U. Jain et al., “Two body problem: Collaborative visual task completion,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 6682–6692.
[43] D. S. Chaplot, H. Jiang, S. Gupta, and A. Gupta, “Semantic curiosity for active visual learning,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 309–326.
[44] D. S. Chaplot, D. Gandhi, S. Gupta, A. Gupta, and R. Salakhutdinov, “Learning to explore using active neural SLAM,” in Proc. Int. Conf. Learn. Representations, 2020.
[45] S. K. Ramakrishnan, Z. Al-Halah, and K. Grauman, “Occupancy anticipation for efficient exploration and navigation,” 2020, arXiv:2008.09285.
[46] M. Narasimhan et al., “Seeing the un-scene: Learning amodal semantic maps for room navigation,” in Proc. Eur. Conf. Comput. Vis., Cham, Switzerland, Springer, 2020, pp. 513–529.
[47] E. Wijmans et al., “DD-PPO: Learning near-perfect pointgoal navigators from 2.5 billion frames,” 2019, arXiv:1911.00357.
[48] G. Georgakis, Y. Li, and J. Kosecka, “Simultaneous mapping and target driven navigation,” 2019, arXiv:1911.07980.
[49] J. Ye, D. Batra, E. Wijmans, and A. Das, “Auxiliary tasks speed up learning pointgoal navigation,” 2020, arXiv:2007.04561.
[50] C. Pérez-D'Arpino, C. Liu, P. Goebel, R. Martín-Martín, and S. Savarese, “Robot navigation in constrained pedestrian environments using reinforcement learning,” 2020, arXiv:2010.08600.
[51] M. Wortsman, K. Ehsani, M. Rastegari, A. Farhadi, and R. Mottaghi, “Learning to learn how to learn: Self-adaptive visual navigation using meta-learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 6743–6752.
[52] T. Campari, P. Eccher, L. Serafini, and L. Ballan, “Exploiting scene-specific features for object goal navigation,” in Proc. Eur. Conf. Comput. Vis. Workshops, 2020, pp. 406–421.
[53] H. Du, X. Yu, and L. Zheng, “Learning object relation graph and tentative policy for visual navigation,” in Proc. Eur. Conf. Comput. Vis., Cham, Switzerland, Springer, 2020, pp. 19–34.
[54] D. S. Chaplot, D. Gandhi, A. Gupta, and R. Salakhutdinov, “Object goal navigation using goal-oriented semantic exploration,” 2020, arXiv:2007.00643.
[55] B. Shen et al., “iGibson, A simulation environment for interactive tasks in large realistic scenes,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2021.
[56] A. Wahid, A. Stone, K. Chen, B. Ichter, and A. Toshev, “Learning object-conditioned exploration using distributed soft actor critic,” 2020, arXiv:2007.14545.
[57] W. Yang, X. Wang, A. Farhadi, A. Gupta, and R. Mottaghi, “Visual semantic navigation using scene priors,” 2018, arXiv:1810.06543.
[58] P. Anderson et al., “Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3674–3683.
[59] F. Zhu, Y. Zhu, X. Chang, and X. Liang, “Vision-language navigation with self-supervised auxiliary reasoning tasks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10009–10019.
[60] Y. Zhu et al., “Vision-dialog navigation by exploring cross-modal memory,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10727–10736.
[61] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra, “Embodied question answering,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2018, pp. 2054–2063.
[62] A. Das, G. Gkioxari, S. Lee, D. Parikh, and D. Batra, “Neural modular control for embodied question answering,” in Proc. Conf. Robot Learn., 2018, pp. 53–62.
[63] L. Yu, X. Chen, G. Gkioxari, M. Bansal, T. L. Berg, and D. Batra, “Multi-target embodied question answering,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Long Beach, CA, USA, 2019, pp. 6302–6311.
[64] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi, “IQA: Visual question answering in interactive environments,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4089–4098.
[65] S. Tan, W. Xiang, H. Liu, D. Guo, and F. Sun, “Multi-agent embodied question answering in interactive environments,” in Proc. Comput. Vis. ECCV: 16th Eur. Conf., Glasgow, UK, 2020, pp. 663–678.
[66] P. D. Nguyen, Y. K. Georgie, E. Kayhan, M. Eppe, V. V. Hafner, and S. Wermter, “Sensorimotor representation learning for an “active self” in robots: A model survey,” KI - Künstliche Intelligenz, vol. 35, no. 1, pp. 9–35, 2021.
[67] C. Chen et al., “SoundSpaces: Audio-visual navigation in 3D environments,” in ECCV, Glasgow, U.K., Springer, 2020, pp. 17–36.
[68] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2017, pp. 16–17.
[69] S. Gupta, D. Fouhey, S. Levine, and J. Malik, “Unifying map and landmark based representations for visual navigation,” 2017, arXiv:1712.08125.
[70] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik, “Cognitive mapping and planning for visual navigation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 7272–7281.
[71] J. F. Henriques and A. Vedaldi, “MapNet: An allocentric spatial memory for mapping environments,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8476–8484.
[72] K. Fang, A. Toshev, L. Fei-Fei, and S. Savarese, “Scene memory transformer for embodied agents in long-horizon tasks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 538–547.
[73] L. Mezghani, S. Sukhbaatar, A. Szlam, A. Joulin, and P. Bojanowski, “Learning to visually navigate in photorealistic environments without any supervision,” 2020, arXiv:2004.04954.
[74] D. Mishkin, A. Dosovitskiy, and V. Koltun, “Benchmarking classic and learned navigation in complex 3D environments,” 2019, arXiv:1901.10915.
[75] C. Cadena et al., “Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age,” IEEE Trans. Robot., vol. 32, no. 6, pp. 1309–1332, Dec. 2016.
[76] S. K. Ramakrishnan, D. Jayaraman, and K. Grauman, “Emergence of exploratory look-around behaviors through active observation completion,” Sci. Robot., vol. 4, no. 30, 2019.
[77] W. S. Lovejoy, “A survey of algorithmic methods for partially observed Markov decision processes,” Ann. Oper. Res., vol. 28, no. 1, pp. 47–65, 1991.
[78] B. Yamauchi, “A frontier-based approach for autonomous exploration,” in Proc. IEEE Int. Symp. Comput. Intell. Robot. Automat., 1997, pp. 146–151.
[79] Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A. Efros, “Large-scale study of curiosity-driven learning,” 2018, arXiv:1808.04355.
[80] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. De Turck, and P. Abbeel, “VIME: Variational information maximizing exploration,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 1109–1117.
[81] D. Pathak, D. Gandhi, and A. Gupta, “Self-supervised exploration via disagreement,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 5062–5071.
[82] Y. Burda, H. Edwards, A. Storkey, and O. Klimov, “Exploration by random network distillation,” 2018, arXiv:1810.12894.
[83] A. Vaswani et al., “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[84] S. K. Ramakrishnan and K. Grauman, “Sidekick policy learning for active visual exploration,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 413–430.
[85] S. Song, A. Zeng, A. X. Chang, M. Savva, S. Savarese, and T. Funkhouser, “Im2Pano3D: Extrapolating 360° structure and semantics beyond the field of view,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3847–3856.
[86] N. Savinov et al., “Episodic curiosity through reachability,” 2018, arXiv:1810.02274.
[87] N. Haber, D. Mrowca, S. Wang, L. F. Fei-Fei, and D. L. Yamins, “Learning to play with intrinsically-motivated, self-aware agents,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 8388–8399.
[88] Y. Zhu et al., “Target-driven visual navigation in indoor scenes using deep reinforcement learning,” in Proc. IEEE Int. Conf. Robot. Automat., 2017, pp. 3357–3364.
[89] D. S. Chaplot, R. Salakhutdinov, A. Gupta, and S. Gupta, “Neural topological SLAM for visual navigation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 12872–12881.
[90] F. Bonin-Font, A. Ortiz, and G. Oliver, “Visual navigation for mobile robots: A survey,” J. Intell. Robot. Syst., vol. 53, no. 3, pp. 263–296, 2008.
[91] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendón-Mancha, “Visual simultaneous localization and mapping: A survey,” Artif. Intell. Rev., vol. 43, no. 1, pp. 55–81, 2015.
[92] L. E. Kavraki, P. Svestka, J.-C. Latombe, and M. H. Overmars, “Probabilistic roadmaps for path planning in high-dimensional configuration spaces,” IEEE Trans. Robot. Automat., vol. 12, no. 4, pp. 566–580, Aug. 1996.
[93] S. M. LaValle and J. J. Kuffner, “Rapidly-exploring random trees: Progress and prospects,” Algorithmic Comput. Robot., New Directions, vol. 5, pp. 293–308, 2001.
[94] A. Dosovitskiy and V. Koltun, “Learning to act by predicting the future,” 2016, arXiv:1611.01779.
[95] P. Dayan, “Improving generalization for temporal difference learning: The successor representation,” Neural Comput., vol. 5, no. 4, pp. 613–624, 1993.
[96] Y. Zhu et al., “Visual semantic planning using deep successor representations,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 483–492.
[97] A. Barreto et al., “Successor features for transfer in reinforcement learning,” Adv. Neural Inf. Process. Syst., vol. 30, pp. 4055–4065, 2017.
[98] D. Gordon, A. Kadian, D. Parikh, J. Hoffman, and D. Batra, “SplitNet: Sim2Sim and task2Task transfer for embodied visual navigation,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 1022–1031.
[99] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” 2017, arXiv:1707.06347.
[100] A. Mousavian, A. Toshev, M. Fišer, J. Košecká, A. Wahid, and J. Davidson, “Visual representations for semantic target driven navigation,” in Proc. Int. Conf. Robot. Automat., 2019, pp. 8846–8852.
[101] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” 2015, arXiv:1506.02438.
[102] Z. D. Guo, M. G. Azar, B. Piot, B. A. Pires, and R. Munos, “Neural predictive belief representations,” 2018, arXiv:1811.06407.
[103] C. Gan, Y. Zhang, J. Wu, B. Gong, and J. B. Tenenbaum, “Look, listen, and act: Towards audio-visual embodied navigation,” in Proc. IEEE Int. Conf. Robot. Automat., 2020, pp. 9701–9707.
[104] P. Anderson et al., “Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3674–3683.
[105] J. Thomason, M. Murray, M. Cakmak, and L. Zettlemoyer, “Vision-and-dialog navigation,” in Proc. Conf. Robot Learn., 2019, pp. 394–406.
[106] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Mach. Learn., vol. 8, no. 3/4, pp. 229–256, 1992.
[107] J. Straub et al., “The Replica dataset: A digital replica of indoor spaces,” 2019, arXiv:1906.05797.
[108] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick, “CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2901–2910.
[109] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2012, pp. 5026–5033.
[110] C. Gao, J. Chen, S. Liu, L. Wang, Q. Zhang, and Q. Wu, “Room-and-object aware knowledge reasoning for remote embodied referring expression,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 3064–3073.
[111] J. Sun, D.-A. Huang, B. Lu, Y.-H. Liu, B. Zhou, and A. Garg, “PlaTe: Visually-grounded planning with transformers in procedural tasks,” 2021, arXiv:2109.04869.
[112] F. Sadeghi and S. Levine, “CAD2RL: Real single-image flight without a single real image,” Robot.: Sci. Syst. XIII, Massachusetts Inst. Technol., Cambridge, MA, USA, Jul. 2017.
[113] D. M. Bear et al., “Physion: Evaluating physical prediction from vision in humans and machines,” 2021, arXiv:2106.08261.
[114] J. Duan, S. Y. B. Jian, and C. Tan, “SPACE: A simulator for physical interactions and causal learning in 3D environments,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 2058–2063.
[115] J. Duan, S. Yu, S. Poria, B. Wen, and C. Tan, “PIP: Physical interaction prediction via mental imagery with span selection,” 2021, arXiv:2109.04683.
[116] T. Nagarajan and K. Grauman, “Learning affordance landscapes for interaction exploration in 3D environments,” Adv. Neural Inf. Process. Syst., vol. 33, 2020.
[117] M. Wang, Y. Deng, X. Kong, A. H. Prasad, S. Xiong, and B. Zhu, “Thin-film smoothed particle hydrodynamics fluid,” 2021, arXiv:2105.07656.
[118] A. Kuznetsov, K. Mullia, Z. Xu, M. Hašan, and R. Ramamoorthi, “NeuMIP: Multi-resolution neural materials,” ACM Trans. Graph., vol. 40, no. 4, pp. 1–13, 2021.
[119] S. R. Richter, H. A. AlHaija, and V. Koltun, “Enhancing photorealism enhancement,” 2021, arXiv:2105.04619.
[120] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in Proc. Eur. Conf. Comput. Vis., Cham, Switzerland, Springer, 2014, pp. 740–755.
[121] S. K. Ramakrishnan et al., “Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI,” 2021, arXiv:2109.08238.
[122] I. Armeni, S. Sax, A. R. Zamir, and S. Savarese, “Joint 2D-3D-semantic data for indoor scene understanding,” 2017, arXiv:1702.01105.
[123] K. Atkinson, “Introduction to modern photogrammetry,” Wiley Online Library, New York, NY, USA, vol. 19, 2003.
[124] AliceVision, “Blender - A 3D modelling and rendering package, AliceVision,” 2018. [Online]. Available: https:/
[125] A. Yu, V. Ye, M. Tancik, and A. Kanazawa, “PixelNeRF: Neural radiance fields from one or few images,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 4578–4587.
[126] R. Martin-Brualla, N. Radwan, M. S. Sajjadi, J. T. Barron, A. Dosovitskiy, and D. Duckworth, “NeRF in the wild: Neural radiance fields for unconstrained photo collections,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 7210–7219.
[127] Z. Hao, A. Mallya, S. Belongie, and M.-Y. Liu, “GANcraft: Unsupervised 3D neural rendering of Minecraft worlds,” 2021, arXiv:2104.07659.
[128] Y. Zhu et al., “Dark, beyond deep: A paradigm shift to cognitive AI with humanlike common sense,” Engineering, vol. 6, no. 3, pp. 310–345, 2020.
[129] M. Lohmann, J. Salvador, A. Kembhavi, and R. Mottaghi, “Learning about objects by learning to interact with them,” Adv. Neural Inf. Process. Syst., vol. 33, 2020.
[130] R. Bhirangi, T. Hellebrekers, C. Majidi, and A. Gupta, “ReSkin: Versatile, replaceable, lasting tactile skins,” in CoRL, 2021.
[131] B. Smith et al., “Constraining dense hand surface tracking with elasticity,” ACM Trans. Graph., vol. 39, no. 6, pp. 1–14, 2020.
[132] C. Chen et al., “Audio-visual embodied navigation,” Environment, vol. 97, p. 103, 2019.
[133] L. Panait and S. Luke, “Cooperative multi-agent learning: The state of the art,” Auton. Agents Multi-Agent Syst., vol. 11, no. 3, pp. 387–434, 2005.
[134] Y.-C. Liu, J. Tian, C.-Y. Ma, N. Glaser, C.-W. Kuo, and Z. Kira, “Who2com: Collaborative perception via learnable handshake communication,” in Proc. IEEE Int. Conf. Robot. Automat., 2020, pp. 6876–6883.

Jiafei Duan received the B.Eng. (Highest Distinction) degree from the School of Electrical and Electronics Engineering, Nanyang Technological University, Singapore, in 2021. He is currently a Research Engineer with the Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore. His current research interests include embodied AI and computational cognitive science.

Samson Yu received the B.Eng. degree in information systems technology and design from the Singapore University of Technology and Design, Singapore, in 2020. He is currently a Research Engineer with the Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore. His research interests include fundamental AI research and embodied AI.

Hui Li Tan received the B.Sc. degree in applied mathematics and the Ph.D. degree in electrical and computer engineering from the National University of Singapore, Singapore, in 2007 and 2017, respectively. Since 2007, she has been with the Institute for Infocomm Research, Singapore. Her current research interests include computer vision, multimodal deep learning, incremental and federated learning.

Hongyuan Zhu received the Ph.D. degree in computer engineering from Nanyang Technological University, Singapore, in 2014. He is currently a Research Scientist with the Institute for Infocomm Research, A*STAR, Singapore. His research interests include multimedia content analysis and segmentation.

Cheston Tan received the B.Sc. (Highest Hons.) degree from the Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA, USA, and the Ph.D. degree from the Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA. He is currently a Senior Scientist with the Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore.