HOW TO SETUP RDMA CI USING THE FSDP CLUSTER
Doug Ledford, Chair, OFA
Red Hat, Inc.
2024 OFA Virtual Workshop

STEPS REQUIRED TO HOOK YOUR TESTS INTO THE CKI INFRA
- Create your tests
- Add your tests to the kpetdb
- Add your pipeline to the ofa pipeline in CKI
- Sit back and wait for results

CREATING YOUR TEST
- Watch the first video on the left to see how to get set up with your account and access the FSDP
- Watch the second and third videos for an introduction to the testing API (everything is based on the Restraint test harness as shipped with Beaker: https://restraint.readthedocs.io/en/latest/)
- Clone the FSDP test repo: https:/ repo clone OpenFabrics/fsdp_tests
- Add your own unique directory for your tests
- You will need one directory per unique test, mainly for selecting the pool specific to how you want to run your tests; you may use a shared directory for the majority of the test itself
- E.g.:
    Redis/common    - Generic copy of the test that can be shared
    Redis/ib-mlx    - Could use either generic or specific pool here as there is only one IB vendor in the cluster
    Redis/roce-mlx  - Need to use a specific pool to limit devices to mlx roce
    Redis/roce-qedr - Need to use a specific pool to limit devices to qedr roce
    Redis/iwarp

    # fsdp_tests/kernel/infiniband/demo/runtest.sh
    # Client branch: the opening condition is written to mirror the server branch below
    if hostname -A | grep "${CLIENTS%.*}" >/dev/null; then
        echo "*client test start*"
        echo "*$TNAME-$RDMA_NETWORK-$RUN_NUMBER*"
        client
        TEST=$TEST/sanity/client
    elif hostname -A | grep "${SERVERS%.*}" >/dev/null; then
        echo "*server test start*"
        echo "*$TNAME-$RDMA_NETWORK-$RUN_NUMBER*"
        server
        TEST=$TEST/sanity/server
    fi
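The role dispatch in runtest.sh keys off whether one of this machine's hostnames matches the CLIENTS or SERVERS variable. A minimal sketch of that decision as a standalone function, so it can be tried outside Beaker. The name `detect_role` and the example hostnames are hypothetical and not part of the demo test; the host list is passed in as an argument instead of being read from `hostname -A`, and each variable is assumed to hold a single hostname.

```shell
#!/bin/bash
# Hypothetical helper: decide whether this host is a client or a server.
# $1: space-separated names this host answers to (what `hostname -A` prints)
# $2: value of CLIENTS, $3: value of SERVERS (one hostname each, by assumption)
detect_role() {
    local names="$1" clients="$2" servers="$3"
    # ${var%.*} strips the last dotted component, mirroring the expansion in runtest.sh
    if [ -n "$clients" ] && echo "$names" | grep -qF -- "${clients%.*}"; then
        echo client
    elif [ -n "$servers" ] && echo "$names" | grep -qF -- "${servers%.*}"; then
        echo server
    else
        echo unknown
        return 1
    fi
}

# Example with made-up hostnames
detect_role "node-01.example.test" "node-01.example.test" "node-02.example.test"
```

Keeping the decision in one function makes it easy to confirm that a host which matches neither list is reported instead of silently running nothing.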
    # fsdp_tests/kernel/infiniband/demo/runtest.sh
    function server {
        # server get ready
        echo "- server is ready - $TNAME-$RDMA_NETWORK-$RUN_NUMBER -"
        rhts_sync_set -s "server-ready_$TNAME-$RDMA_NETWORK-$RUN_NUMBER"
        rhts_sync_block -s "client-ready_$TNAME-$RDMA_NETWORK-$RUN_NUMBER" $CLIENTS
        # do the sanity test
        bash -x ./tier1.sh server
        # Report the result
        echo "- server finishes - $TNAME-$RDMA_NETWORK-$RUN_NUMBER -"
        rhts_sync_block -s "client-done_$TNAME-$RDMA_NETWORK-$RUN_NUMBER" $CLIENTS
    }

COPYING THE DEMO TESTS TO NEW TESTS

    # fsdp_tests/kernel/infiniband/demo/tier1.sh
    function common_tests {
        # ensure appropriate ib/iw/en/opa drivers are available in this kernel
        driver_modules=$(RQA_get_driver_modules)
        for module in $driver_modules; do
            lsmod | grep -i $module
            RQA_check_result -r $? -t "load module $module"
        done

This is the appropriate place to build/install user space packages needed for your test.
Specifically, for the redis packages Jeremy has built:
    wget --no-check-certificate https://builder-00.ofa.iol.unh.edu/jspewock/server.rpm && yum install -y ./server.rpm
and
    wget --no-check-certificate https://builder-00.ofa.iol.unh.edu/jspewock/client.rpm && yum install -y ./client.rpm
These commands should be added to the common section. After the common section, but before the tests are run, we wait for an "all ready" sync up between server and client, so building and installing the code we wish to test during the common section works fine.
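The download-and-install step for the common section could be wrapped in a small function. A sketch, assuming a hypothetical helper name `fetch_and_install` and a `DRY_RUN` switch (neither exists in the demo test) so the commands can be inspected before they touch the system:

```shell
#!/bin/bash
# Hypothetical helper for the common section: download an RPM and install it.
# With DRY_RUN=1 (the default in this sketch) the commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

fetch_and_install() {
    local url="$1"
    local rpm="./${url##*/}"     # local file name = last path component of the URL
    if [ "$DRY_RUN" = "1" ]; then
        echo "wget --no-check-certificate $url"
        echo "yum install -y $rpm"
        return 0
    fi
    # && ensures yum only runs if the download succeeded
    wget --no-check-certificate "$url" && yum install -y "$rpm"
}

fetch_and_install "https://builder-00.ofa.iol.unh.edu/jspewock/server.rpm"
```

Setting `DRY_RUN=0` in the job environment would run the real commands; the chained `&&` keeps a failed download from being followed by an install of a stale or missing file.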
COPYING THE DEMO TESTS TO NEW TESTS

    # fsdp_tests/kernel/infiniband/demo/tier1.sh
    function server_tests {
        # ping client. If it fails, the following multi-host tests would fail.
        if [ -z "$CLIENT_IPV4" ]; then

    function client_tests {
        # ping server. If it fails, the following multi-host tests would fail.
        if [ -z "$SERVER_IPV4" ]; then

This is the appropriate place to hook in your test commands.
Specifically, for the redis packages Jeremy has built:
Server test command:
    redis-server --port 6379 --loadmodule /etc/redis/redis-rdma.so port=6379 bind=XX.XX.XX.XX --loglevel verbose --protected-mode no --server_cpulist 2 --bio_cpulist 3 --aof_rewrite_cpulist 3 --bgsave_cpulist 3 --appendonly no
Client test command:
    redis-benchmark -h xx.xx.xx.xx -p 6379 -c 30 -n 10000000 --threads 4 -d 1024 -t ping,get,set --rdma
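In tier1.sh the placeholder addresses in the two Redis commands would typically be filled from the SERVER_IPV4 variable the demo test already uses. A sketch that assembles both command lines; the function names and the example address 192.0.2.10 are illustrative assumptions, and the functions only print the commands without running them:

```shell
#!/bin/bash
# Hypothetical helpers that print the Redis test commands for a given server address.
redis_server_cmd() {
    local ip="$1" port="${2:-6379}"
    echo "redis-server --port $port --loadmodule /etc/redis/redis-rdma.so port=$port bind=$ip" \
         "--loglevel verbose --protected-mode no --server_cpulist 2 --bio_cpulist 3" \
         "--aof_rewrite_cpulist 3 --bgsave_cpulist 3 --appendonly no"
}

redis_client_cmd() {
    local ip="$1" port="${2:-6379}"
    echo "redis-benchmark -h $ip -p $port -c 30 -n 10000000 --threads 4 -d 1024 -t ping,get,set --rdma"
}

SERVER_IPV4="${SERVER_IPV4:-192.0.2.10}"   # documentation address used as a stand-in
redis_server_cmd "$SERVER_IPV4"
redis_client_cmd "$SERVER_IPV4"
```

Printing the command first also gives the test log an exact record of what was run on each fabric.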
ADDING YOUR TEST TO KPETDB
kpet-db repository: https://gitlab.com/redhat/centos-stream/tests/kernel/kpet-db

Creating host_types in kpetdb/hosts_types (if needed)
Example creation of roce-mlx hosts_type file
- Existing host_types:
    ofa-ib.host_requires.xml.j2
    ofa-iwarp.host_requires.xml.j2
    ofa-opa.host_requires.xml.j2
    ofa-roce.host_requires.xml.j2
- These only pull in generic hosts on a fabric. For specific hosts on a specific fabric (e.g. for mlx specific roce), you will need to create a host type that pulls the right hosts in using a beaker pool specifier specific to the hosts you want
- ofa-roce-mlx.host_requires.xml.j2:
- You can see all of the available pool types by checking the pools out on the beaker web interface for the FSDP cluster. If we need to create additional pools to specify specific host hardware, that is easily doable.

If you created a new hosts_types, you also need to link it into the host_types in index.yaml
Example linking of roce-mlx hosts_type file: add the following to index.yaml

    # OFA RoCE tests
    ofa-roce_1:
      domains: ofa
      host_requires: host_types/ofa-roce.host_requires.xml.j2
    ofa-roce_2:
      domains: ofa
      host_requires: host_types/ofa-roce.host_requires.xml.j2
    # OFA RoCE-MLX tests
    ofa-roce-mlx_1:
      domains: ofa
      host_requires: host_types/ofa-roce-mlx.host_requires.xml.j2
    ofa-roce-mlx_2:
      domains: ofa
      host_requires: host_types/ofa-roce-mlx.host_requires.xml.j2

And you need to create a recipeset for the new hosts in index.yaml:

    recipesets:
      multihost:
        ofa_roce:
          - ofa-roce_1
          - ofa-roce_2
        ofa_roce_mlx:
          - ofa-roce-mlx_1
          - ofa-roce-mlx_2

Then create your test entry in the cases directory

    mkdir -p cases/ofa/redis/roce          # Please keep ofa tests under the ofa parent directory
    vi cases/ofa/redis/roce/index.yaml     # Yes, Vi does in fact rule the universe!
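Before committing, it can help to confirm that every host type named in a recipeset is also defined in index.yaml. A rough sketch using sed and grep on a throwaway copy of such entries; the `check_host_types` name and the temporary file are assumptions for illustration, and a real check would more likely use a YAML parser:

```shell
#!/bin/bash
# Hypothetical check: each "- name" list item must also appear as a "name:" key in the file.
check_host_types() {
    local file="$1" missing=0 name
    # Collect list items such as "- ofa-roce-mlx_1"
    for name in $(sed -n 's/^[[:space:]]*-[[:space:]]*\([A-Za-z0-9_-]*\)[[:space:]]*$/\1/p' "$file"); do
        if ! grep -q "^[[:space:]]*${name}:" "$file"; then
            echo "missing host type definition: $name"
            missing=1
        fi
    done
    return $missing
}

tmp="$(mktemp)"
cat > "$tmp" <<'EOF'
ofa-roce-mlx_1:
  domains: ofa
  host_requires: host_types/ofa-roce-mlx.host_requires.xml.j2
ofa-roce-mlx_2:
  domains: ofa
  host_requires: host_types/ofa-roce-mlx.host_requires.xml.j2
recipesets:
  multihost:
    ofa_roce_mlx:
      - ofa-roce-mlx_1
      - ofa-roce-mlx_2
EOF
check_host_types "$tmp" && echo "all recipeset hosts are defined"
rm -f "$tmp"
```

The check only compares names, so it does not depend on how deeply the recipeset is nested.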
Sample cases/ofa/redis/roce/index.yaml

    name: Redis RoCE
    location: redis
    target_sources:
      - drivers/infiniband/core/.*
      - drivers/infiniband/hw/.*
    max_duration_seconds: 7200
    environment:
      ENV_DRIVER: ""
    cases:
      RoCE: cases/ofa/hosts/RoCE/index.yaml
    sets: ofa-roce

Fields
- Name: name of your test; it needs to be unique
- Location: top-level directory in the fsdp_tests repo for your test (this need not be fabric specific)
- Target Sources: kernel directories that, if they change, should trigger retesting (not used on user space repos)
- Max Duration Seconds: how long until a hung test times out
- Environment: random items you need set for your test
- Cases: can point to sub-tests in the kpetdb cases subtree. HOWEVER! All sub-tests need to have the same machine requirements. If you mix a test that requires IB with a test that requires RoCE, the scheduler will be unable to resolve a machine to satisfy the test and the tests will just get skipped
- Sets: upper-level test sets that this test claims to be a part of. Test sets are defined in the main index.yaml file

Link the cases/ofa/redis/roce/index.yaml into the cases/ofa/index.yaml file

    sets:
      - ofa
    cases:
      ofa_kernel_rdma: cases/ofa/kernel-rdma/index.yaml
      ofa_user_rdma: cases/ofa/user-rdma/index.yaml
      ofa_libfabric: cases/ofa/libfabric/index.yaml
      ofa_redis_roce: cases/ofa/redis/roce/index.yaml

Note: The existing kernel_rdma, user_rdma, and libfabric entries combine all the different fabrics under a single grouping and are, as a result, all skipped. This is known. Don't follow their example if you want your tests to actually be run. We will split them out to the individual fabric types when we enable these tests.
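A missing test is sometimes just a path typo in the cases list. As a sketch, the following checks that every index.yaml path referenced from a parent file exists, assuming the paths are relative to the top of the kpet-db checkout as they appear to be in the listing. `check_case_links` is a hypothetical helper, and the temporary tree stands in for a real checkout:

```shell
#!/bin/bash
# Hypothetical check: every "key: path/index.yaml" value in the parent file
# must exist under the repository root.
check_case_links() {
    local root="$1" parent="$2" rc=0 path
    for path in $(sed -n 's/^[[:space:]]*[A-Za-z0-9_-]*:[[:space:]]*\(.*index\.yaml\)[[:space:]]*$/\1/p' "$root/$parent"); do
        if [ ! -f "$root/$path" ]; then
            echo "missing: $path"
            rc=1
        fi
    done
    return $rc
}

root="$(mktemp -d)"
mkdir -p "$root/cases/ofa/redis/roce"
touch "$root/cases/ofa/redis/roce/index.yaml"
cat > "$root/cases/ofa/index.yaml" <<'EOF'
sets:
  - ofa
cases:
  ofa_redis_roce: cases/ofa/redis/roce/index.yaml
EOF
check_case_links "$root" cases/ofa/index.yaml && echo "all linked case files exist"
rm -rf "$root"
```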
ADDING TEST TO OFA PIPELINE
Pipeline data repository: https://gitlab.com/cki-project/pipeline-data

Link new test(s) into the ofa pipelines by adding the following to the ofa.yaml file

    # Keep the fabric name in the test name since you will need a separate test for each fabric you want to run on
    redis-roce:
      # Should point to the upstream rdma kernel tree. This item is used to build the latest kernel for your tests
      git_url: https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
      # This must match OFA followed by the name entry from your kpetdb index.yaml
      tests_regex: OFA Redis RoCE.*
      # This is the kernel to build and test against
      .branches:
        - for-next
      watch_url: https:/
      watch_branch:

WHEN IT'S ALL SAID AND DONE

THE PROCESS, END TO END
- The CKI pipeline will watch your watch_url repo on the watch_branch branch. When new commits are added, it triggers a run.
- The first step in the run is to build the kernel specified by the git_url and .branches options (or reuse pre-built artifacts if they exist)
- When the artifacts are ready, the system pulls up your test in kpetdb to get the necessary test info
- The system then uses that to find your actual test files and pull out any needed beaker xls/xml files
- The system then attempts to process the test info to generate beaker xml files suitable for automated job submission
- The system will use the requirements and restrictions in the hosts_types and other fields to select the right hosts and the right cluster for the job
- The system then attempts to submit the job to the beaker controller
- Once submitted, beaker attempts to run the jobs, which will result in the tests tarball being downloaded and installed
- From the contents of the tarball, your specific runtest.sh script will be run using the top-level directory you specified in the kpetdb entry for your test
- At this point, your test script is in control and can free-form run the tests as you see fit

THANK YOU

How to do Manual RDMA Testing Using the FSDP Cluster
Jeremy Spewock, Lead Developer
UNH InterOperability Lab (IOL)
2024 OFA Virtual Workshop
Overview
- Introduction To The Cluster
  - Brief overview of the hardware available
  - Showcase of pre-configured fabrics
- Tour of Beaker
  - What is Beaker?
  - Why do we use Beaker?
  - How do I reserve nodes using Beaker?
- Redis and RDMA
  - Brief explanation of Redis
  - RDMA support in redis
- Demo

THE CLUSTER

FSDP Cluster Hardware
- Consists of 10 nodes and a build server
- Nodes
  - Bare-metal hosts with RDMA cards
  - Grouped by RDMA fabric
  - Provisioned by Beaker
- Builder-00
  - Static environment for building binaries
  - Persistent home directory for your user
  - public_html for easy transportation of binaries
Image is the physical section of the cluster network image from the FSDP Docs GitHub repository (https:/)

FSDP Cluster Software/Fabrics
- Nodes can be grouped by fabric
  - iWARP
  - InfiniBand
  - RoCE
  - Omni-Path
Image is the software section of the cluster network image from the FSDP Docs GitHub repository (https:/)

BEAKER

Beaker Overview
- What is Beaker?
  - Open-source software for managing automated labs
  - Capable of storing multiple OS distributions to select from when provisioning hosts
  - Maintains inventories of hardware on nodes
  - Able to store a library of tasks for ease of scheduling testing across multiple environments
- Why do we use it?
  - Makes scheduling the same test on multiple fabrics simple
  - Hosts can be grouped into "pools" and jobs scheduled from hosts within a pool
  - Allows us to create pools based on fabric for convenience
  - Allows for testing the difference in functionality between distros
  - XML style for writing jobs is easy to follow and reproduce
  - Ability to specify what are called "snippets" in beaker

Beaker Overview: Primary Testing Methods
- The manual process
  - Use the built-in task called "reserve" in the web GUI
  - Provisions the host for access via SSH
  - Gives you a non-persistent testing environment
  - Able to reach HTTP endpoints on builder-00
- The automated process
  - Write XML jobs using beaker's format
  - Can create custom tasks and package them into RPMs
  - Allows for easy re-use and sharing of tasks
  - Host is provisioned for the task, the task is run, and then the host is immediately freed*
Images illustrate where to access both the manual and automated processes for tasks, taken from the FSDP beaker instance (https://beaker.ofa.iol.unh.edu/bkr/).
*Some tasks can be used to reserve the system for longer periods of time.
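For comparison with the web GUI "reserve" task, a job written in Beaker's XML format might look roughly like the following. This is a sketch under stated assumptions: the whiteboard text, distro name, pool name, and reservation time are placeholders and not values from the FSDP cluster. The script only writes the file; submitting it would be done with Beaker's `bkr job-submit` client command.

```shell
#!/bin/bash
# Write an example Beaker job that installs a host and then holds it for a while.
# All values marked EXAMPLE/example are placeholders, not real FSDP settings.
job="$(mktemp)"
cat > "$job" <<'EOF'
<job>
  <whiteboard>Example reservation (placeholder values)</whiteboard>
  <recipeSet>
    <recipe>
      <distroRequires>
        <distro_name op="=" value="EXAMPLE-DISTRO"/>
      </distroRequires>
      <hostRequires>
        <pool op="=" value="example-pool"/>
      </hostRequires>
      <task name="/distribution/check-install" role="STANDALONE"/>
      <task name="/distribution/reservesys" role="STANDALONE">
        <params>
          <param name="RESERVETIME" value="3600"/>
        </params>
      </task>
    </recipe>
  </recipeSet>
</job>
EOF
echo "wrote $job"
# To submit: bkr job-submit "$job"
```

Restricting the recipe to a pool is what ties a job to one fabric, which is why the pools are organized that way.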
TESTING RDMA SUPPORT IN REDIS

What is Redis?
- Redis
  - High-speed in-memory database
  - A "data structures server"
  - A server that exposes multiple mutable data structures via a set of commands
  - Server-client model that uses a simple TCP protocol for sending commands
- Non-volatile
  - Even if the data structures are always in-memory, redis still saves them onto disk
  - Maintains speed while still keeping data

Redis and RDMA
- RDMA support in Redis
  - Ticket for RDMA support opened in July, 2021 (issue 9292)
  - Pull request (PR 11182) was opened for server-side support with 2 main blockers:
    - A general lack of knowledge on RDMA from Redis maintainers
    - A call for testing the RDMA support across different fabrics
  - Only tested by the submitter on RXE (soft-RoCE) and RoCE using a ConnectX-5 net card
  - AWS has a partial implementation of proprietary RDMA capability
  - Azure only supports Mellanox InfiniBand
- This is where the FSDP cluster comes in!

DEMO

THANK YOU