Cornelis Networks CN5000 Adapter and Software Update
Dennis Dalessandro, Kernel SW Dev Manager
2024 OFA Virtual Workshop

Notices and Disclaimers

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH CORNELIS NETWORKS PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN CORNELIS NETWORKS'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, CORNELIS NETWORKS ASSUMES NO LIABILITY WHATSOEVER, AND CORNELIS NETWORKS DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF CORNELIS NETWORKS PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. CORNELIS NETWORKS PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.

Cornelis Networks may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked reserved or undefined. Cornelis Networks reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. Roadmap not reflective of exact launch granularity and timing.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Any code names featured are used internally within Cornelis Networks to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Cornelis Networks to use code names in advertising, promotion or marketing of any product or services and any such use of Cornelis Networks internal code names is at the sole risk of the user.

All products, computer systems, dates and figures specified are preliminary based on current expectations and are subject to change without notice. Material in this presentation is intended as product positioning and not approved end user messaging.

Performance tests are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Cornelis Networks technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration.

Cornelis, Cornelis Networks, Omni-Path, Omni-Path Express, and the Cornelis Networks logo belong to Cornelis Networks, Inc. Other names and brands may be claimed as the property of others.

Copyright 2024, Cornelis Networks, Inc. All rights reserved.

Cornelis Networks 2024

Last Year

- Supporting an Upstream First Kernel Driver for HPC Fabrics
- We covered the Why and the How
- We talked about what not to do and the consequences

This Year

- How do you do upstream first with new and in-development HW?
  - We are figuring that out!
- What are we actually doing?
- What is this CN5000 thing?
- Challenges
- Not A Sales Pitch!
  - We got people for that, I can put you in touch if you want
  - The technology should stand for itself
- My goal: Get the code upstream for when HW is available

What is CN5000?

- CN5000 is the next generation fabric solution
- Consists of adapters, switches, cables, and software
- We will focus on the adapter and the software for it
- Why?
  - That's what the host sees
  - SW is what people interact with
  - (frankly it's what I know!)

Switches

- Edge Switch
  - 48 Ports 400Gbps
  - Air, hybrid, liquid cooling available
- Director Switch
  - 576 Ports 400Gbps
  - Air and liquid cooling available
- https:/

Adapters

- The really cool stuff!
- PCIe Gen 5
- Low profile
  - Smaller is better. Have you tried to cram a GPU into a server?
- Air or indirect liquid using heat pipe from ASIC to server cold plate
- 1 or 2 Fabric ports 400Gbps (OPA-100 is 1 port 100Gbps)

Technical Details

- Host Fabric Interface (HFI)
- OPA-100 adapter ASIC is known as Wolf River (WFR)
  - Name leaked from Intel long ago, old news
- CN5000 adapter ASIC is known as Jackal River (JKR)
  - Flat out telling you what it is because it's in the code and code names don't matter
- Continue to take advantage of 16 DMA Engines
  - DMA Engines bring data into the card, avoiding a CPU copy
  - These are the large data transfers
- PIO (Programmed IO) capability increased
  - 160 contexts available in WFR
  - 240 contexts available in JKR
  - Memory increased from 1MB to 4MB
- Full 16B packet type support in HW, as well as 9B
  - WFR only supported 9B in HW
  - 16B enables adaptive routing, larger LIDs (24 bit vs 16)
- PKey table increased from 16 to 1024
  - Needed for MLS SELinux support

Why am I telling you this?

- Shouldn't this be secret? Simply put, NO
- Everything I mention here is or will be obvious in the code
- Makes Open Source acceptance of the code easier... maybe?
- Open Source community can contribute more readily
- Settings/tweaks available in our tuning guide
- Honestly it's just really cool stuff nerds like to hear about
- If you are bored now, just wait till we talk about code next!

Where were we? More details.

- Receive descriptors (how we land packets) have increased
  - From 65536 in WFR to 131072 in JKR
- Supports 8 VLs for data plus VL15 for mgmt
- PCIe SR-IOV support (lots of code changes coming for this!)
- Dual loopback ports for SI to SI packet communication
- Integrated CPORT processor
  - Handles fabric mgmt (MAD packets in FW now, not in SW!)
- Receive Side Matching
  - Deeper packet inspection
  - More rules, from 4 to 32
  - Lots of ideas floating around in my head for these!
  - Your ideas welcome too! What could you do with them?
- Lots of other bells and whistles in the HW!

What is the Upstream Plan?

- Is this going to be hfi2?
  - Discussed with maintainer last time OFA was in person
  - Response was: “Please God NO!”
  - I agree, so it will be hfi1 still
  - JKR is based on a lot of the same concepts as WFR
  - Do we really need the 1? Not really, but why bother
- Plan is to delete qib
  - Few known users of qib left in the wild
  - They keep popping out of the woodwork periodically though!
  - Product has long been End of Life
  - Delete qib as part of JKR upstreaming

Rdmavt

- Software verbs implementation
- Presented to OFA a number of years ago
- Solved code duplication between hfi1 and qib
- With qib gone and hfi1 supporting both JKR and WFR, do we still need rdmavt?
  - Technically, NO
  - However, no plans to remove it and collapse back into hfi1
  - Maybe someday if we run out of other things to do
- Invisible to application writers

Major Changes: Registers moving

- While JKR and WFR are sort of similar, a large number of HW registers have changed
- A large number of patches that “parameterize” something
  - Then a follow-on patch to use that parameter
  - Might squash, we'll see
- Added some new header files to separate WFR and JKR registers
- Not taking the qib approach of function pointers for each chip

    commit b54fe09adca5457178320bf95ec46175164e6285
    Author: Dean Luick
    Date:   Thu Oct 19 10:45:01 2023 -0400

        RDMA/hfi1: Parameterize PIO init register

        The PIO init register will move in the new hardware. Make it a
        parameter.

        Signed-off-by: Dean Luick

    commit 6e8aa91b07f0ccbc86f33924062cb87f7983f3cb
    Author: Dean Luick
    Date:   Thu Oct 19 12:45:17 2023 -0400

        RDMA/hfi1: Add JKR PIO init register support

        Add the JKR PIO init register.

        Signed-off-by: Dean Luick

Major Changes: Multiple Ports

- We tried our best in WFR to be extensible to multiple ports
- In a number of places we decided to use per_port_data[0] blindly
- Need to propagate the port number through multiple layers of code
- Not technically difficult, just tedious and a lot of code churn
  - 7 years of code being moved around and modified
- hfi1 has a per device data structure: the dd
- It also has a per port data structure, even for the single port case: the ppd
- Moved a number of fields from the dd into the ppd

Major Changes: CPORT

- WFR had extensive handling for MADs
- MAD processing is moving to CPORT
  - Enables more fabric security
  - Takes burden off of the driver, free to handle other packets
- Driver and CPORT exchange information through a register interface
  - Can also talk over the loopback ports for high speed communication
- CPORT manages things like Link Status, LIDs, PKeys, etc.
  - Driver still needs to know when things change
  - Driver has to be able to hand CPORT MAD packets from umad

Major Changes: User API

- Technically no reason it has to change for JKR
  - Mostly: Needs to know what port to use, it does not currently
- We agreed long ago that the dreaded Cdev was on its way out
  - Heavy upstream resistance to any changes in our user API until Cdev is gone
- Likely plan: Use uverbs FD for command and control
- We still need an FD for data path
  - For SDMA submission, driver programs DMA engines
  - PIO is kernel bypass
- Want to use io_uring()
  - Some complications, open issues
- Need memreg for dmabuf and GPUs
- Code is still a work in progress, look for an RFC soon

What about GPU?

- Applies to WFR and OPA-100 as well as JKR and CN5000
- Everyone knows the problem: GPU code not upstream
  - Not going to harp on this, I hope we can all agree it's just bad
- What do we do as an upstream first development organization?
  - Other than hold our nose and just deal with it
- Embrace the distro kernels and drivers in particular
  - Distro code + GPU = Our GitHub driver
  - Distro code comes from Kernel.org
- Kernel.org provides no good way to hook in non-upstreamable code
  - Penalizes users in hopes they will push back on unfriendly-to-opensource vendors
  - Hasn't happened, not going to happen
  - There are technical issues too: how do you handle header files? License issues, etc.
- Point is, instead of being obstinate, we as a kernel community COULD work the problem

GPU *Motivated* Changes

- Picture is the same for GPU
- If we can't upstream GPU code, we want to limit how much GPU code there is
- Turns out different GPUs and even our own ways of handling memory are similar
- Abstract out common code and “modularize” it
- Move system memory page pinning to its own file and create a nice interface
- Now we can “drop in” files for Nvidia and AMD that do their specific page pinning
- Makes it easy to add/remove GPU code and limits the difference from upstream
- Have to monitor system pinning changes for upstream fixes and evaluate if needed in GPU

So what about dmabuf?

- That's a step in the right direction
- Only 1 GPU supports it, maybe 2 someday?
- This is part of our motivation to use the verbs device
  - For memory registration, dmabuf support already exists
- dmabuf will be a drop-in addition like the Nvidia and AMD code
  - Except it will be upstream!

Software High-level Status

- opa-fm
  - In-distro, will support CN5000 out of the box
- libpsm2
  - Active as long as OPA-100 is around
- OFI/OPX/libfabric
  - Available for OPA-100 but really targeted for CN5000
- Fast fabric tools
  - In-distro, some are being revamped, more details soon
- Kernel
  - Coming soon to a mailing list near you!

What's our overall status?

- Kernel code is coming. Soon, watch linux-rdma
- CN5000 HW coming later this year!
- We still have OPA-100 too!
  - Lots of users
  - Still fully supported
  - Active development
    - Recently added AMD GPU
    - Recently added a backwards compatibility shim for libpsm2 cuda

Thank You
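
Backup sketch for "Major Changes: Registers moving". The slide describes patches that first turn a register location into a parameter and then add the JKR value, instead of taking the qib approach of per-chip function pointers. A minimal user-space illustration of that idea could look like the following. This is not hfi1 code: the struct names, the helper functions, the fake MMIO array, and both register offsets are invented for illustration. Only the chip names and the 160/240 PIO context counts come from the talk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* WFR and JKR are the chip names from the talk; everything else is made up. */
enum chip_type { CHIP_WFR, CHIP_JKR };

/*
 * Per-chip parameters held as plain data. Common code reads the offset
 * from here, so no per-chip function pointers are needed.
 */
struct chip_params {
    uint32_t pio_init_reg;   /* byte offset of the PIO init register (hypothetical) */
    uint32_t num_pio_ctxts;  /* 160 on WFR, 240 on JKR per the talk */
};

static const struct chip_params chip_params_tbl[] = {
    [CHIP_WFR] = { .pio_init_reg = 0x0010, .num_pio_ctxts = 160 },
    [CHIP_JKR] = { .pio_init_reg = 0x0400, .num_pio_ctxts = 240 },
};

#define FAKE_MMIO_WORDS 512

/* Stand-in for a device: a parameter pointer plus a fake register file. */
struct fake_dev {
    const struct chip_params *params;
    uint64_t mmio[FAKE_MMIO_WORDS];
};

/* Convert a byte offset into an index in the fake 64-bit register file. */
static size_t csr_index(uint32_t offset)
{
    assert(offset % 8 == 0 && offset / 8 < FAKE_MMIO_WORDS);
    return offset / 8;
}

static void write_csr(struct fake_dev *dd, uint32_t offset, uint64_t value)
{
    dd->mmio[csr_index(offset)] = value;
}

static uint64_t read_csr(const struct fake_dev *dd, uint32_t offset)
{
    return dd->mmio[csr_index(offset)];
}

/* Zero the fake registers and select the parameter set for this chip. */
static void dev_init(struct fake_dev *dd, enum chip_type type)
{
    memset(dd, 0, sizeof(*dd));
    dd->params = &chip_params_tbl[type];
}

/* Common code: identical for both chips, only the parameter differs. */
static void pio_init(struct fake_dev *dd)
{
    write_csr(dd, dd->params->pio_init_reg, 1);
}
```

Calling dev_init() with CHIP_WFR or CHIP_JKR and then pio_init() writes to whichever offset the table gives for that chip. Supporting another chip in this sketch would mean adding a table entry rather than another set of callbacks.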