The nanoPU: A Nanosecond Network Stack for Datacenters
Stephen Ibanez, Alex Mallery, Serhat Arslan, Theo Jepsen, *Muhammad Shahbaz, Changhoon Kim, Nick McKeown
Stanford University, *Purdue University
www.SmartNICsS San Jose, CA April 26-28, 2022

The Need to Minimize RPC Latency and Software Overheads
Large Online Interactive Services
- Web Search
- Recommendation systems
- Online transaction processing
Fine-grained Computing
- Video encoding (ExCamera, NSDI 17)
- Object classification (Sprocket, SoCC 18)
- Software compilation (gg, ATC 19)
- MapReduce-style analytics (Locus, NSDI 19)
- Flash Bursts (NSDI 21)
- ...
[Figure: RPC]
Question: What would it take to absolutely minimize RPC median and tail latency as well as software processing overheads?

Previous Approaches are Insufficient
Approach | Limitation | Wire-to-Wire Latency | RPC Throughput
Dataplane operating systems (e.g. Shinjuku, Shenango) | Too coarse grained | Median: 2-5 µs; Tail: 10-100 µs | 100 Mrps
RDMA NICs | Need low latency to remote compute, not memory | Median: 700 ns | N/A
Integrated NICs (e.g. NeBuLa) | Still room for improvement of latency and throughput | Median: 100 ns; Tail: 2-5 µs | 20 Mrps/core

The nanoPU
Key Features:
- Integrated NIC
- Efficient core selection in HW
- Programmable transport in HW
- Direct path to CPU register file
- Hardware-accelerated thread scheduling
[Figure: Programmable NIC with HW Transport and Core Selection, feeding Core 0 to Core N-1; LLC, Main Memory, DMA Path]
Wire-to-wire latency: 69 ns
Single-core throughput: 118 Mrps

The nanoPU Core
[Figure: Core with HW Thread S, RX / net / TX Registers, L1 I$, L1 D$, RX Queue, TX Queue, MV, Swap]

The nanoPU Core
[Figure: Core with HW Thread S, RX / net / TX Registers, L1 I$, L1 D$, RX Queues, TX Queues, MV, P=1, P=0]

The nanoPU Fast Path
[Figure: Ethernet MAC + Serial IO; Programmable NIC with PISA Ingress and Egress, HW Transport (Reassembly, Packetization, Message Buffers, Pkts, Msgs), HW Core Sel., Global RXQs, Global TXQs; Core 0 to Core N-1, each with HW Thread S, RX / net / TX Registers, and RX/TX Qs]

nanoPU Prototype
- Quad-core nanoPU based on open source RISC-V Rocket core
- 4,300 lines of Chisel code & 1,200 lines of C and RISC-V assembly for custom nanokernel
- Implements NDP transport
- Cycle-accurate simulations (3.2 GHz) on AWS FPGAs using Firesim (ISCA 18)
[Figure: Programmable 200 Gb/s NIC (Ethernet MAC + Serial IO, PISA Pipeline, HW NDP Transport, HW JBSQ Core Sel., Global RX/TXQs) and RISC-V CPU (Rocket Cores 0 to 3, HW Priority Thread S, RX / net / TX Registers, Local RX/TXQs, L1C, L2C, Main Memory); paths: Packet, Message, App reads a message, App writes a message. Stage latencies: 26 ns, 5.3 ns, 2.2 ns, 3.1 ns, 0.9 ns, 0.9 ns, 0.6 ns, 26 ns. Loopback Latency = 13 ns. Wire-to-Wire Latency = 65 ns.]

Microbenchmarks
System | Wire-to-Wire Latency (ns) | Single Core Loopback Throughput (Mrps)
nanoPU | 69 | 118
IceNIC | 103 | 16
eRPC | 850 | 10
[Figure: same prototype datapath as on the previous slide. Stage latencies: 26 ns, 5.3 ns, 2.2 ns, 3.1 ns, 0.9 ns, 0.9 ns, 4.6 ns, 26 ns. Loopback Latency = 17 ns. Wire-to-Wire Latency = 69 ns.]
[Figure, panels (a)-(d): NIC packet datapath and NIC message datapath (Ethernet MAC, Splitter, Arbiter, Encrypt, Decrypt, PISA / MAU Pipeline, Transport; Pkt In, Pkt Out, Msg In, Msg Out); NIC-Core interface (head / tail Registers, FIFOs, Control, Thread Scheduler); CPU0 to CPU2 and Core0 to Core2; CPUs with Nanokernel & Nanotasks; Context FIFOs, Context CSRs, L-NIC CSRs]

nanoPU HW Thread Scheduling
- Reduces tail latency by 4-6x
- 20% higher load

nanoPU Applications
- MICA Key-Value Store: 10592ns
- Raft Consensus, Chain Replication, Set Algebra, and more!

Programmable One-Sided RDMA Operations
State-of-the-art RDMA NIC
- Implemented in NIC HW
- End-to-end latency: 2s11300ns
[Figure: client, server]
The nanoPU
- Implemented in SW
- Can support arbitrary one-sided operations

nanoPU Conclusions
Key Takeaway: To truly minimize median and tail RPC latency and SW overheads:
1. Fast path directly between network and CPU register file
2. Move entire network stack to HW: transport, load balancing, thread scheduling
Challenges:
- Need to rewrite applications
- Figure out how to use more sophisticated processors

Thank You!
Contact Email:
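
The datapath figures on the prototype and microbenchmark slides list eight per-stage latencies next to a loopback total and a wire-to-wire total. The short Python sketch below checks that those numbers are mutually consistent. It rests on an assumption that the slides do not state outright: the first and last 26 ns entries are taken to be the outer, wire-side stages, and the loopback figure is taken to cover only the stages between them. The names `latency_totals`, `PROTOTYPE_STAGES_NS`, and `MICROBENCH_STAGES_NS` are illustrative and not from the deck.

```python
# Per-stage latencies (ns), in the order the two figures list them.
# Assumption: the first and last 26 ns entries are the outer, wire-side
# stages; the loopback total covers only the stages between them.
PROTOTYPE_STAGES_NS = [26, 5.3, 2.2, 3.1, 0.9, 0.9, 0.6, 26]
MICROBENCH_STAGES_NS = [26, 5.3, 2.2, 3.1, 0.9, 0.9, 4.6, 26]


def latency_totals(stages_ns):
    """Return (loopback_ns, wire_to_wire_ns) for a list of stage latencies."""
    wire_to_wire = sum(stages_ns)      # every stage, wire in to wire out
    loopback = sum(stages_ns[1:-1])    # inner stages only (assumed)
    # Round to one decimal to hide floating-point noise in the sums.
    return round(loopback, 1), round(wire_to_wire, 1)


if __name__ == "__main__":
    for name, stages in [("prototype", PROTOTYPE_STAGES_NS),
                         ("microbenchmark", MICROBENCH_STAGES_NS)]:
        loopback, wire_to_wire = latency_totals(stages)
        print(f"{name}: loopback={loopback} ns, wire-to-wire={wire_to_wire} ns")
```

Under that assumption the sums reproduce the totals printed in the figures: 13 ns and 65 ns for the prototype slide, 17 ns and 69 ns for the microbenchmark slide. The two figures differ only in the seventh stage (0.6 ns versus 4.6 ns), which accounts for the 4 ns gap in both totals.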