Building Hardware-Accelerated Networking Applications on SmartNICs
Session B-101: Data Center Applications
John W. Lockwood, CEO, and John Hagerman, VP, Algo-Logic Systems, Inc.
2pm, April 27, 2022
San Jose, CA, April 26-28, 2022

Gateware Defined Networking
Algo-Logic's FPGA/SmartNIC Development Framework
Diagram labels:
- Software on CPU
- Logic on FPGA
- Business logic in C/C++, in gateware in FPGA with HLS
- Algo-Logic TCP networking stack: TCP, ULL MAC, UDP Offload, UDP
- Intel HSSI
- Exact Match Lookup Engine (EMSE)
- FPGA registers
- PCIe
- ULL MAC
- Algo-Logic IP cores
- Card vendor hardware + software driver
- Customer software and business logic
- Key-Value Table API
- Ethernet
- Customer's existing C/C++ software
- Kernel bypass driver
- Low latency DMA

Algo-Logic provides algorithms that run in logic on multiple FPGA partner platforms
- Cisco Nexus SmartNIC+ V5P/V9P: UltraScale+ FPGA; on-chip and off-chip SRAM; DRAM; 8 x 10G/25G Ethernet
- Intel Programmable Acceleration Card (PAC) D5005: Stratix 10 FPGA; on-chip SRAM; DRAM; 8 x 10G/25G Ethernet
- Xilinx ALVEO U50/U200/U250: UltraScale+ FPGA; on-chip SRAM; DRAM; HBM; 8 x 10G/25G Ethernet

Motivation for Key Value Store (KVS) in the Cloud
In-memory KVS systems are used widely in the cloud:
- Amazon DynamoDB: used for shopping carts and active session stores (profile, messages, target promotions); milliseconds of latency to retrieve small values (400 KB)
- Facebook RocksDB: used to track the state of users, graph search, and cache for Hadoop; embedded database for key-value data, written in C/C++, using RAM and Flash
- Microsoft FASTER: "Managing large application state easily, resiliently, and with high performance is one of the hardest problems in the cloud today"
- Redis: portable across all cloud providers and available for on-premise deployments; open-source code base with professional support

Algo-Logic's Network-Attached KVS in FPGA Logic
See also: Algo-Logic GDN Search (Key Value
Store)

HiREDIS C/C++ API for Ethernet-attached Compute Clients
A network-attached 1U rack server with CPUs and FPGAs provides massive throughput
- Dell/CCI PowerEdge R6525 1U server
- Top-of-rack 10G/25G Ethernet switch
- 1 Solarflare mezzanine NIC
- 3 Xilinx ALVEO U50 cards with UltraScale+ FPGAs
- Algo-Logic gateware for KVS in FPGA
- Network ports: 40 Gbps QSFP+, 50 Gbps 2xSFP28, 40 Gbps QSFP+, 40 Gbps QSFP+
- 2 AMD CPUs on motherboard for running Redis software
- 256 GB of DDR4 SDRAM
- Client software: HiREDIS C/C++ API
- Cloud Onload and C/C++ API modeled on HiRedis
- 10/25/40/100 Gbps Ethernet switch

Key Value Store in 1U Server
- 40*3 + 50 = 120 + 50 = 170 Gbps bandwidth to KVS tables
- Network ports: 40 Gbps QSFP+, 50 Gbps 2xSFP28, 40 Gbps QSFP+, 40 Gbps QSFP+
- Compute client software running in user space on scale-out compute servers (four clients shown)
- 10G, 25G, 50G, or 100G to each compute client

Details of the Dell/CCI PowerEdge R6525 1U Rack Server
- Two AMD EPYC 7402 24-core CPUs (96-way multi-threaded)
- 256 GB of ECC DRAM using 16 DDR4 DIMMs
- Three half-height slots with Xilinx U50 FPGA cards with UltraRAM
- One mezzanine slot with Solarflare Cloud Onload NIC

Competitive Advantage of Algo-Logic Gateware
- Traditional network software (sockets on Linux): high latency and large jitter
- FPGA gateware: lowest latency, no jitter
- Kernel bypass software: lowers latency, still has jitter
- Lower latency = best speed; tightest spread = least jitter

Key Outcomes
Total throughput in 1U rackmount server
- 3 * 150M IOPS from FPGA Key Value Store
  - Implemented on 3 Xilinx ALVEO U50 cards
  - Each U50 card fits in a half-height PCIe slot
  - Connected with 4 * 10 Gigabit Ethernet ports
- 2 * 20M IOPS from Redis in software on Dell AMD server
  - Using dual-port Solarflare NIC on mezzanine card
  - Each mezzanine card has 2 * 25 Gigabit Ethernet
- Combined, the 1U server (1.75" tall and 19" wide) provides 450M + 40M = 490M IOPS

Algo-Logic Framework for FPGA-Accelerated Trading
Diagram labels:
- Intel Xeon CPU
- FPGA card
- Business logic in C/C++, in gateware in FPGA with Intel HLS
- Algo-Logic TCP networking stack bundled with Intel D5005: TCP, ULL MAC, UDP Offload, UDP
- Intel HSSI
- Exact Match Lookup Engine (EMSE)
- FPGA registers
- PCIe
- Intel kernel bypass driver
- Intel low latency DMA
- ULL MAC
- Algo-Logic IP cores
- Intel hardware + software driver
- Customer software and business logic
- Intel low latency socket API
- Key-Value Table API
- Stock, option, future, crypto exchange
- Customer's existing trade software: Order Management System (OMS), C/C++ in software
- Data paths: orders over TCP over Ethernet; market data over UDP over Ethernet

Ultra Low Latency Networking
FPGA platform
- PCIe card with FPGA
- Fast data mover / kernel-bypass NIC
- High Level Synthesis (HLS)
Algo-Logic provides
- Ultra-low latency MACs
- UDP/TCP endpoints in logic
- Cut-through data processing
- APIs for C/C++ software apps
Ideal solution for
- High-speed trading
- Pre-trade risk checks
- Trading gateways

Round-Trip Application Latency
[Figure: latency distributions]
- Labels: Other 3rd party NIC; Kernel bypass; Default firmware; Low latency firmware; This Host + Network Software API; Intel data mover; Algo-Logic MAC + TCP
- Algo-Logic FPGA logic: ULL MAC, UDP, TCP; cut-through processing; deterministic latency; HLS interface for algorithms in logic coded in C/C++
- Horizontal axis: nanoseconds, microseconds, tens to hundreds of microseconds, milliseconds
- Vertical axis: percent of packets with that latency

Round-Trip Application Latency
[Figure: latency distributions for trading]
- Labels: NIC; Kernel bypass; Default firmware; Low latency firmware; Host + Network Software API; Data mover; Algo-Logic MAC + TCP
- Algo-Logic FPGA logic: ULL MAC; TCP endpoint; HLS interface for trading algorithms in logic
- Curves: T2T (tick-to-trade) in FPGA; PTRC (pre-trade risk check) in FPGA; T2T in SW with NIC; PTRC in SW with NIC
- Horizontal axis: nanoseconds, microseconds, tens to hundreds of microseconds, milliseconds
- Vertical axis: percent of packets with that latency

Trading Solutions: Latency vs. Development Effort
- Latency aware: software only,
CPU/foundational NIC; banks, mutual funds, hedge funds
- Latency sensitive: software and hardware accelerated, CPU/FPGA/FSI SmartNIC; investment banks/exchanges, automated trading
- Latency critical: pure hardware, FPGA or custom ASIC; high frequency traders
- Axes: round-trip latency in microseconds (tick labels ns, µs, 1 µs, 10 µs+, ms, sec) versus development time (hours, days, weeks, months, years, never)

Conclusions
Algorithms in logic enable ultra-low latency applications
- Start with pre-built components for networking
  - Library of IP cores developed in Verilog
  - ULL MAC
  - UDP/IP
  - TCP/IP
  - Object Store (Key Value Store)
  - Protocol parsers for Tick-to-Trade
- Customize application in HLS environment
  - Write C/C++ code
  - Compile application to logic with vendor tools (Vitis, Quartus, ...)
  - Synthesize with IP cores in framework

Algo-Logic is Hiring
- Staff located entirely here in San Jose, CA
- Hiring for multiple roles

Key Points
By mapping algorithms into logic, network
applications can be implemented that perform full-stack processing functions with very low latency. Whereas applications written in software typically require multiple microseconds to complete application-level functions, applications implemented in Field Programmable Gate Array (FPGA) logic process packets of data on the timescale of nanoseconds. A challenge in the past has been that the time to develop applications was typically measured in years. But today, with the help of High Level Synthesis (HLS) compilers and pre-built ultra-low-latency IP cores for networking, Algo-Logic provides complete frameworks for deploying applications for high-frequency trading, database, and other real-time applications on multiple off-the-shelf FPGA platforms from Cisco, Xilinx, and Intel.
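
The slides mention writing business logic in C/C++ for HLS and list pre-trade risk checks as a target use case. The following is a minimal sketch of what such a function might look like, not Algo-Logic's implementation. It is plain C++ written in a style that HLS tools generally handle well: fixed-size tables, statically bounded loops, and no dynamic allocation. All names here (`Order`, `RiskLimit`, `RiskTable`, `pre_trade_risk_check`, `kTableSize`) are hypothetical, the limit table is an assumed stand-in for a key-value table, and no vendor pragmas or framework APIs are shown.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical order message; field names are illustrative only.
struct Order {
    std::uint32_t symbol_id;    // instrument key used for the table lookup
    std::uint32_t quantity;     // number of shares or contracts
    std::uint32_t price_ticks;  // limit price in integer ticks
};

// Hypothetical per-symbol limits, standing in for one key-value table entry.
struct RiskLimit {
    std::uint32_t symbol_id;
    std::uint32_t max_quantity;
    std::uint64_t max_notional;  // ceiling on quantity * price_ticks
    bool valid;                  // false marks an empty table slot
};

// Fixed table size so every loop bound is known at compile time.
constexpr std::size_t kTableSize = 8;
using RiskTable = std::array<RiskLimit, kTableSize>;

// Returns true only when the symbol is known and the order is within
// both the quantity limit and the notional limit.
bool pre_trade_risk_check(const Order& order, const RiskTable& table) {
    bool found = false;
    RiskLimit limit{};

    // Exact-match scan over a small fixed table. The loop has a constant
    // trip count, so a synthesis tool could unroll it into parallel compares.
    for (std::size_t i = 0; i < kTableSize; ++i) {
        if (table[i].valid && table[i].symbol_id == order.symbol_id) {
            found = true;
            limit = table[i];
        }
    }
    if (!found) {
        return false;  // unknown symbols are rejected
    }

    // Two 32-bit operands widened to 64 bits, so the product cannot overflow.
    const std::uint64_t notional =
        static_cast<std::uint64_t>(order.quantity) * order.price_ticks;

    return order.quantity <= limit.max_quantity &&
           notional <= limit.max_notional;
}
```

Because the function uses only standard C++, it can be compiled and unit-tested on a CPU before being handed to a vendor HLS flow. For example, with a table entry for symbol 42 that allows at most 1000 units and a notional of 5,000,000 ticks, an order for 100 units at 2500 ticks passes, while an order for 2000 units is rejected.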