PCP Size Auto-tuning for Page Allocator Scalability
Huang, Ying
March 2024

Agenda
- Problems and background
- Design and implementation
- Performance evaluation
- Conclusion

Problems and Background

Buddy System
[Figure: six LCPUs in one node, each with its own PCP in front of a single shared buddy; alloc/free goes through the PCP]
- One buddy per zone (node), protected by one zone lock
  - Scalability issue!
- One PCP (Per-CPU Pageset) per LCPU (logical CPU)

Buddy System - Continued
- Physical memory management: node -> zone
- One buddy (page allocator) per zone (per node in practice)
- All logical CPUs of one NUMA node share one zone lock
  - Scalability issue!
  - More and more cores in one NUMA node in the future
- PCP (Per-CPU pageset) can reduce zone lock contention
  - Batching allocation/freeing
  - Less allocation/freeing in zone

Possible Solution 1: Fake NUMA Node
[Figure: six LCPUs in one node split across fake nodes 0, 1, and 2, each with its own buddy (Buddy 0, Buddy 1, Buddy 2); each LCPU prefers its own fake node]
- Very good scalability: zone, reclaim, compaction, etc.
- Easy to implement
- More management burden

Possible Solution 2: Splitting Buddy
[Figure: six LCPUs in one node; one zone split into Region0, Region1, and Region2, each with its own buddy (Buddy0, Buddy1, Buddy2); each LCPU prefers its own region]
- Refused by community for now
  - https://lore.kernel.org/linux-mm/20230511065607.37407-1-
- May be revisited in the future if necessary

Possible Solution 3: Larger Allocation Unit
- Large folios: smaller cache footprint with zone lock held
[Figure: will-it-scale page allocate/free throughput (GB/s) on ICX-SP, order=2 vs. order=0]

Possible Solution 4: PCP Auto-tuning
[Figure: six LCPUs in one node, each with an auto-tuned PCP in front of the shared buddy; alloc/free goes through the PCP]
- Larger PCP in effect: less allocation/freeing in buddy
- Auto-tune: as large as required

Allocation Patterns - Kbuild
- Pattern 1: amplitude 100, period 1s
  - Original PCP
- Pattern 2: amplitude 25k, period 0.5s-1s
  - Auto-tuned PCP high
- Pattern 3: amplitude 100k, period 10s
  - Not covered by PCP

Design and Implementation

Tuning target
- Minimize allocation/freeing from/to shared buddy
  - Reduce zone lock contention as much as possible
- Minimize idle pages in PCP
  - Reduce memory wastage
- Minimize pages in PCP if system free pages are too few
  - Maximize memory utilization

Increase PCP High on Demand
- Increase PCP high by batch when allocating from buddy
  - With a larger PCP high, the allocation from buddy could have been avoided
- Don't change PCP high when freeing to buddy
  - A larger PCP high doesn't help pure freeing
[Figure: on allocate, PCP high moves up to a new high; on free, PCP high stays unchanged]

Decrease PCP High Periodically
- Decrease PCP high periodically
  - 1/8 per second (uses the vmstat updater worker)
- Free pages if necessary
  - Try to free idle pages if any
[Figure: PCP high moves down to a new high; pages above it are freed]

Minimize PCP High When Free Pages Run Low
- Detect low free pages by monitoring the watermark and whether kswapd is running
- Decrease PCP high until minimal
  - Free pages in PCP to be used by workloads
[Figure: PCP high moves down to min high; pages above it are freed]

Performance Evaluation

Test Configuration
- Machine
  - 2-socket SPR SP system with 112 cores
  - 256 GB DRAM
- Kernel config
  - THP: disabled
- Benchmark: multi-instance kbuild
  - Run 8 kbuild instances, each in a separate cgroup
  - Increase parallelism, eliminate influence of LRU lock
  - Simulate build server workload

Test Results
                                      base     optimized
Build time (s)                        650.55   617.77
native_queued_spin_lock_slowpath %    14.04    0.95
Buddy allocation (normalized)         100.00   20.00
- Reduces build time by up to 5%
- Reduces zone lock contention effectively
  - Via reducing allocating/freeing from/to buddy

Kbuild Tuning Details
- PCP high increases during test
- PCP high decreases to minimal finally
- PCP high decreases periodically

Kbuild Memory Used
- More memory used during kbuild run for enlarged per-CPU pageset
- Memory used reduces to normal finally
  - Memory wastage will not last for long

Conclusions
- PCP high auto-tuning can reduce zone lock contention effectively
  - Via reducing page allocating/freeing from/to buddy system
- PCP high auto-tuning can control memory wastage effectively
  - Via lowering PCP high periodically

Thanks!
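
Appendix: toy model of the tuning rules

The three rules from the Design and Implementation slides (raise PCP high by one batch on allocation from the buddy, lower it by 1/8 per periodic tick, drop it to the minimum when free memory runs low) can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel implementation: the class `PcpModel`, its field names, and the default values for `batch`, `high_min`, and `high_max` are all hypothetical, and dropping straight to the minimum on low memory is a simplification of "decrease PCP high until minimal".

```python
from dataclasses import dataclass


@dataclass
class PcpModel:
    """Toy model of one per-CPU pageset; every name and default is hypothetical."""
    batch: int = 64        # pages moved per buddy refill/flush (assumed value)
    high_min: int = 256    # floor for the high mark (assumed value)
    high_max: int = 8192   # ceiling for the high mark (assumed value)
    high: int = 256        # current high mark
    count: int = 0         # pages currently cached in the PCP

    def alloc_page(self) -> bool:
        """Allocate one page; return True if the buddy had to be used."""
        if self.count > 0:
            self.count -= 1
            return False
        # PCP is empty: refill one batch from the buddy and raise high by one
        # batch, since a larger high might have avoided this trip.
        self.high = min(self.high + self.batch, self.high_max)
        self.count = self.batch - 1
        return True

    def free_page(self) -> int:
        """Free one page into the PCP; return pages flushed to the buddy."""
        self.count += 1
        if self.count <= self.high:
            return 0
        # Over the limit: flush one batch but leave high unchanged, because a
        # larger high does not help pure freeing.
        flushed = min(self.batch, self.count)
        self.count -= flushed
        return flushed

    def decay(self) -> int:
        """Periodic tick: shrink high by 1/8; return idle pages released."""
        self.high = max(self.high - self.high // 8, self.high_min)
        return self._trim()

    def on_low_memory(self) -> int:
        """Free memory is low: drop high to the minimum (simplified)."""
        self.high = self.high_min
        return self._trim()

    def _trim(self) -> int:
        # Release whatever the PCP holds above the current high mark.
        excess = max(self.count - self.high, 0)
        self.count -= excess
        return excess
```

In this model, a burst of allocations grows `high` one batch at a time, calling `decay()` once per simulated second shrinks it back toward `high_min`, and `on_low_memory()` returns the cached pages at once so that workloads can use them.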