HC2022.UniversityOfToronto.AmeerAbdelhadi.Slides.v00.pdf (65 pages)
NOEMA: A Massive-Scale Brain Activity Decoding Chip
Ameer Abdelhadi, Eugene Sha, Andreas Moshovos
University of Toronto
August 2022

Brain Machine Interfaces (BMIs)
- BMI: processes signals from the brain
- Robotic arm moves
- Feedback to the brain
- e.g., 1 second delay

BMIs at the edge
What if we can detect patterns of neuron activity in real-time?
Detect, in real-time, memories, decisions, emotions, and experiences.
Applications:
- Repair brain function: interface brain regions which no longer connect, e.g., Alzheimer's. (Figure: replacement of damaged hippocampus with a chip [1])
- Drive effectors: greater accuracy and dexterity, e.g., robotic limbs. (Figure: woman controls robotic arm with 100-channel Utah array [2])
- Anticipate and prevent harmful neural activity, e.g., epilepsy. (Figure: responsive neurostimulator system for epilepsy [3])
[1] https:/ repair)
[2] https://continuum.utah.edu/web-exclusives/the-bionics-man/ (Utah Array)
[3] Critical review of the responsive neurostimulator system for epilepsy (Thomas and Jobst, 2015)

The Challenge and Opportunity
Capture capability is growing exponentially.
[Chart: simultaneously recorded neurons over time. Data from https://stevenson.lab.uconn.edu/scaling/]
Constraints for a portable implanted device:
1. Fast (real-time, 5 ms detection latency)
2. Low-power and low-area
3. Scalable
Existing solutions can't cope; data is quickly outpacing analysis techniques:
- Limited number of neurons
- Not real-time
- High power
- Physically large
Brain activity decoding is memory intensive and computationally expensive.

Roadmap to NOEMA
- Input to the system
- Template matching
- Baseline design and Noema
- Results

The Raw Input Data / Processing Pipeline
1. Neural probes (+1,000s channels)
2. Waveforms
3. Spike sorter
4. Indicator bitstreams (+30,000 neurons), one bitstream per neuron, neuron #1 to neuron #N
5. Template matching: temporal binning, then Pearson Correlation Coefficient

Temporal Binning
Data "smoothing".
Indicator bitstreams (one row per neuron, N neurons; time runs left to right):
0 1 1 0 1 0 0 1 1
0 0 1 0 0 1 1 1 0
1 1 0 0 0 0 0 0 1
1 0 1 1 0 1 0 1 0
- Divide the incoming input into groups of 3 (example bin size).
- Add the bits in each group together.
- Do this for all bins in a template.
Binned input:
2 1 2
1 1 2
2 0 1
2 2 1

Template Matching
Temporal binning feeds the Pearson Correlation Coefficient.
Which template does the input most closely resemble? (Figure: templates #1, #2, #3; binary input; binned input)
How do neuroscientists determine this?

Pearson Correlation Coefficient (PCC)
Widely used metric to measure the "closeness" of two matrices.

PCC Example
Template             PCC score (r)
#1 Move right arm    0.135
#2 Move left arm     0.857
#3 Move left leg     0.196

Template Matching Overview
Indicator bitstreams, then temporal binning, then PCC against each template, giving the scores 0.135, 0.857, 0.196.

Costs of baseline template matching design
- Input buffer: the entire input buffer fills before compute begins, which means high latency. This is the most difficult requirement: 5 ms for real-time.
- On-chip template memory: storage of input and templates means a large memory cost, e.g., +1.24 Gb each.
- +100k parallel PEs: many processing elements mean a large area cost.
How can we do better?

NOEMA [MICRO'21, Patented]: Brain Interfaces at the Edge
A multidisciplinary collaboration effort in analyzing and developing a custom hardware platform to decipher brain neural activity.
Enabling truly portable systems for processing high-resolution brain activity signals for treatment, augmentation, and repair of brain functions.
NOEMA's prototype chip:
- Fabricated with TSMC 65nm GP technology
- Only 24 µsec latency (5 sec experience, 1K neurons)
- 0.73 mW
- Scales to 30K neurons, 10x more than have ever been recorded
- Scales to meet future demand

Input Serialization and PCC Reformulation
Indicator bitstreams (N neurons by time):
0 1 1 0 1 0 0 1 1
0 0 1 0 0 1 1 1 0
1 1 0 0 0 0 0 0 1
1 0 1 1 0 1 0 1 0
Serialize: 0 0 1 1 1 0 1 0 1 1 0 1
Reformulation: the PCC is reformulated to operate on the serialized input.

NOEMA's innovations
(Diagram: S1 PE, S2 PE, S3 PE, post processing, on-chip template memory)
- Bit-serial input: no buffering overhead; compute immediately when received. Fits well with existing probe interfaces (time-multiplexed ADC out).
- Near-memory bit-serial PEs: based on the reformulated PCC; tiny, easy to scale.
- Simple memory compression (2.8x).

Baseline to NOEMA Overview
- Baseline: input buffer, on-chip template memory, +100k parallel PEs
- NOEMA: S1 PE, S2 PE, S3 PE, post processing, on-chip template memory

Performance Results
[Chart: comparison of latency across platforms. Y-axis: latency (ms), log scale from 0.001 to 1,000,000. Platforms: CPU (RPi3B), CPU (i5 7000), GPU (GTX1080), PCC baseline, Noema. Data labels as extracted: 11295286052150.001]
- Real-time constraint: 5 ms
- Can continue to scale up
* For the most demanding configuration tested (9 sec experience, 30K neurons)

Power and Area Results
              Power (W)   Silicon area (mm²)
PCC baseline  17.3        532.9
Noema         1.2         205.4
Reduction     14x         2.6x
* For the most demanding configuration tested (9 sec experience, 30K neurons)

The NOEMA Family
Device            Fmax   Neurons      Templates  Duration[1]  Resolution[2]   Compute[3]  Memory[3]  FPGA[4]  ASIC[5]
                  (MHz)  (thousands)             (seconds)    (milliseconds)  (GOPs)      (Mb)
NOEMA01K1T05S250  30     1            1          5            250             0.6         0.3
NOEMA10K2T05S005  300    10           2          5            5               628.0       114.4               Planned
NOEMA20K3T09S250  600    20           3          9            250             64.8        33.0       N/A[6]   Planned
NOEMA30K4T09S005  900    30           4          9            5               6786.4      1236.0     N/A[6]   Planned
[1] Duration of the decoded experience.
[2] Resolution window of the incoming activities. Activities within this window are binned (averaged).
[3] Requirements if executed on commodity hardware.
[4] Intel's Stratix 10 FPGA.
[5] TSMC 65nm GP.
[6] Not applicable; device can't meet target frequency.

NOEMA's ASIC Devices
Device            Silicon area (mm²)          Power (mW)                  Latency  Chip status
                  Memory   Logic   Total      Memory   Logic    Total     (µs)
NOEMA01K1T05S250  0.36     0.07    0.43*      0.30     0.43     0.73      23.9     In lab +#
NOEMA10K2T05S005  28.46    1.35    29.81*     89.78    84.28    174.06    2.8      Simulated #
NOEMA20K3T09S250  6.26     0.09    6.25*      18.55    9.68     28.23     1.5      Simulated #
NOEMA30K4T09S005  202.00   3.42    205.42*    682.70   522.76   1205.46   1.0      Simulated #
* Core only; 2.1 mm² total silicon area.
+ Fabricated with TSMC 65nm GP.
# Also tested on Intel's Stratix 10 FPGA.

NOEMA01K05S250MS
(Die photo labels: Template RAM, Compute RAM, Compute RAM, Compute RAM, Logic)
- TSMC 65nm GP
- 24 µsec latency
- 1K neurons (scales to 30K)
- 5 sec experience
- Consumes 0.73 mW
- Equivalent of 600 MOPs 32-bit FP
By comparison:
- Nvidia Jetson Nano: consumes 10 W; barely meets the 5 ms real-time latency
- Intel i5-7000: 63 ms latency; fails to meet real-time latency

NOEMA Key Takeaways
Brain machine interfaces:
- Exponential growth in data
- Current solutions are not sufficient
NOEMA's key innovation:
- Uses simple, low-cost, area- and energy-efficient bit-serial and integer arithmetic units
- Enables computations to proceed progressively as data is received
- Scales to meet future demand
- 14x less power, 2.6x smaller, order of µsec latency

Thank you!
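
The temporal binning step from the slides (split each neuron's indicator bitstream into groups of 3 and add the bits together) can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name `bin_bitstreams` and the list-of-lists layout are assumptions made for the example, not part of the NOEMA design.

```python
def bin_bitstreams(bits, bin_size):
    """Sum each neuron's indicator bits over consecutive bins of bin_size steps.

    bits: one list of 0/1 values per neuron (hypothetical layout for this sketch).
    Returns one list of per-bin spike counts per neuron.
    """
    binned = []
    for row in bits:
        if len(row) % bin_size != 0:
            raise ValueError("bitstream length must be a multiple of bin_size")
        binned.append([sum(row[start:start + bin_size])
                       for start in range(0, len(row), bin_size)])
    return binned


# The 4-neuron, 9-step example shown on the temporal binning slide.
example_bits = [
    [0, 1, 1, 0, 1, 0, 0, 1, 1],
    [0, 0, 1, 0, 0, 1, 1, 1, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 1],
    [1, 0, 1, 1, 0, 1, 0, 1, 0],
]

binned = bin_bitstreams(example_bits, bin_size=3)
for counts in binned:
    print(counts)
```

With a bin size of 3 this reproduces the binned matrix from the slide (rows 2 1 2, 1 1 2, 2 0 1, 2 2 1).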
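
The template matching step (score the binned input against every template with the Pearson Correlation Coefficient and pick the closest) might look like the following. The template values and the names `pearson` and `best_template` are hypothetical and chosen only for illustration; they are not the data behind the 0.135 / 0.857 / 0.196 scores in the slides.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equally shaped matrices, flattened."""
    x = [v for row in a for v in row]
    y = [v for row in b for v in row]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant matrix
    return cov / math.sqrt(var_x * var_y)


def best_template(binned, templates):
    """Return the best-matching template name and all scores."""
    scores = {name: pearson(binned, t) for name, t in templates.items()}
    return max(scores, key=scores.get), scores


# Binned input from the temporal binning slide.
binned_input = [[2, 1, 2], [1, 1, 2], [2, 0, 1], [2, 2, 1]]

# Hypothetical templates, invented for this sketch.
templates = {
    "move right arm": [[0, 2, 1], [2, 1, 0], [1, 2, 2], [0, 1, 2]],
    "move left arm": [[2, 1, 2], [1, 1, 3], [2, 0, 1], [3, 2, 1]],
    "move left leg": [[1, 1, 1], [0, 2, 1], [1, 1, 2], [2, 0, 0]],
}

winner, scores = best_template(binned_input, templates)
print(winner, scores)
```

Note that this direct form needs the whole binned input before it can compute the means, which is the buffering cost the baseline design pays.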
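
The slides state that NOEMA serializes the input and reformulates the PCC so that computation proceeds as each bit arrives, but they do not give the reformulation itself. As a rough sketch under that caveat, the standard sum-based identity r = (nΣxy - ΣxΣy) / sqrt((nΣx² - (Σx)²)(nΣy² - (Σy)²)) allows every incoming bit to update a handful of integer accumulators immediately. The class `StreamingPCC`, its methods, and the template values are assumptions for illustration and are not claimed to match the chip's actual datapath.

```python
import math


def serialize(bits):
    """Time-major serialization: for each time step, one bit per neuron."""
    return [bits[i][t] for t in range(len(bits[0])) for i in range(len(bits))]


class StreamingPCC:
    """Accumulates PCC terms bit by bit (a sketch, not NOEMA's actual design)."""

    def __init__(self, template, bin_size):
        self.template = template            # binned template, [neuron][bin]
        self.bin_size = bin_size
        self.n_neurons = len(template)
        flat = [v for row in template for v in row]
        self.n = len(flat)
        self.sum_y = sum(flat)              # template sums are known ahead of time
        self.sum_yy = sum(v * v for v in flat)
        self.sum_x = self.sum_xx = self.sum_xy = 0
        self.count = [0] * self.n_neurons   # spikes so far in each neuron's current bin
        self.received = 0

    def push(self, bit):
        neuron = self.received % self.n_neurons
        step = self.received // self.n_neurons
        if step % self.bin_size == 0:
            self.count[neuron] = 0          # a new bin starts for this neuron
        if bit:
            c = self.count[neuron]
            self.sum_x += 1
            self.sum_xx += 2 * c + 1        # (c + 1)^2 - c^2
            self.sum_xy += self.template[neuron][step // self.bin_size]
            self.count[neuron] = c + 1
        self.received += 1

    def score(self):
        num = self.n * self.sum_xy - self.sum_x * self.sum_y
        den = math.sqrt((self.n * self.sum_xx - self.sum_x ** 2)
                        * (self.n * self.sum_yy - self.sum_y ** 2))
        return num / den if den else 0.0


bits = [
    [0, 1, 1, 0, 1, 0, 0, 1, 1],
    [0, 0, 1, 0, 0, 1, 1, 1, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 1],
    [1, 0, 1, 1, 0, 1, 0, 1, 0],
]
template = [[2, 1, 2], [1, 1, 3], [2, 0, 1], [3, 2, 1]]  # hypothetical template

matcher = StreamingPCC(template, bin_size=3)
for b in serialize(bits):
    matcher.push(b)
print(matcher.score())
```

The first twelve serialized bits match the "Serialize" row on the slide (0 0 1 1 1 0 1 0 1 1 0 1). Only a small per-neuron counter and three running sums are kept, so no input buffer is needed in this sketch.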
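
The headline reductions can be cross-checked against the Power and Area Results slide. The sketch below assumes the chart labels read 17.3 W and 1.2 W for power and 532.9 mm² and 205.4 mm² for area (PCC baseline versus Noema, 30K-neuron configuration); the variable names are illustrative.

```python
# Values as read from the Power and Area Results slide (assumed parsing of the chart labels).
baseline_power_w, noema_power_w = 17.3, 1.2
baseline_area_mm2, noema_area_mm2 = 532.9, 205.4

power_reduction = baseline_power_w / noema_power_w
area_reduction = baseline_area_mm2 / noema_area_mm2

print(f"power: {power_reduction:.1f}x, area: {area_reduction:.1f}x")
```

The ratios come out at roughly 14x and 2.6x, in line with the figures quoted in the key takeaways, and the Noema values agree with the NOEMA30K4T09S005 row of the ASIC table (1205.46 mW, 205.42 mm²).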