© 2024 Databricks Inc. All rights reserved

Benchmarking Data and AI Platforms
Shannon Barrow, Lead Solutions Architect, Databricks
Joe Harris, Sr Software Engineer, Databricks

How Much You Bench, Bro? (Shannon Barrow, Lead Solutions Architect)
- Joined Databricks in March 2019
- Previously: Principal, Innovation and Thought Leadership, Accenture Applied Analytics
- Despite the overarching benchmark discussion, I may:
  - Put extra focus on TPC-DI
  - Put on my Databricks hat for short segments

Today's Scope: Focus on Lakehouse and AI Related Benchmarks
- Primarily TPC, but others will be mentioned
- Suggestion: view the following benchmarks through the lens of a full end-to-end Lakehouse architecture
- Lakehouse/OLAP: How can an organization get a "full picture" of end-to-end TCO? Highlights and challenges
- ML/Gen AI: What to benchmark? Focus on Gen AI; lessons learned from Mosaic

Why Benchmark? Level-Setting on the Value and Limitations of Benchmarks
Value (a level playing field for all platforms):
- Standardization and repeatability: conforming to the same practices and to common industry operations, use cases, input/output, and scale
- Industry "agreed upon" testing heuristics
- "Official" submissions
Limitations (it can be hard to believe any results):
- Potential for cheating, bias, and abuse
- Slow pace of modernization

Lakehouse Benchmarking
TP(P)C: The Ubiquitous Standard
- The most prevalent and well-known: the Transaction Processing Performance Council, formed in 1988
- Active benchmarks per the TPC website span multiple domains:
  - Decision Support (OLAP): TPC-DI, TPC-H, TPC-DS (the only ones in scope today)
  - Transaction Processing (OLTP): TPC-C, TPC-E
  - "Big Data": TPCx-HS, TPCx-BB
  - Virtualization: TPCx-V, TPCx-HCI
  - Internet of Things: TPCx-IoT
  - AI: TPCx-AI

Other OLAP Benchmarks: Are TPC Benchmarks the Only Game in Town?
- SSB (Star Schema Benchmark)
  - Designed to measure the performance of databases in a star-schema setup
  - Simpler than the TPC benchmarks, but focused on specific aspects of OLAP querying
- ClickBench, the no-join benchmark
  - Focuses on workloads without joins
  - Simulates scenarios common in clickstream analytics
Lakehouse Focus, 10,000-Foot View: TPC Fragments the Lakehouse Architecture into Separate Benchmarks (TPC-DI, TPC-H, TPC-DS)
- There is no SQL consumption in TPC-DI
- There are no transformations in TPC-H or TPC-DS
- Most "unofficial" results even skip the data loading step altogether

ETL: TPC-DI

TPC-DI: Data Integration (The Ingestion and ETL One)
- ZERO official submissions. Was Databricks first to code it?
- I originally presented a completed benchmark at DAIS 2022
- Not submitted (not for lack of trying)
- Extremely short TL;DR: zero code given
  - Ingest: TXT, CSV, XML
  - Transform: based upon 100+ pages of business rules
  - Load: all 3 medallion layers
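Because TPC-DI provides no code, every implementation starts from scratch by normalizing the three raw formats into a uniform bronze layer. As a rough illustration only (not the authors' implementation; file layouts and field names here are hypothetical), a minimal pure-Python sketch of that ingestion step:

```python
import csv
import io
import xml.etree.ElementTree as ET

def bronze_from_csv(text):
    """Parse CSV text into raw records, tagging each with its source format."""
    return [{"src": "csv", **row} for row in csv.DictReader(io.StringIO(text))]

def bronze_from_txt(text, fields, sep="|"):
    """Parse delimited TXT (no header row) using a supplied field list."""
    return [{"src": "txt", **dict(zip(fields, line.split(sep)))}
            for line in text.splitlines() if line]

def bronze_from_xml(text, record_tag):
    """Parse XML, flattening each record element's attributes."""
    return [{"src": "xml", **elem.attrib}
            for elem in ET.fromstring(text).iter(record_tag)]

# Hypothetical miniature inputs in the three TPC-DI raw formats
csv_rows = bronze_from_csv("trade_id,price\nT1,10.5\nT2,11.0")
txt_rows = bronze_from_txt("T3|9.75\nT4|10.1", fields=["trade_id", "price"])
xml_rows = bronze_from_xml(
    '<Trades><Trade trade_id="T5" price="10.9"/></Trades>', "Trade")

bronze = csv_rows + txt_rows + xml_rows  # one uniform bronze-layer table
```

In a real Lakehouse run these records would land as Delta tables, with silver and gold layers built on top per the benchmark's business rules.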
Is TPC-DI Valuable? A Frustrating Benchmark That Hides Some Real Valuable Insight
The best official ETL benchmark available:
- Robust, even though it was built for legacy DWs
- Business rules make for a realistic test (though it suffers data-quality issues with the data generator at higher scale factors)
- Flexibility in how the rules are coded allows practitioners to optimize to their platform
- Is anyone aware of another "official" ETL benchmark?
The worst official ETL benchmark available:
- No "official" submittals
- Scoring metrics are confusing and do not even allow for cloud platforms
- No provided code means it is extremely frustrating to attempt this benchmark
- Made worse by long, confusing business rules

Excuse Me While I Digress: I Will Speak Longer on TPC-DI Than Originally Planned
Why?
A) Because me
2) Because everybody
D) Because Joe abandoned us

2 Years of TPC-DI on Databricks: From Initial Implementation to Scorching Performance Today
- This is a slide from the DAIS 2022 session in which we announced that the TPC-DI had finally been implemented
- Photon price per billion rows: $1.51
- In April 2023, we published a blog, "How We Performed ETL on One Billion Records For Under a Dollar," to tout the power and TCO of Delta Live Tables on this benchmark
- Photon price per billion rows: $0.96
- A video compiled in September 2023 compares a dbt implementation against CDW competitors
- Photon price per billion rows: $0.73

The Truth is Out There: A Prominent CDW Was Missing
- Some can handle large file sizes, others can't
- We tried benchmarking the other CDW but found it wholly intractable at larger scale factors, since it is the only one unable to split raw files natively
- We weren't the only ones to notice

2 Years of TPC-DI on Databricks (continued)
- Last month (May 2024) we expanded the benchmark to test non-DWs, which required moving off dbt
- Since AWS has been improving EMR over the last few years, it became the obvious first choice for non-dbt tests: 9 Graviton 16-core workers = 144 cores
- 2.2x faster on 1/4 the cores versus 2 years ago: 24 minutes down to 10.75 minutes, 576 cores down to 144 cores
- Improvements from:
  - Photon shifting into overdrive
  - Gradual code and orchestration improvements (no code is provided, so you optimize code to match the platform)
  - Newer generation VMs
  - Other platform enhancements
- 1 year ago: "a billion rows for under a dollar"; today: as low as $0.20 on spot ($0.27 on-demand)
- Price per billion rows over time: Jun-22 $1.51, Apr-23 $0.96, Sep-23 $0.73, May-24 $0.20
- Databricks digression over
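The cost trend in the chart above can be summarized arithmetically; the milestone figures are the deck's own, while the helper function is ours:

```python
# Price per billion rows at each published TPC-DI milestone (from the slides)
milestones = [("Jun-22", 1.51), ("Apr-23", 0.96), ("Sep-23", 0.73), ("May-24", 0.20)]

def total_drop_pct(points):
    """Percentage cost reduction from the first to the last milestone."""
    first, last = points[0][1], points[-1][1]
    return round(100 * (first - last) / first, 1)

print(total_drop_pct(milestones))  # -> 86.8
```

That is roughly an 87% reduction in price per billion rows over the two-year span.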
SQL: TPC-H & TPC-DS

So TPC-H Is Still a Thing? The "OG" OLAP Benchmark
- Released in April 1999 to "fix" issues with TPC-D
- However, the following year the TPC moved to develop a new decision-support benchmark to better reflect modern OLAP implementations
- In January 2012, TPC-DS was released, begging the question of why TPC-H is still used by organizations

How Is TPC-H Constructed? With the "Easy Button"
- Inmon-style DW model with 1 very large table (lineitem) and 7 smaller tables
- All tables contain DATE and STRING columns and are joined using numeric business keys
- Low query complexity: 22 queries with only 1 LEFT JOIN, simple aggregates and subqueries, no nested CTEs, and predicates applied directly to large tables
- Easier tuning complexity: often "super-tuned," with each query getting a perfectly covering index
- Does not require a sophisticated optimizer: needs only join reordering and predicate pushdown
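To make "low query complexity" concrete: TPC-H's flagship Q1 is essentially a filtered group-by with simple aggregates over lineitem. A toy Python equivalent over hypothetical rows (not the official query or data):

```python
from collections import defaultdict

# Hypothetical miniature of lineitem: (returnflag, linestatus, quantity, extendedprice)
lineitem = [
    ("A", "F", 10, 100.0),
    ("A", "F", 5, 50.0),
    ("N", "O", 7, 70.0),
]

# Q1-style aggregation: GROUP BY (returnflag, linestatus), SUM(quantity), SUM(price)
groups = defaultdict(lambda: {"sum_qty": 0, "sum_price": 0.0})
for flag, status, qty, price in lineitem:
    g = groups[(flag, status)]
    g["sum_qty"] += qty
    g["sum_price"] += price

print(groups[("A", "F")])  # -> {'sum_qty': 15, 'sum_price': 150.0}
```

Nothing here requires join reordering, CTE planning, or windowing, which is why a modest optimizer handles TPC-H well.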
TPC-DS: The Popular Kid in School (The One We All Fight Over)
- First published in 2012 to counter the aging TPC-H's limitations and tackle current OLAP trends
- A benchmark that is often misused/misrepresented by skipping one or more of the 3 components (Load Test, Throughput Test, and Data Maintenance Test) that were designed to ensure operational considerations aren't forgotten for the sake of over-indexing on this benchmark's SQL queries
- A multi-cloud data warehouse platform even publishes a highly tuned, preloaded TPC-DS dataset in all deployed warehouses for users to consume
- Still valuable in a vacuum, when results can be trusted and validated

How Is TPC-DS Constructed? A Retailer Selling Goods via 3 Distribution Channels: Store, Catalog, Internet
- Based on Kimball dimensional modeling (The Data Warehouse Toolkit)
- Replaced TPC-H's 3NF approach with a hybrid between 3NF and star schema, or a "multiple snowflake schema"
- Significantly more complicated than TPC-H
- Heavy on advanced SQL features/functions and lopsided filters
- 99 queries compared to a meager 22 in TPC-H
- 4 query classes: pure reporting queries, pure ad-hoc queries, iterative OLAP queries, and extraction or data mining queries

Feature Comparison: TPC-H & TPC-DS (an Easier-to-Consume Cheat Sheet)
- Data Model
  - TPC-H: simpler schema, Inmon-style DW model
  - TPC-DS: complex schema, Kimball-style dimensional model
- Schema
  - TPC-H: 1 very large table (lineitem) plus 7 smaller tables
  - TPC-DS: 6 fact tables (3 _sales, 3 _returns) plus 18 dimension tables
- Data Types
  - TPC-H: all tables contain DATE and STRING columns; tables are joined using numeric business keys
  - TPC-DS: fact tables use only INTEGER and NUMERIC columns; only dimension tables use TIMESTAMP and STRING; tables are joined using numeric surrogate keys
- Query Complexity
  - TPC-H: 22 queries, low complexity; only 1 LEFT JOIN; only simple aggregates; subqueries are simple, no nested CTEs; predicates applied directly to large tables
  - TPC-DS: 99 queries, high complexity; 9 queries use LEFT JOIN and 3 use a cross join; complex aggregates, with 15 queries using window functions; complex nested CTEs in most queries; predicates applied only to dimension tables
- Tuning Complexity
  - TPC-H: easier for vendors to tune; often "super-tuned" with perfect indexes; does not require a sophisticated optimizer
  - TPC-DS: harder for vendors to tune; optimizing a specific query can make others slower; requires a sophisticated query optimizer

Is TPC-H Valuable? Best for Ad-Hoc Manual Benchmarking ("Easy Peasy Man": Too Simple and Easy to Shortcut. Been Replaced!)
Pros:
- Simpler schema, easier to understand and manage
- Fewer benchmark queries, and they are easy to understand
- Tables contain DATE and STRING columns that are used as predicates
Cons:
- Less realistic; not representative of more complex modern data warehousing needs
- Simple queries do not reflect the hyper-complex real-world queries from tools like Tableau and dbt
- The simple schema does not reflect best practices such as SCD Type 2

Is TPC-DS Valuable? Best for Vendor-Supported POC Evaluations (the Most Modern of the Common SQL Benchmarks; Complexity & Popularity Result in Missed Stages)
Pros:
- Complex, realistic schema that better mimics enterprise data warehouses
- Covers a broad spectrum of query types, SQL operators, and complex joins
- Requires a sophisticated optimizer, testing more capabilities
Cons:
- Higher complexity in setup and longer time to implement and tune
- Many complex queries can make the results hard to evaluate
- Can require significant resources to fully utilize and to understand performance implications
TPC-? How Do We Get From Here to a Full Lakehouse Benchmark? Benchmark It All

The State of "Lakehouse" Benchmarks: No Real "Official" Lakehouse Benchmark
- Each benchmark focuses on only a portion of an end-to-end Lakehouse platform
- This favors bias and "shortcuts" to improve performance
- It also reveals flaws in keeping benchmarks current; example: TPC-DI has no way to calculate its benchmarked metric for cloud platforms
- LHBench: a Berkeley white paper implementing a Lakehouse benchmark on EMR, composed of 4 tests:
  - TPC-DS
  - TPC-DS Refresh
  - Merge Microbenchmark
  - Large File Count
- The pattern appears sound and it's a great start, but it is not "official"
- The community may want to move beyond TPC-DS as the core of the benchmark. How can the industry do better?
Thought Experiment: Cluster TPC-DI. How Costly Is Optimizing, and What Can Be Learned to "Balance" a Lakehouse Benchmark?
- Modified to run OPTIMIZE on all fact tables
- Adjusting for cluster start times, it takes 44% longer: 15m 29s (adjusted for the cluster start difference)
- $6.42 on on-demand nodes ($4.92 on spot)
- Despite tuning the tables, this price is still less than half the price of EMR, a third the price of BigQuery, and over 15x cheaper than any other CDW
- Is this worth it? The answer is always the same: yes, if the consumption savings are greater than the costs to optimize the data
- Autostats feature: stats on write! Leverages Liquid Clustering
- Point-lookup, ad-hoc-type query: 0.35s optimized vs. 10.2s not optimized, a roughly 30x improvement
- BI-like query using dimensional filtering and dynamic file pruning: a 20x task-time improvement over not optimized
- How do we balance the query load against the latency that OPTIMIZE adds in TPC-DI? Back-of-the-napkin math:
  - Conservatively assume 2x performance gains for SQL
  - Assume OPTIMIZE adds 5 minutes to the ETL (at the 10k scale factor), so we need to save at least 5 minutes in queries
  - With a 2x improvement, approximately 10 minutes of query time on non-optimized tables becomes 5 minutes on optimized tables, which is the break-even point
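The napkin math above generalizes to a one-line break-even formula; the 2x speedup and 5-minute OPTIMIZE overhead are the deck's assumptions, the helper is ours:

```python
def breakeven_query_minutes(optimize_overhead_min, sql_speedup):
    """Minutes of non-optimized query time at which OPTIMIZE pays for itself.

    Savings are t - t/speedup, and these must cover the extra ETL time
    spent optimizing: t - t/s = overhead  =>  t = overhead * s / (s - 1).
    """
    return optimize_overhead_min * sql_speedup / (sql_speedup - 1)

# Deck's assumptions: OPTIMIZE adds ~5 min of ETL, SQL gets ~2x faster
print(breakeven_query_minutes(5, 2))  # -> 10.0
```

Any workload that spends more than that on non-optimized tables comes out ahead; note that a larger speedup lowers the break-even threshold.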
AI Benchmarking
(https://harvard-edge.github.io/cs249r_book/contents/benchmarking/benchmarking.html)

Why Benchmark in AI/ML?
- Standardized methods allow us to quantitatively know the capabilities of different models, software, and hardware, enabling fair comparisons across different solutions
- They allow ML developers to measure inference time, memory usage, power consumption, and other metrics that characterize a system
- Goals and objectives: performance assessment, resource evaluation, validation and verification, competitive analysis, credibility, regulation and standardization

What to Benchmark in AI/ML? How Does One Benchmark Something So Subjective?
- 3 primary categories: hardware/system, model, data
- Granularity: micro, macro, end to end
- Training vs. inference

In an LLM Not Far Away...
Keeping Pace and Choosing Wisely: Benchmarks Are Rapidly Created and Deprecated. What Can Mosaic's Gauntlet Teach Us?
- According to Stanford's 2024 AI Index Report, 15 benchmarks were deprecated in 2023 alone, many of which were less than 4 years old, while 18 new benchmarks were added in 2023
- The "Mosaic Evaluation Gauntlet" (blog) evaluated 39 public benchmarks split across 6 core competencies
- "In order to prioritize the metrics that are most useful for research tasks across model scales, we tested the benchmarks using a series of increasingly advanced models"

Challenges and Trends: Human Evaluation Is "In"
- Practitioners are growing incredibly skeptical about academic benchmarks
- There are habitual issues with overfitting models to existing benchmarks: MMLU, HumanEval, and HellaSwag are bona fide benchmarks, but model creators game the system so their models do well on them
- Accordingly, practitioners today tend to prefer evaluating their LLM options by human preference in the real world, like LMSYS
- The Stanford HAI report even points out that "human evaluation is in" (Chapter 2)
- LMSYS allows users to vote on the better response based on a prompt they provide to the LLMs (the user is blind to which models they're given)

Q&A
68、ut model creators game the system for models to do well on themAccordingly,practitioners today tend to prefer evaluating their LLM options by human preference in the real-world-like LMSYSThe HAI Stanford Report even points out“human evaluation is in”(Chapter 2)LMSYS:Allows users to vote on the better response based on a prompt they provide to the LLMs-the user is blind to the choice of the models theyre given).Challenges and TrendsChallenges and Trends47Human evaluation is“in”Human evaluation is“in”2024 Databricks Inc.All rights reserved48Q&AQ&A