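Large-Scale Distributed Machine Learning Practice Based on Apache* Spark*
Zhichao Li, Senior Software Development Engineer, Intel Corporation

Legal Notices
Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Learn more from the original equipment manufacturer or retailer.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration may affect actual performance. Consult other sources of information to evaluate the performance of systems or components you are considering purchasing. For more complete information about performance and benchmark results, visit http:/
Tests such as SYSmark and MobileMark are measured using specific computer systems, hardware, software, operating systems and functions. Any change to any of those factors may cause the results to vary. Consult other information and performance tests, including performance when combined with other products, to fully evaluate the target product. For more information, visit http:/
Intel's SEC filings, including the annual report on Form 10-K, contain a detailed discussion of the factors that could affect Intel's results and plans. All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. The products described may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
Intel, the Intel logo, and the Intel. Experience What's Inside logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. Copyright 2016 Intel Corporation. All rights reserved.

Content
- Project Overview
- Distributed ML on Spark
  - Fraud Detection: End-to-End Solution for Top Payments Company
  - Large-scale, Sparse Logistic Regression for Click-through and Purchase Rate Predictions
  - Deep (Convolutional) neural network
- Infrastructure support for distributed ML
  - Parameter server

Project Overview
- Research and open source project initiated by UC Berkeley AMPLab
- Intel is closely collaborating with AMPLab and the community on open source development
  - One of the earliest adopters of Spark* (since 2012)
  - Many key contributions (Netty shuffle, FairScheduler, "yarn-client" mode, ...)
  - Collaborating on other components in BDAS (e.g., Tachyon*, SparkR, ...)
- Intel is partnering with many "web-scale" companies
  - Free! No commercial solution or consultations
  - Online-LDA, Word2Vec (merged)
  - SparseML (separate package)
  - E.g., Tencent, PayPal*, Alibaba*, Baidu*/iQiyi, JD.com, Youku*, etc.

[Figure: BDAS, the Berkeley Data Analytics Stack (Ref: https://amplab.cs.berkeley.edu/software/). Components shown: Spark Streaming, Spark Core, SampleClean, G-OLA, BlinkDB, Spark SQL, Velox*, SparkR, GraphX, Splash, MLBase, MLlib, ML Pipelines, Mesos*, Hadoop* Yarn, HDFS, S3, Ceph*, Tachyon*, Succinct. Legend: AMPLab Developed, Spark Community, In Development, 3rd Party.]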

Fraud Detection on Apache Spark
Goal: given transaction details, classify whether a transaction is fraud or normal.
Evaluation metrics
- Recall = correctly predicted fraud / all real fraud transactions
- Precision = correctly predicted fraud / all predicted fraud
Fraud can mean:
- Buying with stolen credit cards
- Abusing promotional programs
- Account takeover
- Spamming other users

Intel Customer Story
Problem statement and pain points
- An old rule-based system that needs significant improvement
- Turn to Spark for data statistics and model training
- Need a neural network for fraud detection on their Spark 1.4 cluster
Intel solution
- Implement a neural network on Spark and help integrate it
Business result
- The neural network model performs better than the other algorithms
- The machine learning system overtakes the rule-based system and exceeds expectations
- Precision improves by 15%, recall improves by 30%

Solution Architecture Overview
[Figure: training data (normal and fraud) is split into sampled partitions; one model is trained per sampled partition, using all features or selected features, and the resulting models are combined. Stages shown: Spark* DataFrame / Hive* Table, Preprocessing, Feature Engineering, Feature Selection, Model Training (Neural Net Model), Model Ensemble, Model Evaluation & Fine Tune (model candidate), Pre-Processing and Post-Processing inside a Spark Pipeline, Test Data, Predictions.]

Application Tool Stack Overview
[Figure: tool stack for the Fraud Detection application (driver) on the Apache* Spark ML Pipeline. Stages: Pre-processing, Feature Engineering, Feature Selection, Model Training, Model Ensemble, Model Evaluation & Fine Tune, Post-Processing. Components: Sampling Utility, Spark* SQL, Statistics, String Indexer, OneHotEncoder, WOE, Quantile Discretizer, Standardizer, Step-wise Feature Selector, Estimator, Neural Net Model, Bagging Utility, Grid Search, Cross Validation, Binary Class Evaluator, Model Selector. Legend: Spark Community, Intel Developed, Intel Improved, In Development.]
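
The recall and precision definitions from the fraud-detection goal can be illustrated with a few lines of plain Scala. This is only a minimal sketch: the object and method names are hypothetical, labels are encoded as `true` = fraud, and nothing here reflects the actual evaluator used in the project.

```scala
// Minimal sketch: precision and recall for a binary fraud classifier.
// Assumption: true = fraud, false = normal. All names are illustrative.
object FraudMetrics {
  final case class Counts(truePos: Int, falsePos: Int, falseNeg: Int)

  // Count outcomes over (predicted, actual) pairs.
  def count(pairs: Seq[(Boolean, Boolean)]): Counts =
    pairs.foldLeft(Counts(0, 0, 0)) {
      case (c, (true, true))  => c.copy(truePos = c.truePos + 1)
      case (c, (true, false)) => c.copy(falsePos = c.falsePos + 1)
      case (c, (false, true)) => c.copy(falseNeg = c.falseNeg + 1)
      case (c, _)             => c // true negatives affect neither metric
    }

  // Recall = correctly predicted fraud / all real fraud.
  def recall(c: Counts): Double =
    if (c.truePos + c.falseNeg == 0) 0.0
    else c.truePos.toDouble / (c.truePos + c.falseNeg)

  // Precision = correctly predicted fraud / all predicted fraud.
  def precision(c: Counts): Double =
    if (c.truePos + c.falsePos == 0) 0.0
    else c.truePos.toDouble / (c.truePos + c.falsePos)

  def main(args: Array[String]): Unit = {
    val pairs = Seq((true, true), (true, false), (false, true), (true, true), (false, false))
    val c = count(pairs)
    println(f"recall=${recall(c)}%.3f precision=${precision(c)}%.3f")
  }
}
```

Because fraud is rare, both numbers matter: a model that flags everything has perfect recall and very poor precision.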

Logistic Regression on Spark* with Mini-Batch SGD
"Canonical" implementation. Repeat:
1. Driver broadcasts W to each worker
2. Workers compute the gradient for the next batch of B records from the training set
   - Each task (running on workers) samples records from its data partition
   - Each task computes a local gradient
3. Gradients are aggregated (possibly through tree aggregation)
4. Driver updates the weights
[Figure: the training set is split into Partition 1, Partition 2, ..., Partition n; each worker samples from its partition; arrows numbered 1 to 4 between the driver and the workers correspond to the steps above.]

Network and Memory Bottlenecks
Click-through and purchase rate predictions
- Adopted by top internet companies
- Model size: 100s of millions to billions of unique features
  - Weight (W) and gradient (G) are both double vectors, one entry for each unique feature
- Training data: billions to trillions of training samples
  - Partitioned and cached across workers
Cost per iteration (800 MB corresponds, for example, to 100 million double entries at 8 bytes each):
- Broadcast W (800 MB) to each worker in each iteration
- Each task computes G (800 MB) in each iteration
- Each task sends G (800 MB) for aggregation in each iteration
- Training samples are cached in worker memory

Sparse Logistic Regression
Solution for the same click-through and purchase rate workload:
- Data cached in a sparse format
- float16 values (instead of double values)
- Extra support for binary (0 or 1) values
- Only calculate and sync the gradient for non-zero data
- Better communication
[Figure: same driver/worker diagram as above, annotated with: Gradient: sparse vector; Compacted network communication; Data cached using advanced encoding.]
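
The four-step loop and the "only calculate and sync the gradient for non-zero data" idea can be sketched in plain Scala. This is a single-process illustration under loud assumptions: in-memory sequences stand in for Spark partitions, `Double` is used instead of float16, and every name (`Example`, `localGradient`, `train`, `denseVectorBytes`) is hypothetical rather than taken from Intel's SparseML package.

```scala
import scala.collection.mutable
import scala.util.Random

object SparseLogisticRegressionSketch {
  // One training record in sparse form: only non-zero features are stored.
  final case class Example(indices: Array[Int], values: Array[Double], label: Double)

  // Size of a dense double vector, e.g. 100 million features -> 800,000,000 bytes.
  def denseVectorBytes(numFeatures: Long): Long = numFeatures * 8L

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Dot product that touches only the record's non-zero entries.
  def margin(w: Array[Double], ex: Example): Double = {
    var s = 0.0
    var i = 0
    while (i < ex.indices.length) { s += w(ex.indices(i)) * ex.values(i); i += 1 }
    s
  }

  // Local log-loss gradient for one batch, kept sparse (feature index -> sum).
  def localGradient(w: Array[Double], batch: Seq[Example]): Map[Int, Double] = {
    val g = mutable.Map.empty[Int, Double]
    for (ex <- batch) {
      val err = sigmoid(margin(w, ex)) - ex.label
      var i = 0
      while (i < ex.indices.length) {
        val k = ex.indices(i)
        g.update(k, g.getOrElse(k, 0.0) + err * ex.values(i))
        i += 1
      }
    }
    g.toMap
  }

  // Merge two sparse gradients (stands in for the aggregation step).
  def merge(a: Map[Int, Double], b: Map[Int, Double]): Map[Int, Double] =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0.0) + v) }

  // Assumes every partition is non-empty.
  def train(partitions: Seq[IndexedSeq[Example]], numFeatures: Int, batchPerPartition: Int,
            stepSize: Double, iterations: Int, seed: Long): Array[Double] = {
    val rng = new Random(seed)
    val w = new Array[Double](numFeatures)
    for (_ <- 0 until iterations) {
      // Steps 1-2: each partition sees the current w, samples a batch,
      // and computes a sparse local gradient.
      val grads = partitions.map { part =>
        val batch = Seq.fill(batchPerPartition)(part(rng.nextInt(part.length)))
        localGradient(w, batch)
      }
      // Step 3: aggregate the sparse gradients.
      val total = grads.reduce((a, b) => merge(a, b))
      // Step 4: the "driver" updates only the weights that were touched.
      val n = (batchPerPartition * partitions.length).toDouble
      total.foreach { case (k, v) => w(k) -= stepSize * v / n }
    }
    w
  }

  def main(args: Array[String]): Unit = {
    val a = Example(Array(0), Array(1.0), 1.0) // feature 0 present -> positive
    val b = Example(Array(1), Array(1.0), 0.0) // feature 1 present -> negative
    val w = train(Seq(Vector(a), Vector(b)), 3, 2, 0.5, 50, 1L)
    println(w.mkString(", "))
    println(s"dense W for 100M features: ${denseVectorBytes(100000000L)} bytes")
  }
}
```

The point of the sparse map is that the amount of data exchanged per iteration scales with the number of non-zero features in the batch rather than with the full model size.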

Training a Neural Network
Multi-Layer Perceptron (MLP)
- Fully connected, feed-forward
Deep learning
- CNN, autoencoder, RBM, etc.
Distributed neural network training. Repeat:
1. Driver broadcasts parameters (weights & biases) to each worker
2. Workers process the next batch of B records from the training set
   - Each task (running on workers) samples records from its data partition
   - Each task computes the forward and backpropagation pass
3. Driver aggregates the gradient
4. Driver updates parameters (weights & biases)
[Figure: the training set is split into Partition 1, Partition 2, ..., Partition n; each worker samples from its partition; arrows numbered 1 to 4 between the driver and the workers correspond to the steps above.]
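
A rough single-process sketch of the loop above, for a one-hidden-layer MLP. Everything here is an assumption made for illustration: the flat parameter layout, the tanh/sigmoid activations, the cross-entropy loss, and the use of each whole partition as its batch (no sampling) to keep the run deterministic. It is not the implementation described in the talk.

```scala
object DistributedMlpSketch {
  final case class Sample(x: Array[Double], y: Double)

  // Flat parameter layout: W1 (hidden*in), b1 (hidden), W2 (hidden), b2 (1).
  final case class Shape(in: Int, hidden: Int) {
    val w1 = 0
    val b1 = hidden * in
    val w2 = b1 + hidden
    val b2 = w2 + hidden
    val size = b2 + 1
  }

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Forward pass: returns hidden activations and the output probability.
  def forward(p: Array[Double], s: Shape, x: Array[Double]): (Array[Double], Double) = {
    val a1 = Array.tabulate(s.hidden) { j =>
      var z = p(s.b1 + j)
      for (i <- 0 until s.in) z += p(s.w1 + j * s.in + i) * x(i)
      math.tanh(z)
    }
    var z2 = p(s.b2)
    for (j <- 0 until s.hidden) z2 += p(s.w2 + j) * a1(j)
    (a1, sigmoid(z2))
  }

  // Forward + backpropagation over a batch: returns the summed gradient.
  def localGradient(p: Array[Double], s: Shape, batch: Seq[Sample]): Array[Double] = {
    val g = new Array[Double](s.size)
    for (smp <- batch) {
      val (a1, out) = forward(p, s, smp.x)
      val dz2 = out - smp.y // sigmoid output with cross-entropy loss
      g(s.b2) += dz2
      for (j <- 0 until s.hidden) {
        g(s.w2 + j) += dz2 * a1(j)
        val dz1 = dz2 * p(s.w2 + j) * (1.0 - a1(j) * a1(j)) // tanh derivative
        g(s.b1 + j) += dz1
        for (i <- 0 until s.in) g(s.w1 + j * s.in + i) += dz1 * smp.x(i)
      }
    }
    g
  }

  // One synchronous step: every partition reads the same p ("broadcast"),
  // gradients are summed ("driver aggregates"), then p is updated in place.
  def step(p: Array[Double], s: Shape, partitions: Seq[Seq[Sample]], lr: Double): Unit = {
    val grads = partitions.map(part => localGradient(p, s, part))
    val n = partitions.map(_.size).sum.toDouble
    val total = grads.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
    for (k <- p.indices) p(k) -= lr * total(k) / n
  }

  def loss(p: Array[Double], s: Shape, data: Seq[Sample]): Double =
    data.map { d =>
      val (_, out) = forward(p, s, d.x)
      -(d.y * math.log(out) + (1.0 - d.y) * math.log(1.0 - out))
    }.sum / data.size

  // Small non-zero deterministic initialisation (all-zero weights would stall).
  def init(s: Shape): Array[Double] = Array.tabulate(s.size)(k => 0.1 * math.sin(k + 1.0))

  def main(args: Array[String]): Unit = {
    val s = Shape(2, 3)
    val parts = Seq(
      Seq(Sample(Array(0.0, 0.0), 0.0), Sample(Array(0.0, 1.0), 0.0)),
      Seq(Sample(Array(1.0, 0.0), 1.0), Sample(Array(1.0, 1.0), 1.0)))
    val p = init(s)
    println(s"loss before: ${loss(p, s, parts.flatten)}")
    for (_ <- 0 until 300) step(p, s, parts, 0.5)
    println(s"loss after:  ${loss(p, s, parts.flatten)}")
  }
}
```

Keeping all parameters in one flat array makes the aggregation step a plain element-wise sum, which is what an all-to-one or tree aggregation would operate on.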

Distributed Neural Network
- Built on top of standard Big Data platforms
  - Easily utilize your existing clusters
- Engaging industry users and the community early
  - Evolving with feedback from real-world use cases
  - Community version compatible with Spark* MLP
- Targeting full function coverage:
  - Auto Encoder, Sparse Encoder
  - Convolution with max and avg pooling
  - RBM and DBN
- Benchmark with popular datasets/models
  - GoogleNet, AlexNet on ImageNet
- Easy MKL integration for Intel Architecture acceleration
- Better communication: All-to-one, All-reduce on Spark (CaffeOnSpark), ParameterServer
- Free community license (https:/

Intuitive API with layer-based interface

    // Load the training set (loadData is defined elsewhere)
    val trainData = loadData()
    // Build the network layer by layer
    val model = new Sequential()
    model += new Convolution()
    model += new maxPooling()
    // Loss function
    val criterion = new ClassNLLCriterion()
    // Distributed optimizer using SGD
    val optimizer = new ParallelOptimizer(model, new SGD)
    // Report accuracy during validation (evaluator is defined elsewhere)
    optimizer.setCrossValidation(evaluator.accuracy)
    // Where to save the trained model
    optimizer.setPath("./model_save.obj")
    optimizer.optimize(trainData)

Flaw detection in steel product
[Figure: network used for flaw detection: Convolution (5,5), Maxpooling (2,2,2,2), Convolution (5,5), Maxpooling (2,2,2,2), followed by two fully connected (FC) layers. Other labels in the figure: 10/11, 300, 100, 5.]

Pipeline
[Figure: Pre-process, then Defect Proposal Algorithm 1 and Defect Proposal Algorithm 2 (Proposal), then Classification Model, with outputs Defect and Normal. Other label in the figure: 10/11.]
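
To make the layer parameters in the flaw-detection network concrete, the sketch below computes how the spatial size shrinks through the Convolution (5,5) and Maxpooling (2,2,2,2) layers. Several things are assumptions, not facts from the slides: the 32x32 input size, stride 1 and no padding for convolutions, and reading the pooling arguments as (window height, window width, stride height, stride width).

```scala
object ConvShapeSketch {
  // Output length along one dimension for a window of size k moved with
  // stride s and no padding.
  def outSize(in: Int, k: Int, s: Int): Int = {
    require(in >= k, s"input $in is smaller than window $k")
    (in - k) / s + 1
  }

  sealed trait Layer
  final case class Conv(kH: Int, kW: Int) extends Layer // stride 1 assumed
  final case class MaxPool(kH: Int, kW: Int, sH: Int, sW: Int) extends Layer

  // Spatial (height, width) after each layer, starting with the input size.
  def shapes(h: Int, w: Int, layers: Seq[Layer]): Seq[(Int, Int)] =
    layers.scanLeft((h, w)) {
      case ((ch, cw), Conv(kh, kw)) =>
        (outSize(ch, kh, 1), outSize(cw, kw, 1))
      case ((ch, cw), MaxPool(kh, kw, sh, sw)) =>
        (outSize(ch, kh, sh), outSize(cw, kw, sw))
    }

  def main(args: Array[String]): Unit = {
    val layers = Seq(Conv(5, 5), MaxPool(2, 2, 2, 2), Conv(5, 5), MaxPool(2, 2, 2, 2))
    // Hypothetical 32x32 input; the real input size is not given in the slides.
    shapes(32, 32, layers).foreach { case (h, w) => println(s"${h}x${w}") }
  }
}
```

With the assumed 32x32 input the sizes go 32, 28, 14, 10, 5, so the first fully connected layer would see a 5x5 map per feature channel.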

Communication Model
[Figure: three communication patterns between tasks: All to one (every task sends to the driver), All reduce (tree aggregation among tasks), and Parameter Server.]

"Parameter Server" support?
- Very large scale model/graph (billions of unique features)
- Leveraging further data sparsity in each worker (only a subset of the weight vector is needed)
- Possible weakly-synchronized model (BSP vs. SSP vs. ASP, etc.)
- Distributed parameter aggregation & update in parallel
- Easy integration with Apache Spark*
- Fault tolerance
- Co-partitioning
Source: Dean J, Corrado G, Monga R, et al. Large scale distributed deep networks. Advances in Neural Information Processing Systems, 2012: 1223-1231.

Reference & Resources
- Intel packages: https:/
- Analytics: https:/
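
As a rough illustration of the "Parameter Server" points listed earlier (workers need only a subset of the weight vector; aggregation and update are spread over several servers), here is a tiny in-memory sketch. All class and method names are hypothetical, sharding by index modulo the shard count is an assumption, and there is no networking, asynchrony, or fault tolerance.

```scala
import scala.collection.mutable

object ParameterServerSketch {
  // One shard stores the weights assigned to it; missing weights read as 0.0.
  final class Shard {
    private val weights = mutable.Map.empty[Int, Double]

    def pull(keys: Iterable[Int]): Map[Int, Double] =
      keys.map(k => k -> weights.getOrElse(k, 0.0)).toMap

    // Apply a sparse gradient with a plain SGD update.
    def push(grad: Map[Int, Double], lr: Double): Unit =
      grad.foreach { case (k, g) => weights(k) = weights.getOrElse(k, 0.0) - lr * g }
  }

  final class Server(numShards: Int) {
    private val shards = Vector.fill(numShards)(new Shard)
    private def shardOf(k: Int): Int = java.lang.Math.floorMod(k, numShards)

    // A worker asks only for the feature indices present in its batch.
    def pull(keys: Set[Int]): Map[Int, Double] =
      keys.groupBy(k => shardOf(k)).flatMap { case (s, ks) => shards(s).pull(ks) }

    // Each shard updates its own slice of the model independently.
    def push(grad: Map[Int, Double], lr: Double): Unit =
      grad.groupBy { case (k, _) => shardOf(k) }
        .foreach { case (s, g) => shards(s).push(g, lr) }
  }

  def main(args: Array[String]): Unit = {
    val server = new Server(3)
    server.push(Map(5 -> 1.0, 7 -> -2.0), 0.5)
    println(server.pull(Set(5, 7, 9)))
  }
}
```

Compared with broadcasting the whole weight vector from the driver, each pull and push here moves only the entries a worker actually touches.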