NVIDIA Merlin NVTabular: Best Practices for GPU-Accelerated Feature Engineering in Recommender Systems
黃孟迪, Deep Learning Engineer, NVIDIA
#page#
RELATED SESSIONS IN GTC CHINA
Learning More About NVIDIA Merlin
- [CNS20590] Merlin: A GPU-Accelerated Recommender System Framework. 王澤衰, APAC AI Developer Technology Manager, NVIDIA
- [CNS20516] Merlin HugeCTR: A Deep Dive into Performance Optimization. Minseok Lee, GPU Computing Expert, NVIDIA
- [CNS20624] Merlin NVTabular: Best Practices for GPU-Accelerated Feature Engineering in Recommender Systems. 黃孟迪, Deep Learning Engineer, NVIDIA
- [CNS20813] GPU-Accelerated Data Processing in Recommender Systems. 魏英燦, GPU Computing Expert, NVIDIA
- [CNS20377] Integrating HugeCTR Embedding into TensorFlow. 董建兵, GPU Computing Expert, NVIDIA
- [CNS20626] Accelerating CTR Inference with a GPU Embedding Cache. 郁凡, GPU Computing Expert, NVIDIA
#page#
Agenda
- Merlin Overview
- NVTabular - Merlin ETL
- Tutorials - Best Practices For RecSys Feature Engineering
  - Goal 1: Improving Model Accuracy
  - Goal 2: Quick Experimentation with GPU Acceleration
  - Goal 3: Scale to Production Systems With NVTabular
#page#
NVIDIA Merlin Overview
#page#
Industrial Recommendation Challenges
- Feature Exploration: Multiple iterations can consume a lot of time to find the most accurate feature set.
- Dataloading: Tabular data scales poorly using the common deep learning method of item by item.
- Embedding Tables: Large embedding tables require significant memory, and lookups can have extraneous operations.
- Training High Accuracy: Longer iteration cycles reduce the ability to reach higher accuracies as quickly.
- Deployment: High throughput to rank more items is difficult while maintaining low latency.
#page#
Merlin Framework Benefits
- NVTabular, Feature Engineering: Fast iteration time. Prepare massive datasets in minutes, allowing for more exploration and better models.
- NVTabular, Dataloading: Accelerate tabular data loading into training frameworks. Asynchronous batch dataloading means the GPU is always utilized.
- HugeCTR, Scaling Training: Optimal lookups implementation. Easy to use data and model parallel training allow you to scale to TB sized embeddings.
- HugeCTR, High Accuracy: Reach higher accuracy faster. Shorten exploration and training cycles to reach higher accuracies sooner.
- Triton, Deployment: High throughput, low-latency production deployment. Inference time data transforms and multi model support provide maximum throughput with latency constraints.
#page#
NVIDIA Merlin ETL
NVTabular - Fast Feature Transforms & Dataloading of Tabular Data on GPU
#page#
Day In The Life Of A Data Scientist
The average data scientist spends 75% of their time in ETL.
[Figure: a working day compared between a CPU-powered workflow (regular compute) and a GPU-powered workflow (accelerated compute). Labels include: dataset downloads overnight, start data prep workflow, find unexpected null values stored as string, restart data prep workflow, configure data prep workflow, train model, validate, test model, forgot to add a feature, restart data prep, stay late, go home on time. Pipeline stages: dataset collection, analysis, data prep, train, inference.]
#page#
NVTabular: Recommender System ETL on GPU
Merlin ETL: NVTabular
What it is:
Feature engineering and preprocessing library designed to quickly and easily manipulate terabytes of tabular data.
What it's capable of:
- Scale: No limit on dataset size (not bound by GPU or CPU memory).
- Speed: GPU acceleration, 10x speedup compared to CPU, eliminating the input bottleneck.
- Usability: Higher level abstraction, recommender systems oriented; fewer API calls are required to accomplish the same processing pipeline.
- Interoperability with PyTorch, TensorFlow, and HugeCTR.
[Diagram: ETL (Extract, Transform, Load).]
#page#
RecSys Pipeline Example using NVTabular
[Diagram, offline part: building of the dataset, loading of the dataset, training. Data Lake -> Feature Engineering (NVTabular) -> ML-ready datasets -> PreProc of framework-specific training data (NVTabular) -> Dataloader (PT/TF/HugeCTR) -> Model Training. Size annotations on the stages: up to 1 EB, up to 10 PBs, up to 1 PB, up to 100 TBs. Outputs: PreProc config, model weights.]
[Diagram, online part: Web Services (social media, ads, bookings, fraud...) receive requests, prepare data and return recs. Candidate Generation Server: find high-recall candidates for further ranking. Online Inference Server (TRITON), up to 1 billion inferences/second with a low-latency budget: NVTabular online feature engineering and preprocessing (inference time transformation of the data), then model inference (TRT). Data exchanged as tensors, DataFrames, dmatrix, etc.]
#page#
Latest Merlin Release - NVTabular 0.3
Core Features
- Multi-GPU support using Dask cuDF
- Dataloaders for PyTorch, TensorFlow Keras, HugeCTR
- A100 support: RAPIDS 0.15, CUDA 11
- Data input: S3, GCS, HDFS. Formats: ORC
- Multi node support
- Multi hot support
- 10 new operators from the RecSys 2020 win: TargetEncoding, DifferenceLag, ColumnSimilarity, Dropna, Filter, FillMedian, HashBucket,
  JoinGroupby, JoinExternal, LambdaOp
#page#
NVTabular Data Loaders
NVTabular v0.3+
[Diagram: Parquet -> read in file in block chunks -> shuffle blocks in GPU memory -> iterable batch dataloader (GPU memory) -> repeat until finished with the dataset.]
TensorFlow Dataloader and PyTorch Dataloader
Preliminary results:
- Up to 4x end-to-end improvement compared to native PyTorch data loader
- [...] improvement compared to native TF data loader
Workflow
- Read large chunks of data into a dedicated segment of GPU memory buffer
- Shuffle in memory buffer (per buffer size)
- Move batch-sized tensors into framework
Benefits
- Removes dataloading bottleneck
- Eliminates the less efficient item/file level reads
- Enables larger sized batches and files
Features
- Read large chunks of data into GPU memory
- Shuffle in GPU memory buffer (per buffer size)
- Move batch-sized tensors into framework
Benefits
- No item level reads of files/memory
- Allows for higher sizes
#page#
Example NVTabular API Workflow
100x fewer lines of code required
Steps annotated on the code:
- Specify which variables are Categorical and which are Continuous.
- Define the location of the training and validation set.
- Encode Categoricals using the defined thresholds.
- Log transform the Continuous variables, zero filling any nulls.
- Apply the operations, creating a new shuffled training dataset and a validation dataset.
[Code screenshot, only partially legible: it imports glob, initializes an nvt.Workflow with categorical, continuous and label column names, creates nvt.Dataset objects from glob.glob(...) file lists for the training and validation sets, adds feature and preprocessing operators (for example proc.add_cont_preprocess(nvt.ops.Normalize()) and an operator with a threshold of 15), and calls proc.apply(train_dataset, shuffle=True, output_path=...) and proc.apply(valid_dataset, shuffle=False, ...).]
The Pandas/NumPy example provided by DLRM is 1200 lines of code; NVTabular's high-level API is 10-20 lines of code.
#page#
Feature engineering operator support
Categorify: The Categorify operation can be added to
the workflow to transform categorical features into unique integer values.
Clip: This operation clips continuous values so that they are within a min/max bound.
FillMissing: This operation replaces missing values with a constant pre-defined value.
LogOp: This operator calculates the log of continuous columns.
Moments: The Moments operation calculates some of the statistics of features, including mean, variance, standard deviation, and count.
MinMax: The MinMax operation calculates min and max statistics of features.
Normalize: This operation can be added to the workflow to standardize the features.
NormalizeMinMax: This operation can be added to the workflow to rescale the features using their min and max values.
TargetEncoding: Target encoding is a common feature-engineering technique for categorical columns in tabular datasets.
Median: This operation calculates the median of features.
ColumnSimilarity: Calculates the similarity between two columns using tf-idf, cosine or inner product as the distance metric.
Dropna: This operation detects missing values and filters out rows with null values.
Filter: Filters rows from the dataset. This works by taking a callable that takes a dataframe and returns a dataframe with unwanted rows filtered out.
FillMedian: This operation replaces missing values with the median value for the column.
HashBucket: This op maps categorical columns to a contiguous integer range by first hashing the column and then taking the hash modulo the number of buckets, as indicated by num_buckets.
JoinGroupby: This operator groups the data by the given categorical feature(s) and calculates the desired statistics of requested continuous features.
JoinExternal: Joins each dataset partition to an external table.
LambdaOp: LambdaOp allows you to apply row level functions to an NVTabular workflow.
Source: https://nvidia.github.io/NVTabular/main/api/ops/index.html
#page#
Case Study: Criteo
1TB Ads Dataset
80x speedup over CPU for ETL and 114x over TensorFlow on a 40-core CPU node for training
- NumPy CPU ETL + PyTorch CPU Training: 7.5 days total = 5.5 days ETL + 2 days training
- Spark CPU ETL + PyTorch GPU Training: 4 hrs total = 3 hrs ETL + 1 hr training
- Merlin: NVTabular + HugeCTR: 5.2 mins total = 1.9 mins ETL + 3.3 mins training
Benchmark script on GitHub.

[...] to NVTabular
NVTabular is better than Spark and Pandas for tabular recommenders ETL.
NVTabular is focused on tabular deep learning recommenders:
- Native tabular data format support: CSV, Parquet, ORC, Avro
- Easy to implement the most common workflows
- No limit on dataset size (not bound by GPU or CPU memory)
- Optimized TF, PyT, and HugeCTR dataloaders
- Integrated and extensible with RAPIDS Dask cuDF
- Integrated with TensorFlow Serving and Triton for production inference
- Examples published for common datasets and models
- Building an easy path to production deployment for data transforms
- Consistency between data during training and inference
#page#
Tutorials - Best Practices For RecSys Feature Engineering
#page#
Background: cuDF
[Diagram: software stack. Python layer: Dask cuDF, cuDF, Pandas. Below it: Cython, cuDF C++, CUDA libraries (Thrust, CUB, Jitify), CUDA.]
Source: RAPIDS.AI presentation
#page#
Scaling beyond system memory / GPU memory
pandas
  Description: Library for data manipulation and analysis on CPU and system memory
  Limitation: Limited by the system memory
RAPIDS cuDF
  Description: Library for data manipulation and analysis on GPU and GPU memory
  Limitation: Limited by the GPU memory
Dask
  Description: Wrapper for pandas / cudf with lazy execution to optimize and scale beyond system memory and GPU memory
  Limitation: Partitions data in chunks, no limitation
NVTabular
  Description: Wrapper around dask_cudf to provide best practices in feature engineering and simplify the API (from 100-1000 lines to 5-25 lines)
  Limitation: Partitions data in chunks, no limitation
#page#
What is Dask?
- Dask is a task-based library for parallel scheduling and execution.
- Dask decomposes large DataFrames/Series (pandas / cuDF) into a collection of DataFrames/Series.
- Dask schedules and executes the optimized task graph on one or more processes/threads.
Example:
- A Dask DataFrame is a collection of DataFrames.
- Each element can be, for example, a pandas/cuDF DataFrame.
[Figure: a Dask DataFrame made up of pandas DataFrames for January 2016, February 2016, March 2016, April 2016 and May 2016.]
#page#
Experimentation pipeline for RecSys (tabular data)
[Diagram: Input -> Preprocessing -> Feature Engineering -> Model -> Output, repeated over iterations, with the focus on Feature Engineering.]
Most RecSys & tabular data competitions are won by feature engineering instead of model architectures.
This tutorial summarizes general techniques, based on our experiences, applicable to many datasets.
#page#
"Automatic" feature extraction in other domains
Deep Learning = The Entire Machine Is Trainable (Y. LeCun)
- Traditional Pattern Recognition: Fixed/Handcrafted Feature Extractor -> Trainable Classifier
- Mainstream Modern Pattern Recognition (unsupervised mid-level features): Feature Extractor -> Mid-Level Features -> Trainable Classifier
- Deep Learning (representations are hierarchical and trained): Low-Level Features -> Mid-Level Features -> High-Level Features -> Trainable Classifier
Deep Learning performs well in automatic feature extraction in other domains such as images, text and audio.
Although there are attempts at tabular deep learning, adding feature engineering supports models.
#page#
Performance improvement of 5.9%-13.4%
XGBoost
- without feature engineering (raw features): 0.61 AUC
- with feature engineering: 0.646 AUC (+5.9%)
Deep Learning
- without feature engineering (raw features): 0.56 AUC
- with feature engineering: 0.635 AUC (+13.4%)
Note: Current preliminary results
[Bar chart: XGBoost, Deep Learning.]
#page#
Dataset of the Tutorial
Dataset: eCommerce behavior data from multi category store
Source: REES46 Marketing Platform
URL: https:/
Positive target: Purchase
Negative target: AddToCart (removing AddToCarts of purchased items from the same session)
Dataset split:
- Training: Oct-2019 to Feb-2020 (11.4 million samples)
- Validation: March-2020 (2.4 million samples)
- Test: April-2020 (2.7 million samples)
Baseline: 37% of events are purchases
Features:
- UserId, SessionId, ItemId
- Price
- Timestamp
- Category
- Brand
#page#
Overview Feature Types
Bold techniques in focus
Categorical
  Examples: User ID / Item ID, Brand, Main Category
  Feature engineering: Target Encoding, Count Encoding, Categorify + Combining Categories
Unstructured list
  Examples: Keywords, Subcategories, Colors
  Feature engineering: Target Encoding, Count Encoding, Categorify
Numeric
  Examples: Price, Delivery time, Avg. reviews
  Feature engineering: Binning, Normalization, Gauss Rank
Timestamp
  Examples: Timestamp
  Feature engineering: Extract month, weekday, weekend, hour
Timeseries
  Examples: Events in order
  Feature engineering: # of events in past X, Time since last event
Image
  Examples: Product image
  Feature engineering: Extract latent representation with deep learning
Text
  Examples: Description
  Feature engineering: Extract latent representation with deep learning
Social graph
  Examples: Follower/Following graph
  Feature engineering: Link analysis
Geo location
  Examples: Addresses
  Feature engineering: Distances to point of interest
The tutorials will be available here: https:/rapidsai/deeplearning/tree/main/RecSys2020Tutorial
#page#
HANDS-ON LABs
#page#
ADDITIONAL RESOURCES
Overview
- NVIDIA GTC Fall 2020 Keynote Part 6: NVIDIA Merlin for Recommendation Systems
- Product Page: https:/
Blogs
- Accelerating ETL for Recommender Systems on NVIDIA GPUs with NVTabular
- Announcing the NVIDIA NVTabular Open Beta with Multi-GPU Support and New Data Loaders
GTC Sessions
- NVTabular: GPU Accelerated ETL for Recommender Systems
GitHub
- NVTabular: https:/
Thank you!
Please take this survey to give feedback and receive more information: https://forms.gle/SvpoHhSdT5bwHwkc7
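
Illustration: operator semantics
The operator descriptions on the "Feature engineering operator support" slide can be made concrete with a small sketch. It is plain Python and does not call NVTabular; the function names (categorify, fill_missing, log_op, normalize, hash_bucket) and their parameters are hypothetical stand-ins chosen for this illustration, not the library's API. Two choices are assumptions: rare or missing categories share id 0, and the log transform is log(1 + x) so that zero-filled values stay finite.

```python
import hashlib
import math
from collections import Counter
from statistics import mean, pstdev


def categorify(values, freq_threshold=1):
    """Map categories to integer ids; rare or missing values share id 0 (assumption)."""
    counts = Counter(v for v in values if v is not None)
    frequent = sorted(v for v, c in counts.items() if c >= freq_threshold)
    mapping = {v: i + 1 for i, v in enumerate(frequent)}
    return [mapping.get(v, 0) for v in values]


def fill_missing(values, fill_value=0.0):
    """Replace missing (None) entries with a constant."""
    return [fill_value if v is None else v for v in values]


def log_op(values):
    """Log transform, computed as log(1 + x) (assumption, keeps zeros finite)."""
    return [math.log1p(v) for v in values]


def normalize(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against constant columns
    return [(v - mu) / sigma for v in values]


def hash_bucket(values, num_buckets):
    """Hash each value, then take the hash modulo the number of buckets."""
    return [
        int(hashlib.md5(str(v).encode("utf-8")).hexdigest(), 16) % num_buckets
        for v in values
    ]


if __name__ == "__main__":
    brands = ["acme", "zeta", "acme", None, "nova"]
    prices = [10.0, None, 30.0, 0.0, 20.0]
    print(categorify(brands, freq_threshold=2))
    print(normalize(log_op(fill_missing(prices))))
    print(hash_bucket(brands, num_buckets=8))
```

Chaining fill_missing, log_op and normalize mirrors the order annotated on the "Example NVTabular API Workflow" slide (zero fill, log transform, then normalization); in NVTabular the same steps run on GPU DataFrames rather than Python lists.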
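
Illustration: Target Encoding and Count Encoding
The "Overview Feature Types" slide lists Target Encoding and Count Encoding for categorical features such as Brand. The following is a minimal sketch of both ideas in plain Python, assuming a binary purchase target as in the tutorial dataset. The function names, the toy data and the smoothing parameter are assumptions made for this illustration; the slides do not state which smoothing, if any, the tutorial uses.

```python
from collections import defaultdict


def count_encode(categories):
    """Replace each category by how often it occurs in the column."""
    counts = defaultdict(int)
    for c in categories:
        counts[c] += 1
    return [counts[c] for c in categories]


def target_encode(train_cats, train_targets, apply_cats, smoothing=20.0):
    """Replace each category by a smoothed mean of the target.

    encoded = (sum_of_targets + smoothing * global_mean) / (count + smoothing)
    Unseen categories fall back to the global mean.
    """
    global_mean = sum(train_targets) / len(train_targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(train_cats, train_targets):
        sums[c] += y
        counts[c] += 1

    def encode(c):
        n = counts.get(c, 0)
        if n + smoothing == 0:
            return global_mean
        return (sums.get(c, 0.0) + smoothing * global_mean) / (n + smoothing)

    return [encode(c) for c in apply_cats]


if __name__ == "__main__":
    # Hypothetical toy data: brand of the item and whether it was purchased.
    brands = ["acme", "acme", "zeta", "zeta", "zeta", "nova"]
    purchased = [1, 0, 1, 1, 1, 0]
    print(count_encode(brands))
    print(target_encode(brands, purchased, ["acme", "zeta", "unknown"], smoothing=2.0))
```

The smoothing term pulls categories with few observations toward the global purchase rate. In practice the statistics are usually computed out-of-fold so that a row's own target does not leak into its encoding; that step is omitted here for brevity.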
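
Illustration: chunked, shuffled batch loading
The "NVTabular Data Loaders" slide describes reading large chunks, shuffling inside the buffer, and handing batch-sized pieces to the framework. The sketch below shows that control flow on an in-memory Python list. It is an illustration under simplifying assumptions, not the NVTabular data loader: the function name and parameters are hypothetical, the "chunk" is a list slice instead of a Parquet block, and nothing is moved to GPU memory.

```python
import random


def chunked_shuffled_batches(rows, chunk_size, batch_size, seed=0):
    """Yield batches by reading one large chunk at a time and shuffling within it."""
    rng = random.Random(seed)
    for start in range(0, len(rows), chunk_size):
        # 1. Read one large chunk into the buffer (stands in for a block read).
        buffer = list(rows[start:start + chunk_size])
        # 2. Shuffle inside the buffer only, so no per-item file reads are needed.
        rng.shuffle(buffer)
        # 3. Hand out batch-sized slices until the buffer is exhausted.
        for i in range(0, len(buffer), batch_size):
            yield buffer[i:i + batch_size]


if __name__ == "__main__":
    for batch in chunked_shuffled_batches(list(range(10)), chunk_size=4, batch_size=2):
        print(batch)
```

Shuffling per buffer trades some randomness for throughput: rows only mix with other rows of the same chunk, which is why the slide ties the shuffle to the buffer size.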