Integrating HugeCTR Embedding into TensorFlow


Integrate HugeCTR Embedding with TensorFlow
董建兵 (Jianbing Dong), NVIDIA, Dec. 17th, 2020

HugeCTR: Scalable, Accelerated Training
HugeCTR is a highly efficient GPU framework and reference design dedicated to Click-Through-Rate (CTR) estimation training.
- Fast: the fastest available solution in MLPerf v0.7. Achieves a speedup of up to 114x over TensorFlow on a 40-core CPU node, and up to 8.3x over TensorFlow with a single V100 GPU.
- Dedicated: training with terabyte-scale models on single or multiple nodes; GPU hash table with dynamic insertion for streaming training; supports popular recommendation models such as WDL, DCN, DeepFM, and DLRM.
- Easy to use and flexible: Python/C++ interface, JSON-based network configuration.

Framework for Recommendation Systems
- Embedding for streaming training: high-performance GPU hash table based on cuDF, supporting dynamic insertion and resolving collisions on the fly.
- Unified Embedding: all embeddings (multiple feature fields) in one table; fused computation, transaction, and update; sorting-based parameter update to reduce memory footprint; native multi-hot support; embeddings distributed across multiple GPUs.
(Slide diagram: sparse inputs feed several per-GPU hash tables under a single unified Embedding layer.)

Agenda
- Introduction
- What is special about HugeCTR Embedding
- Usage Guide & Samples: how to define DNN models with the plugin
- Performance comparison
INTRODUCTION

HugeCTR Embedding
(Slide diagram: Input Data of shape batch_size * slot_num with up to max_nnz keys per slot → Hash Table → Embedding Table → Output.)
- Unifies multiple slots (feature fields) into one embedding table.
- GPU hash table to support dynamic insertion during streaming training.

HugeCTR Embedding
(Slide diagram: per-GPU pipelines CSR i → Hash Table i → Embedding Table i → Output i, combined by a Reduce-Scatter.)
- The hash table and the embedding table are both split across GPUs; when looking up embedding vectors, each GPU works independently.
- After that, collective operations among all GPUs are conducted to obtain the embedding outputs.
- There are two types of embedding, based on how the input data is distributed to each GPU: Distributed Embedding and Localized Embedding. A minimal sketch of the distributed case follows.
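The slides do not spell out the partitioning rule, so the sketch below is illustrative only, not HugeCTR's implementation: it assumes keys are assigned to "GPUs" by key % num_gpus, each shard is looked up independently, and the per-slot partial sums are combined, which is the role the Reduce-Scatter plays on real hardware.

    # Illustrative sketch of the Distributed Embedding idea (all details assumed).
    import numpy as np

    num_gpus, vocab_per_gpu, vec_size = 2, 1000, 4
    tables = [np.random.rand(vocab_per_gpu, vec_size) for _ in range(num_gpus)]  # one shard per "GPU"

    def distributed_lookup(keys_per_slot):
        """keys_per_slot: one key list per slot; returns a sum-combined vector per slot."""
        partial = np.zeros((num_gpus, len(keys_per_slot), vec_size))
        for slot, keys in enumerate(keys_per_slot):
            for key in keys:                      # multi-hot: several keys per slot
                gpu = key % num_gpus              # assumed ownership rule
                partial[gpu, slot] += tables[gpu][key // num_gpus]
        return partial.sum(axis=0)                # combine partial results across "GPUs"

    print(distributed_lookup([[3, 17], [42], [5, 8, 99]]).shape)  # (3, vec_size)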

HugeCTR Embedding in TensorFlow
(Slide diagram: the same per-GPU CSR → Hash Table → Embedding Table → Output pipeline, exposed as a TensorFlow op.)
- Inputs and outputs reside on the same device.
- Supports model parallelism for big embedding tables.
- Uses its own highly efficient optimizer to update the embedding parameters.
Usage Guide: API Signatures
- Create the resource manager for the plugin: init()
- Create a unified embedding layer: create_embedding(...)
- Do forward propagation: fprop_v4(...)
- Dump / restore parameters to / from files: save(...) / restore(...)

    import tensorflow as tf
    import hugectr_tf_ops

    devices = tf.config.list_physical_devices("GPU")
    for dev in devices:
        tf.config.experimental.set_memory_growth(dev, True)
Usage Guide: API Signatures (init)

    hugectr_tf_ops.init(visible_gpus, seed, key_type, value_type,
                        batch_size, batch_size_eval)

- visible_gpus: which GPUs the plugin should use
- seed: random seed for parameter initialization
- key_type: data type of the input keys
- value_type: data type of the embedding vectors
- batch_size / batch_size_eval: number of samples per step in training / evaluation
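The slide lists the parameters but no concrete call. As a minimal sketch, assuming the op accepts keyword arguments and simple type strings (both are assumptions, only the parameter names come from the slide), a call might look like:

    # Hypothetical values for illustration only.
    hugectr_tf_ops.init(visible_gpus=[0, 1],      # which GPUs the plugin may use
                        seed=123,                  # parameter-initialization seed
                        key_type='int64',          # assumed type string for input keys
                        value_type='float32',      # assumed type string for embedding vectors
                        batch_size=65536,          # matches the benchmark batch size later
                        batch_size_eval=65536)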

Usage Guide: API Signatures (create_embedding)

    embedding_name = hugectr_tf_ops.create_embedding(
        init_value, name_, embedding_type, optimizer_type,
        max_vocabulary_size_per_gpu, slot_size_array, opt_hparams,
        update_type, atomic_update, scaler, slot_num, max_nnz,
        embedding_vec_size, combiner)

- init_value: Boolean or a numpy array used to initialize the embedding table
- embedding_type: distributed or localized
- embedding_vec_size: embedding vector size
- optimizer_type & opt_hparams: define the optimizer used inside the embedding plugin
- combiner: how to combine embedding vectors in the multi-hot case
- Generates a unique embedding_name for this embedding layer.
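A hypothetical call, only to show the shape of the arguments; every value below (optimizer choice, vocabulary split, hyper-parameter layout, slot counts) is an assumption for illustration and is not taken from the slide:

    # Hypothetical values; only the parameter names come from the slide.
    embedding_name = hugectr_tf_ops.create_embedding(
        init_value=False,                       # let the plugin initialize the table itself
        name_='embedding_0',
        embedding_type='distributed',           # or 'localized'
        optimizer_type='adam',                  # assumed optimizer identifier
        max_vocabulary_size_per_gpu=868855,     # e.g. 1737710 keys split over 2 GPUs
        slot_size_array=[],                     # assumed per-slot sizes; left empty here
        opt_hparams=[0.001, 0.9, 0.999, 1e-7],  # assumed (lr, beta1, beta2, epsilon) layout
        update_type='global',
        atomic_update=True,
        scaler=1.0,
        slot_num=26,                            # e.g. the 26 categorical fields of Criteo
        max_nnz=1,
        embedding_vec_size=32,
        combiner='sum')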

Samples: define the plugin embedding layer

    class PluginEmbedding(tf.keras.layers.Layer):
        def __init__(self, ...):
            super(PluginEmbedding, self).__init__()
            # register this layer with the plugin and keep its unique name
            self.name_ = hugectr_tf_ops.create_embedding(...)

        def build(self, _):
            # dummy trainable variable used to trigger back-propagation into the plugin
            self.bp_trigger = self.add_weight(name="bp_trigger", shape=(1,),
                                              dtype=tf.float32, trainable=True)

        @tf.function
        def call(self, row_indices, values, training=True):
            return hugectr_tf_ops.fprop_v4(self.name_, row_indices, values,
                                           self.bp_trigger, training)
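The slide does not show how row_indices and values are built. As a sketch only, assuming they are the sample coordinates and the raw keys of the sparse input batch (an interpretation, not confirmed by the slide), the layer could be fed like this:

    # Hypothetical usage; the meaning of row_indices / values is an assumption.
    layer = PluginEmbedding(...)                        # constructor abbreviated, as on the slide
    sparse_keys = tf.SparseTensor(
        indices=[[0, 0], [0, 1], [1, 0]],               # (sample, position) coordinates
        values=tf.constant([3, 17, 42], dtype=tf.int64),
        dense_shape=[2, 26])                            # batch of 2 samples, 26 slots
    embeddings = layer(sparse_keys.indices[:, 0],       # row index of each key (assumed)
                       sparse_keys.values,              # the keys themselves
                       training=True)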

Samples: define a DNN model without data parallelism

    class DNNModel(tf.keras.models.Model):
        def __init__(self, ...):
            super(DNNModel, self).__init__()
            # define the embedding layer
            self.embedding_layer = PluginEmbedding(...)
            # define the other layers
            self.dense = tf.keras.layers.Dense(...)

        @tf.function
        def call(self, inputs):
            embeddings = self.embedding_layer(inputs)
            logits = self.dense(embeddings)
            return logits

    model = DNNModel(...)
    optimizer = ...

    @tf.function
    def _train_step(inputs, labels):
        with tf.GradientTape() as tape:
            logits = model(inputs)
            loss = loss_fn(logits, labels)
        grads = tape.gradient(loss, model.trainable_weights)
        optimizer.apply_gradients(zip(grads, model.trainable_weights))
        return loss

Samples: define a DNN model with data parallelism

    class SparseModel(tf.keras.models.Model):
        def __init__(self, ...):
            super(SparseModel, self).__init__()
            # only the embedding layer
            self.embedding_layer = PluginEmbedding(...)

        @tf.function
        def call(self, inputs):
            return self.embedding_layer(inputs)

    class DenseModel(tf.keras.models.Model):
        def __init__(self, ...):
            super(DenseModel, self).__init__()
            self.dense_layers = tf.keras.layers.Dense(...)

        @tf.function
        def call(self, inputs):
            return self.dense_layers(inputs)

Samples: define a DNN model with data parallelism (training step)

    sparse_model = SparseModel(...)
    sparse_opt = ...
    dense_model = DenseModel(...)
    dense_opt = ...
    strategy = tf.distribute.MirroredStrategy(...)

    @tf.function
    def dense_train_step(dense_inputs, dense_labels):
        with tf.GradientTape() as tape:
            tape.watch(dense_inputs)
            dense_logits = dense_model(dense_inputs)
            dense_loss = loss_fn(dense_labels, dense_logits)
        dense_grads, inputs_grads = tape.gradient(
            dense_loss, [dense_model.trainable_weights, dense_inputs])
        dense_opt.apply_gradients(zip(dense_grads, dense_model.trainable_weights))
        return dense_loss, inputs_grads

    @tf.function
    def total_train_step(inputs, labels):
        with tf.GradientTape() as tape:
            # model-parallel embedding lookup across all GPUs
            embeddings = sparse_model(inputs)
            # split the embedding output and run the dense part data-parallel
            dense_inputs = tf.split(embeddings, len(gpus))
            dense_labels = tf.split(labels, len(gpus))
            dense_inputs_replicas = PerReplica(dense_inputs)
            dense_labels_replicas = PerReplica(dense_labels)
            dense_loss, embedding_grads = strategy.run(
                dense_train_step, args=(dense_inputs_replicas, dense_labels_replicas))
            all_grads = tf.concat(embedding_grads.values, axis=0)
        # back-propagate the gathered gradients into the plugin embedding
        grads = tape.gradient(embeddings, sparse_model.trainable_weights,
                              output_gradients=all_grads)
        sparse_opt.apply_gradients(zip(grads, sparse_model.trainable_weights))
        return strategy.reduce(tf.distribute.ReduceOp.SUM, dense_loss, axis=None)

Performance Comparison
- Kaggle Criteo dataset
- 1 embedding layer + 7 FC layers (6 * 1024 units + 1 * 1 unit), Distributed embedding
- batch_size = 65536, embedding_vector_size = 32, vocabulary_size = 1737710
- DGX-1, MirroredStrategy for data parallelism
- Comparison baseline: tf.nn.embedding_lookup_sparse, sketched below
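For reference, the baseline named on the slide is the stock TensorFlow op tf.nn.embedding_lookup_sparse. A minimal sketch of such a lookup, with a made-up table size and inputs purely for illustration:

    import tensorflow as tf

    # Hypothetical small example of the baseline op used in the comparison.
    params = tf.Variable(tf.random.normal([1737710, 32]))   # vocabulary_size x embedding_vector_size
    sp_ids = tf.SparseTensor(indices=[[0, 0], [0, 1], [1, 0]],
                             values=tf.constant([3, 17, 42], dtype=tf.int64),
                             dense_shape=[2, 2])             # 2 samples, multi-hot keys
    vectors = tf.nn.embedding_lookup_sparse(params, sp_ids, None, combiner='sum')
    print(vectors.shape)                                     # (2, 32)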

Performance Comparison
(Slide chart: per-iteration time of plugin-embedding vs. origin-embedding and plugin-end2end vs. origin-end2end; y-axis from 0 to 400.)
- 4.7x speedup over TensorFlow on the embedding part
- 3.4x speedup end to end

Check it out on GitHub: https:/ / plugin.ipynb

Related Sessions in GTC China: Learning More About NVIDIA Merlin
- Merlin: a GPU-accelerated recommender system framework. CNS20590, 王澤意, NVIDIA APAC AI Developer Technology Manager
- Merlin HugeCTR: a deep dive into performance optimization. CNS20516, Minseok Lee, GPU Computing Expert, NVIDIA
- Merlin NVTabular: best practices for GPU-accelerated feature engineering in recommender systems. CNS20624, 黃孟迪, Deep Learning Engineer, NVIDIA
- GPU-accelerated data processing in recommender systems. CNS20813, 魏英燦, GPU Computing Expert, NVIDIA
- Integrating HugeCTR Embedding into TensorFlow. CNS20377, 董建兵, GPU Computing Expert, NVIDIA
- Accelerating CTR inference with the GPU embedding cache. CNS20626, 布凡, GPU Computing Expert, NVIDIA

Thank You!
NVIDIA
