INTEGRATE HUGECTR EMBEDDING WITH TENSORFLOW
Jianbing Dong (董建兵), NVIDIA, Dec. 17, 2020

HugeCTR: Scalable, Accelerated Training
- HugeCTR is a highly efficient GPU framework and reference design dedicated to Click-Through-Rate (CTR) estimation training.
- Fast: the fastest available solution in MLPerf v0.7. Achieves a speedup of up to 114x over TensorFlow on a 40-core CPU node, and up to 8.3x over TensorFlow with a single V100 GPU.
- Dedicated: training with terabyte-scale models on single or multiple nodes; GPU hash table with dynamic insertion for streaming training; supports popular recommendation model variants: WDL / DCN / DeepFM / DLRM, etc.
- Easy to use and flexible: Python/C++ interface, JSON-based network configuration.

Framework for Recommendation Systems
- Embedding for streaming training: high-performance GPU hash table based on cuDF; supports dynamic insertion; resolves collisions on the fly (a conceptual sketch follows this slide).
- Unified embedding: all embeddings (multiple feature fields) in one hash table; fused computation / transaction / update; sorting-based parameter update to reduce the memory footprint.
- Sparse inputs: native multi-hot support; embeddings are distributed across multiple GPUs.
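To make "dynamic insertion" concrete, here is a minimal CPU-side Python sketch of the idea: categorical keys that have never been seen before are assigned a fresh embedding row during training instead of being mapped into a fixed, pre-built vocabulary. This is only an illustration of the concept; the class name and the growth strategy are made up for this sketch, and HugeCTR's real implementation is a concurrent GPU hash table built on cuDF.

    import numpy as np

    class DynamicEmbeddingTable:
        """Toy illustration of dynamic insertion: unseen keys get a new row on the fly."""
        def __init__(self, embedding_vec_size, initial_capacity=4):
            self.vec_size = embedding_vec_size
            self.key_to_row = {}                                  # the "hash table"
            self.table = np.random.uniform(
                -0.05, 0.05, size=(initial_capacity, embedding_vec_size))

        def lookup(self, keys):
            out = np.empty((len(keys), self.vec_size))
            for i, key in enumerate(keys):
                if key not in self.key_to_row:                    # dynamic insertion
                    if len(self.key_to_row) == self.table.shape[0]:
                        grown = np.random.uniform(-0.05, 0.05, size=self.table.shape)
                        self.table = np.vstack([self.table, grown])   # grow capacity
                    self.key_to_row[key] = len(self.key_to_row)
                out[i] = self.table[self.key_to_row[key]]
            return out

    # Streaming training: new categorical IDs can appear in every incoming batch.
    table = DynamicEmbeddingTable(embedding_vec_size=8)
    print(table.lookup([10, 42, 10, 7]).shape)                    # (4, 8)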
AGENDA
- Introduction: what is special about the HugeCTR embedding
- Usage Guide & Samples: how to define DNN models with the plugin
- Performance: performance comparison

INTRODUCTION

HugeCTR Embedding
(Figure: input data of shape batch_size * slot_num, with up to max_nnz keys per slot, flows through a hash table and an embedding table to produce the output.)
- Unifies multiple slots (feature fields) into one embedding table.
- A GPU hash table supports dynamic insertion during streaming training.
HugeCTR Embedding
(Figure: CSR 1 ... CSR n from the input data are routed to Hash Table 1..n and Embedding Table 1..n on different GPUs; a reduce-scatter combines the per-GPU results into Output 1..n.)
- Both the hash table and the embedding table are split across GPUs.
- When looking up embedding vectors, each GPU works independently; after that, collective operations among all GPUs are conducted to obtain the embedding outputs.
- There are two types of embedding, based on how the input data is distributed to each GPU: Distributed Embedding and Localized Embedding.

HugeCTR Embedding in TensorFlow
(Figure: the same CSR -> hash table -> embedding table -> output pipeline, exposed as a TensorFlow op.)
- Inputs and outputs reside on the same device.
- Supports model parallelism for big embedding tables.
- Uses its own highly efficient optimizer to update the embedding parameters.
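To illustrate the lookup pattern described above, here is a minimal NumPy sketch of a model-parallel embedding: every "GPU" holds one table shard, looks up the keys it owns independently, and a collective step combines the partial results. The modulo-based sharding and the plain sum standing in for the collective are assumptions made for this sketch; they are not HugeCTR's actual partitioning or NCCL collectives.

    import numpy as np

    num_gpus, vocab_size, vec_size = 4, 1000, 8
    rng = np.random.default_rng(0)

    # One table shard per "GPU"; in this sketch a key is owned by GPU (key % num_gpus).
    shards = [rng.uniform(-0.05, 0.05, size=(vocab_size, vec_size)) for _ in range(num_gpus)]

    def lookup_and_combine(batch_keys):
        """batch_keys: one list of categorical keys per sample (multi-hot slots flattened)."""
        partial = np.zeros((num_gpus, len(batch_keys), vec_size))
        # Step 1: every GPU looks up only the keys it owns, independently of the others.
        for gpu_id in range(num_gpus):
            for s, keys in enumerate(batch_keys):
                for key in keys:
                    if key % num_gpus == gpu_id:
                        partial[gpu_id, s] += shards[gpu_id][key]
        # Step 2: a collective operation combines the per-GPU partial sums; HugeCTR uses
        # reduce-scatter among the GPUs, modeled here by a plain sum over shards.
        return partial.sum(axis=0)

    print(lookup_and_combine([[1, 5, 9], [2, 2, 7]]).shape)       # (2, 8)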
Usage Guide: API Signatures
- Create the resource manager for the plugin: init()
- Create a unified embedding layer: create_embedding(...)
- Do forward propagation: fprop_v4(...)
- Dump / restore parameters to / from files: save(...) / restore(...)

    import tensorflow as tf
    import hugectr_tf_ops

    devices = tf.config.list_physical_devices("GPU")
    for dev in devices:
        tf.config.experimental.set_memory_growth(dev, True)

Usage Guide: API Signatures
Create the resource manager for the plugin:

    hugectr_tf_ops.init(visible_gpus, seed, key_type, value_type,
                        batch_size, batch_size_eval)

- visible_gpus: which GPUs the plugin uses
- seed: random seed for parameter initialization
- key_type: data type of the input keys
- value_type: data type of the embedding vectors
- batch_size / batch_size_eval: number of samples per batch in training / evaluation
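Putting those parameters together, an init call might look like the sketch below. The argument names follow the slide, but all concrete values (the GPU list, data types, seed, and batch sizes) are placeholders chosen for illustration, and whether the types are passed as strings or TensorFlow dtypes depends on the plugin build.

    import hugectr_tf_ops

    # Illustrative values only: two visible GPUs, int64 keys, float32 embedding vectors,
    # and the batch sizes that the embedding layer will later be created with.
    hugectr_tf_ops.init(visible_gpus=[0, 1],
                        seed=123,
                        key_type='int64',
                        value_type='float32',
                        batch_size=65536,
                        batch_size_eval=65536)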
Usage Guide: API Signatures
Create a unified embedding layer:

    embedding_name = hugectr_tf_ops.create_embedding(
        init_value, name_, embedding_type, optimizer_type,
        max_vocabulary_size_per_gpu, slot_size_array, opt_hparams,
        update_type, atomic_update, scaler, slot_num, max_nnz,
        embedding_vec_size, combiner)

- init_value: Boolean, or a numpy array holding initial values for the embedding table
- embedding_type: distributed or localized
- embedding_vec_size: size of each embedding vector
- optimizer_type & opt_hparams: define the optimizer used inside the embedding plugin
- combiner: how to combine embedding vectors within a multi-hot slot
- The call generates a unique embedding_name for this embedding layer.
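As a worked example of the signature above, the call below creates one distributed embedding sized like the Criteo configuration used later in the performance slides. The structure mirrors the listed parameters, but every concrete value (optimizer string, hyperparameter layout, slot count, and so on) is an assumed placeholder rather than something prescribed by the plugin.

    import hugectr_tf_ops

    # All values below are placeholders for illustration; the accepted strings and the
    # layout of opt_hparams depend on the plugin version.
    embedding_name = hugectr_tf_ops.create_embedding(
        init_value=False,                        # False: let the plugin initialize randomly
        name_='demo_embedding',
        embedding_type='distributed',
        optimizer_type='adam',
        max_vocabulary_size_per_gpu=1737710,
        slot_size_array=[],
        opt_hparams=[0.001, 0.9, 0.999, 1e-5],   # e.g. lr, beta1, beta2, epsilon (assumed)
        update_type='global',
        atomic_update=True,
        scaler=1.0,
        slot_num=26,
        max_nnz=1,
        embedding_vec_size=32,
        combiner='sum')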
Samples: Define the plugin embedding layer

    class PluginEmbedding(tf.keras.layers.Layer):
        def __init__(self, ...):
            super(PluginEmbedding, self).__init__()
            self.name_ = hugectr_tf_ops.create_embedding(...)

        def build(self, _):
            # dummy trainable weight used to trigger back-propagation into the plugin
            self.bp_trigger = self.add_weight(name='bp_trigger', shape=(1,),
                                              dtype=tf.float32, trainable=True)

        @tf.function
        def call(self, row_indices, values, training=True):
            return hugectr_tf_ops.fprop_v4(self.name_, row_indices, values,
                                           self.bp_trigger, training)

Samples: Define a DNN model without data parallelism

    class DNNModel(tf.keras.models.Model):
        def __init__(self, ...):
            super(DNNModel, self).__init__()
            # define the embedding layer
            self.embedding_layer = PluginEmbedding(...)
            # define the other layers
            self.dense = tf.keras.layers.Dense(...)

        @tf.function
        def call(self, inputs):
            embeddings = self.embedding_layer(inputs)
            logits = self.dense(embeddings)
            return logits

    model = DNNModel(...)
    optimizer = ...

    @tf.function
    def _train_step(inputs, labels):
        with tf.GradientTape() as tape:
            logits = model(inputs)
            loss = loss_fn(logits, labels)
        grads = tape.gradient(loss, model.trainable_weights)
        optimizer.apply_gradients(zip(grads, model.trainable_weights))
        return loss
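The call() above forwards row_indices and values straight into fprop_v4. One plausible way to produce them is from a tf.SparseTensor batch of categorical keys, as sketched below; that this is exactly the layout fprop_v4 expects is an assumption made for illustration (see the notebook linked at the end of the deck for the actual input pipeline).

    import tensorflow as tf

    # A toy batch: 2 samples x 3 slots of categorical keys, stored sparsely.
    # Assumption for illustration: fprop_v4 consumes the 2-D indices and the key values
    # of such a sparse batch as row_indices and values.
    sparse_keys = tf.sparse.SparseTensor(
        indices=[[0, 0], [0, 1], [1, 0], [1, 2]],
        values=tf.constant([17, 23, 5, 91], dtype=tf.int64),
        dense_shape=[2, 3])

    row_indices = sparse_keys.indices   # shape [nnz, 2]
    values = sparse_keys.values         # shape [nnz]

    # embedding_layer = PluginEmbedding(...)
    # embeddings = embedding_layer(row_indices, values, training=True)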
Samples: Define a DNN model with data parallelism

    class SparseModel(tf.keras.models.Model):
        def __init__(self, ...):
            super(SparseModel, self).__init__()
            # only the embedding layer
            self.embedding_layer = PluginEmbedding(...)

        @tf.function
        def call(self, inputs):
            return self.embedding_layer(inputs)

    class DenseModel(tf.keras.models.Model):
        def __init__(self, ...):
            super(DenseModel, self).__init__()
            self.dense_layers = tf.keras.layers.Dense(...)

        @tf.function
        def call(self, inputs):
            return self.dense_layers(inputs)

Samples: Define a DNN model with data parallelism
- The sparse (embedding) model runs model-parallel, while the dense model is replicated with MirroredStrategy; the per-replica gradients with respect to the embedding outputs are gathered and fed back to the sparse model through output_gradients.

    sparse_model = SparseModel(...)
    sparse_opt = ...
    strategy = tf.distribute.MirroredStrategy(...)
    dense_model = DenseModel(...)
    dense_opt = ...

    @tf.function
    def dense_train_step(dense_inputs, dense_labels):
        with tf.GradientTape() as tape:
            tape.watch(dense_inputs)
            dense_logits = dense_model(dense_inputs)
            dense_loss = loss_fn(dense_labels, dense_logits)
        dense_grads, inputs_grads = tape.gradient(
            dense_loss, [dense_model.trainable_weights, dense_inputs])
        dense_opt.apply_gradients(zip(dense_grads, dense_model.trainable_weights))
        return dense_loss, inputs_grads

    @tf.function
    def total_train_step(inputs, labels):
        with tf.GradientTape() as tape:
            embeddings = sparse_model(inputs)
            dense_inputs = tf.split(embeddings, len(gpus))
            dense_labels = tf.split(labels, len(gpus))
            dense_inputs_replicas = PerReplica(dense_inputs)
            dense_labels_replicas = PerReplica(dense_labels)
            dense_loss, embedding_grads = strategy.run(
                dense_train_step, args=(dense_inputs_replicas, dense_labels_replicas))
            all_grads = tf.concat(embedding_grads.values, axis=0)
            grads = tape.gradient(embeddings, sparse_model.trainable_weights,
                                  output_gradients=all_grads)
        sparse_opt.apply_gradients(zip(grads, sparse_model.trainable_weights))
        return strategy.reduce(tf.distribute.ReduceOp.SUM, dense_loss, axis=None)

Performance Comparison
- Kaggle Criteo dataset
- 1 embedding + 7 FC layers (6 * 1024 units + 1 * 1 unit), distributed embedding
- batch_size = 65536, embedding_vector_size = 32, vocabulary_size = 1737710
- Hardware: DGX-1
- MirroredStrategy for data parallelism
- Comparison baseline: tf.nn.embedding_lookup_sparse
Performance Comparison
(Chart: measured time for plugin-embedding, plugin-end2end, origin-embedding, and origin-end2end.)
- 4.7x speedup over TensorFlow on the embedding part
- 3.4x speedup end to end

Check it out on GitHub: https://.../plugin.ipynb
RELATED SESSIONS IN GTC CHINA
Learning more about NVIDIA Merlin:
- Merlin: a GPU-accelerated recommender system framework, CNS20590 - 王澤意, AI Developer Technology Manager (APAC), NVIDIA
- Merlin HugeCTR: a deep dive into performance optimization, CNS20516 - Minseok Lee, GPU Computing Expert, NVIDIA
- Merlin NVTabular: best practices for GPU-accelerated feature engineering in recommender systems, CNS20624 - 黃孟迪, Deep Learning Engineer, NVIDIA
- GPU-accelerated data processing in recommender systems, CNS20813 - 魏英燦, GPU Computing Expert, NVIDIA
- Integrating the HugeCTR embedding with TensorFlow, CNS20377 - 董建兵, GPU Computing Expert, NVIDIA
- Accelerating CTR inference with the GPU embedding cache, CNS20626 - 布凡, GPU Computing Expert, NVIDIA

Thank You!