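Haiyun Jiang (蔣海云), Senior Researcher, Tencent AI Lab
Key Techniques for Fine-Grained Entity Recognition in TexSmart, Tencent's Text Understanding System

Contents
1 Introduction to the TexSmart system
2 Overview of fine-grained NER in TexSmart
3 A knowledge-base-driven hybrid method
4 Distant supervision with sibling-mention inference
5 A multi-source fusion method for the zero-shot setting

01 Introduction to the TexSmart system

- TexSmart is a toolkit and service for natural language understanding.
- It performs lexical, syntactic, and semantic analysis of text in both Chinese and English.
- https:/ [truncated in the source]

Two major challenges for fine-grained NER
- Scalability: the type set grows from a dozen or so types to more than 1,000. "Person" alone splits into actor, singer, athlete, TV host, writer, and so on. With this many types, a large amount of training data has to be labeled, and labeling fine-grained training data is too expensive.
- Ambiguity: in "蘋果 CEO 正在喝 蘋果 汁" ("The Apple CEO is drinking apple juice"), is 蘋果 a company or a fruit? In "李娜網球" ("Li Na, tennis") vs. "李娜唱功" ("Li Na, singing"), is 李娜 an athlete or an actor?

Feature 1: fine-grained NER

Feature 2: semantic expansion
- For a given entity in a sentence, predict a set of related entities.
- Example: 流浪地球 (The Wandering Earth) → 戰狼二 (Wolf Warrior 2), 上海堡壘 (Shanghai Fortress), 悲傷逆流成河 (Cry Me a Sad River)

Feature 3: multi-dimensional design principles and how they are implemented
1. The tension between accuracy and speed: several models and algorithms are implemented, including both high-accuracy models and fast models.
2. From closed test settings to open test settings: models are trained with unlabeled data (the word segmentation model and the fine-grained NER model).
3. Dynamic model updates: unlabeled data is collected incrementally and the models are updated periodically.

TexSmart basic functions
- Type 1, text understanding: word segmentation, part-of-speech tagging, named entity recognition (NER), semantic expansion, syntactic parsing, semantic role labeling, text classification, keyword extraction.
- Type 2, text matching: semantic similarity.
- Type 3, text graph: similar words, synonyms, antonyms, hypernyms, hyponyms.

TexSmart Demo: word segmentation and tagging, named entity recognition, text classification, syntactic parsing
TexSmart Demo: semantic role labeling, text matching
TexSmart Demo: text graph

02 Overview of fine-grained NER in TexSmart

NER granularity requirements: coarse-grained, fine-grained, ultra-fine-grained

Applications of ultra-fine-grained NER: understand text better and support downstream tasks
(1) Understanding tasks such as relation extraction, knowledge base construction, and question answering
(2) Generation tasks such as text rewriting, dialogue generation, and question generation

Examples of ultra-fine-grained NER types
- Some of the ultra-fine-grained entity types under "person"
- The ultra-fine-grained entity types under "location"
- TexSmart ultra-fine-grained type hierarchy: https:/ [truncated in the source]

Methods
- Unsupervised method
- Hybrid method that combines the unsupervised and supervised models

Fine-grained NER: the unsupervised algorithm
- Extraction: Is-a pairs are extracted from unstructured text, e.g. (蘋果 "apple", 公司 "company"), (西瓜 "watermelon", 水果 "fruit"), (蘋果, 水果), (微軟 "Microsoft", 公司).
- Construction: the Is-a data is turned into a term-to-type graph, e.g. (蘋果, pany), (蘋果, food.fruit), (西瓜, food.fruit), (微軟, pany). The type name "pany" is truncated in the source.
- Mapping: "西瓜 很甜" ("The watermelon is sweet") → food.fruit

References
- TexSmart: A text understanding system for fine-grained NER and enhanced semantic analysis, arXiv preprint
- TexSmart: A system for enhanced natural language understanding, ACL 2021

Fine-grained NER: the unsupervised algorithm, Is-a data extraction
a) Write Is-a templates by hand
b) Extract pairs from large amounts of unstructured data
- Is-a templates: "X1、X2等Y" ("X1, X2, and other Y") and "Y諸如X1、X2等" ("Y such as X1, X2, etc.")
- Matches: "蘋果、西瓜等水果" ("apples, watermelons, and other fruits") and "上市公司諸如蘋果和微軟" ("listed companies such as Apple and Microsoft") yield (蘋果, 水果), (西瓜, 水果), (蘋果, 公司), (微軟, 公司).

Fine-grained NER: the unsupervised algorithm, the ambiguity problem
- 蘋果 has two types in the term-to-type graph, so for "蘋果 汁" ("apple juice") the lookup alone cannot decide between food.fruit and the company type.

Fine-grained NER: the unsupervised algorithm, resolving ambiguity
- Offline: word vectors trained on a corpus are clustered and combined with the term-to-type graph, giving (entity set, type) pairs:
  C1: (蘋果, 西瓜, food.fruit)
  C2: (蘋果, 微軟, 谷歌 "Google", pany)
  C3: (C++, Java, Python, language.programming)
- Online: for the input "蘋果汁", the candidates C1 and C2 are retrieved and scored, and the output is food.fruit.

Fine-grained NER: the unsupervised algorithm, drawbacks
- Drawback 1: entities that are not in the dictionary cannot be recognized.
- Drawback 2: dictionary size trades off against coverage. A large dictionary covers more entities but uses more memory; a small dictionary uses less memory but cannot recognize low-frequency entities.

Fine-grained NER: the hybrid algorithm, basic idea
- The unsupervised method predicts a distribution over fine-grained types.
- The supervised method predicts a distribution over coarse-grained types.
- Joint inference selects the best fine-grained type.
- Example, "蘋果 CEO": the unsupervised model gives 水果 (fruit) 0.5, the supervised model gives 機構名 (organization) 0.6, and the joint model gives 公司 (company) 0.18 = 0.6 × 0.3. The product implies an unsupervised score of 0.3 for 公司.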
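The unsupervised pipeline described above (template-based Is-a extraction, a term-to-type graph, and cluster-based disambiguation) can be sketched roughly as follows. This is a minimal illustration, not TexSmart's implementation: the regular expressions, the type label `org.company` (the slide only shows the truncated name "pany"), the toy 2-d word vectors, and the average-cosine scoring rule are all assumptions made for the example.

```python
import math
import re

# The two hand-written Is-a templates from the slide:
# "X1、X2等Y" and "Y諸如X1、X2等". The regexes are an assumed encoding.
X_ETC_Y = re.compile(r"^(?P<xs>.+?)等(?P<y>.+)$")
Y_SUCH_AS_X = re.compile(r"^(?P<y>.+?)諸如(?P<xs>.+?)等?$")


def extract_is_a(phrase):
    """Return (term, category) pairs matched by either template."""
    match = Y_SUCH_AS_X.match(phrase) or X_ETC_Y.match(phrase)
    if match is None:
        return []
    terms = [t for t in re.split(r"[、和]", match.group("xs")) if t]
    return [(term, match.group("y")) for term in terms]


# Category-to-type mapping. "org.company" is a hypothetical label.
CATEGORY_TO_TYPE = {"水果": "food.fruit", "上市公司": "org.company"}


def build_term_to_type(pairs):
    """Build the term-to-type graph as a dict: term -> set of types."""
    graph = {}
    for term, category in pairs:
        graph.setdefault(term, set()).add(CATEGORY_TO_TYPE[category])
    return graph


# Assumed offline output: (entity set, type) clusters as on the slide.
CLUSTERS = [
    ({"蘋果", "西瓜"}, "food.fruit"),
    ({"蘋果", "微軟", "谷歌"}, "org.company"),
]

# Toy word vectors, invented for this example only.
VEC = {
    "西瓜": (1.0, 0.1), "汁": (0.9, 0.2),
    "微軟": (0.1, 1.0), "谷歌": (0.2, 0.9), "CEO": (0.0, 1.0),
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def predict_type(mention, context, graph):
    """Score each candidate cluster by the average cosine similarity
    between its other members and the context words; return the best type,
    or None when the mention is not in the dictionary."""
    best_type, best_score = None, -1.0
    for members, type_name in CLUSTERS:
        if mention not in members or type_name not in graph.get(mention, set()):
            continue
        sims = [cosine(VEC[m], VEC[w])
                for m in members - {mention} if m in VEC
                for w in context if w in VEC]
        score = sum(sims) / len(sims) if sims else 0.0
        if score > best_score:
            best_type, best_score = type_name, score
    return best_type


pairs = extract_is_a("蘋果、西瓜等水果") + extract_is_a("上市公司諸如蘋果和微軟")
graph = build_term_to_type(pairs)
print(predict_type("蘋果", ["汁"], graph))    # food.fruit
print(predict_type("蘋果", ["CEO"], graph))   # org.company
print(predict_type("王青松", ["CEO"], graph))  # None (not in the dictionary)
```

The last call shows Drawback 1: a mention that never appeared in the Is-a data gets no type at all, which is what the hybrid algorithm is meant to address. Note that the sketch keeps 上市公司 as its own category; reducing it to 公司, as the slide does, would need an extra head-word normalization step.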
[Figure: screenshot of the paper draft "TEXSMART: A Text Understanding Platform for Natural Language Processing" showing the joint inference rule]

$$\hat{t} = \arg\max_{t} \; P_U(t \mid x)\, P_S(m(t) \mid x) \qquad (1)$$

Read against the basic idea of the hybrid algorithm, $P_U$ is presumably the unsupervised distribution over fine-grained types, $P_S$ the supervised distribution over coarse-grained types, and $m(t)$ the coarse-grained type that fine-grained type $t$ maps to.
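The joint inference rule in Eq. (1) can be illustrated with the "蘋果 CEO" numbers from the slide. This is a minimal sketch under stated assumptions: the type labels, the fine-to-coarse mapping `m`, and the supervised probability for `other` are hypothetical; only 0.5, 0.6, and the product 0.18 = 0.6 × 0.3 come from the source.

```python
def joint_inference(p_unsup, p_sup, fine_to_coarse):
    """Return argmax_t P_U(t|x) * P_S(m(t)|x) together with all joint scores."""
    scores = {
        fine: p_fine * p_sup.get(fine_to_coarse[fine], 0.0)
        for fine, p_fine in p_unsup.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Unsupervised distribution over fine-grained types for "蘋果" in "蘋果 CEO".
# 0.5 is from the slide; 0.3 is implied by the slide's product 0.6 * 0.3.
p_unsup = {"food.fruit": 0.5, "org.company": 0.3}

# Supervised distribution over coarse-grained types.
# 0.6 is from the slide; the value for "other" is a hypothetical filler.
p_sup = {"organization": 0.6, "other": 0.2}

# m(t): hypothetical mapping from fine-grained to coarse-grained types.
fine_to_coarse = {"food.fruit": "other", "org.company": "organization"}

best, scores = joint_inference(p_unsup, p_sup, fine_to_coarse)
print(best)  # org.company
```

Although the unsupervised model alone prefers `food.fruit` (0.5 vs. 0.3), the supervised model's confidence in the coarse type `organization` lifts `org.company` to about 0.18, above the 0.10 that `food.fruit` receives under these assumed numbers.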
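Fine-grained NER: the hybrid algorithm, example

Input: "上個月30號,南昌王青松在自己家里邊看流浪地球邊吃煲仔飯。" ("On the 30th of last month, Wang Qingsong of Nanchang ate claypot rice at home while watching The Wandering Earth.")

The slide shows three sets of predictions, labeled unsupervised model, supervised model (trained on human-annotated data with 12 types), and joint inference. The extracted text lists the sets in this order:

Mention      Set 1            Set 2           Set 3
上個月30號    time.generic     time.generic    time.generic
南昌          loc.city         loc.city        loc.generic
王青松        person.generic   (absent)        person.generic
煲仔飯        food.generic     food.generic    other
流浪地球      work.movie       work.movie      work.generic

The slide layout is lost, so the assignment has to be read from the content. Set 3 uses only coarse types, which is consistent with the 12-type supervised model. Set 2 misses 王青松, which is consistent with the dictionary-bound unsupervised model. Set 1 has a fine-grained type for every mention, which is consistent with joint inference.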
Fine-grained NER: experimental evaluation

Chinese dataset
- Unlabeled training data: about 400G
- Coarse-grained labeled data: 29K sentences

Method   Precision   Recall   F-score
Base     56.26       55.02    55.68
Hybrid   72.80       58.88    65.10

Methods
- Base: the unsupervised method
- Hybrid: the unsupervised and supervised methods combined

04 Distant supervision with sibling-mention inference

Motivation
- It is challenging to learn effective representations for contextualized mentions in fine-grained entity typing (FGET), because the representations must distinguish fine-grained types whose semantics are similar but not identical.
- Existing SOTA models perform poorly on a certain number of "hard" mentions, which limits overall performance.
- First, the structure of some contexts surrounding the hard mentions is inherently too complex.
- Second, the contexts of some hard mentions are ambiguous, so learning from context alone is insufficient to handle them.

Learning from Sibling Mentions with Scalable Graph Inference in Fine-Grained Entity Typing (Chen et al., ACL 2022)

Sibling mentions
- Sibling mentions are mentions that potentially share the same or semantically similar types (e.g., country and nation) with the target mention.
- To detect them, we propose two similarity metrics, on which we build an effective sibling selection algorithm.
Figure 1: Illustration of the proposed approach.

Heterogeneous graph model
- Two kinds of nodes: mentions and types.
- Three kinds of edges: the sibling relationship between mentions, the hierarchical relationship between types, and the isLabel relationship between mentions and types.
Figure 2: Illustration of the graph model.

Method overview
- First, a mention-type graph is constructed from the training samples.
- Then, the features for mentions and types are learned by an attentive graph neural module on top of this graph.
- At inference time, we add test mentions to the graph by connecting them to their sibling mentions in the training set.

Sibling mention detection
- Word distribution-based metric: assumes that mentions sharing more contextual words tend to have more similar ground-truth types. We encode mentions as sparse TF-IDF feature vectors and measure the sibling similarity between any two mentions by the cosine similarity of their vectors.
- Typing distribution-based metric: we first derive prior score distributions over the type set Y for all mentions in the dataset from an extra base model (Lin and Ji, 2019) trained on the same dataset. Sibling mentions are then selected by the cosine similarity between their score distributions and that of the target mention.

Attentive graph neural network
- We employ graph neural networks (GNNs) with multiple layers to aggregate the information of sibling mentions and types when learning mention representations.
- The slide gives equations for the update of the type embedding, the update of the mention embedding, and type prediction. They are not present in the extracted text.

Dropout of type neighbors
- The final-layer representation of a mention incorporates information from its ground-truth type neighbors, yet it is then used to predict those same ground-truth types.
- If the neighborhood contains all the ground-truth types, the model inevitably degenerates: it focuses on the type neighbors and ignores the mention neighbors entirely.
- To overcome this, each neighboring type is randomly discarded with a certain probability. Predicting a discarded type forces the model to learn from the sibling mentions rather than directly from the type neighbors.

Scalable inference
- Step 1: given a batch of test mentions, we first obtain their sibling mentions.
- Step 2: we add the test mentions as nodes to the mention-type graph and connect each one to the sibling mentions selected in Step 1. In the new graph, test mentions have no type neighbors because their ground-truth types are not available, and there is no edge between any two test mentions.
- Step 3: the representations of the test mentions are updated by aggregating the embeddings of their sibling mentions. The final representations are obtained after all layers of updates.
- Step 4: we predict the type score distribution for each test mention from its final embedding and the type embeddings.

Main results
- We evaluate the proposed model on two widely used datasets, OntoNotes and BBN, and consider both the original and the augmented OntoNotes.
- We select sibling mentions according to the typing distribution from Lin and Ji (2019).
- After aggregating sibling information through the attentive graph neural module, our model significantly outperforms Lin and Ji (2019) on both the original OntoNotes and the BBN dataset.

Effectiveness of sibling mention detection: measuring sibling quality
- For each mention, the slide denotes its ground-truth types, its sibling mentions in the graph, and the ground-truth types of those siblings. The symbols are lost in the extracted text.
- To quantify the effect of different similarity metrics on the quality of siblings, we define Purity, Coverage, and Quality analogously to Precision, Recall, and F1.

Effectiveness of sibling mention detection: results
- The typing-based metric performs better than the word-based metric.
- The scores of the gold typing-based metric and the random-based metric give the upper bound and the lower bound for the typing-based metric.

05 A multi-source fusion method for the zero-shot setting

Task definition
- Fine-grained entity typing (FET) aims to detect the types of an entity mention given its context. The types usually form a hierarchy.
- Zero-shot fine-grained entity typing (ZFET): the target types for training and testing are entirely disjoint.
- Example. Sentence: "Northwest and Midway are two of the five airlines with which Budget has agreements." Mention: Northwest. Type: /organization/corporation.
- Hierarchy nodes shown in the figure: Root, Organization, Government, Corporation, Person, Actor, Location, City.

An Empirical Study on Multiple Information Sources for Zero-Shot Fine-Grained Entity Typing (Chen et al., EMNLP 2021)

Motivation
- Existing methods introduce auxiliary information to build semantic connections between the seen and unseen types.
- Challenge: the power of auxiliary information has not been sufficiently exploited, and the effect of each information source is not yet clearly understood.

Auxiliary information sources
- Source 1, context consistency: a correct type should be semantically consistent with the context when the mention is replaced by the type name.
  Consistent: "Corporation and Midway are two of the five airlines with which Budget has agreements."
  Inconsistent: "Drug and Midway are two of the five airlines with which Budget has agreements."
- Source 2, type hierarchy: the ontology structure connecting the seen and unseen types, e.g. Organization with Government and Corporation.
- Source 3, background knowledge:
  Prototypes are carefully selected candidate mentions for a type and provide a mention-level summary of it, e.g. western_union, quebecor, merrill, rtc, ...
  Descriptions are queried from WordNet glosses by type name and provide a brief high-level summary of each type, e.g. "a business firm whose articles of incorporation have been approved in some state."

Multi-source fusion model
- Context-Consistency-Aware (CA) module: measures context consistency with large-scale pre-trained language models, e.g. BERT.
- Type-Hierarchy-Aware (HA) module: uses a Transformer encoder to model the hierarchical dependency among types.
- Background-Knowledge-Aware (KA) module: models ZFET with knowledge as natural language inference, with a translation-based solution.

Context-Consistency-Aware (CA) module
- Figure: BERT receives "[CLS] [MASK] and Midway are two of ... [SEP]", and candidate type names such as drug, corporation, and government are scored at the masked position.
- The slide gives a loss function (a log term over positive types) and a prediction score. Both equations are garbled in the extracted text.

Type-Hierarchy-Aware (HA) module
- Figure nodes: Organization, Person, Location, Corporation, Government, Company, Actor.
- Hierarchy-aware type encoder: a type attends only to its parent type in the hierarchy and to itself; attention to all remaining types is masked. The mask is added inside the softmax of the attention. The exact expressions for the mask and for the final type representation are garbled in the extracted text.
- Mention-context encoder: the entity mention and its context are each represented as a weighted sum of their ELMo word representations. The mention representation and the context representation are then concatenated as the final representation.
- Prediction score and loss: the score compares the mention-context representation with the type representations, and the loss is a cross-entropy. The expressions are garbled in the extracted text.

Background-Knowledge-Aware (KA) module
- Inference from background knowledge uses multiple premises and one hypothesis:
  Context-based premise: "Northwest and Midway are two of the five airlines with which Budget has agreements."
  Prototypes-based premise: "/organization/corporation has the following prototypes: western_union, ..."
  Description-based premise: "/organization/corporation denotes a collection of business firms whose ..."
  Hypothesis: "/organization/corporation is a correct type for the mention Northwest."
- Encoding: the three premises and the hypothesis are each encoded into a vector. The encoder equations are garbled in the extracted text.
- Prediction score: all representations are projected into an inference space. When the hypothesis can be inferred from the premises, the sum of the projected premises should be close to the projected hypothesis, so the model minimizes their squared Euclidean distance. Writing the projected context, prototype, and description premises as $\tilde{p}_c$, $\tilde{p}_p$, $\tilde{p}_d$ and the projected hypothesis as $\tilde{h}$ (symbol names chosen here, since the originals are lost):

  $$d = \lVert \tilde{p}_c + \tilde{p}_p + \tilde{p}_d - \tilde{h} \rVert_2^2$$

- The score for a type is defined from this distance, and the loss contrasts positive and negative types. Both expressions are garbled in the extracted text.

Fusion
- Overall loss function: the sum of the three module losses,

  $$\mathcal{L} = \mathcal{L}_{CA} + \mathcal{L}_{HA} + \mathcal{L}_{KA}$$

- Overall prediction score: the scores from CA, HA, and KA are each normalized with a sigmoid (the argument of the sigmoid is garbled in the extracted text). The final decision score of the fusion model is a convex combination of the normalized scores $\hat{s}$, with weights written here as $\lambda_1, \lambda_2, \lambda_3$:

  $$\text{score} = \lambda_1 \hat{s}_{CA} + \lambda_2 \hat{s}_{HA} + \lambda_3 \hat{s}_{KA}, \qquad \lambda_1, \lambda_2, \lambda_3 \ge 0, \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1$$

- Figure labels: Mention, Context, Type; CA, HA, KA; Fusion; Overall Loss.

Experimental setup
- Datasets: the slide shows statistics of the training and test datasets. Types of both BBN and Wiki are organized into a 2-level hierarchy. There are 47 types in BBN and 113 types in Wiki.
- Zero-shot setting: the coarse-grained (Level-1) types such as /organization are used for training (seen types), while the fine-grained (Level-2) types such as /organization/corporation are reserved for testing (unseen types).

Main results
- Zero-shot performance on the unseen types.
- Supervised performance on the seen types.

Ablation studies
- Ablation results on BBN unseen types (Ma-F1).
- Impact of context length: performance on the unseen types of BBN relative to the context length.
- Complementarity among different information sources: Venn diagram of the test cases of unseen types correctly predicted by each module.

Thank you for watching.
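The fusion rule of the multi-source model (normalize each module's scores with a sigmoid, then take a convex combination) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the slide's normalization formula is garbled, so standardizing each module's scores (subtract the mean, divide by the standard deviation) before the sigmoid is an assumption, and the function names, weights, and toy scores are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def normalize(scores):
    """Map one module's raw scores into (0, 1).

    Assumption: scores are standardized across candidate types before the
    sigmoid; the slide's exact normalization is not recoverable.
    """
    values = list(scores.values())
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    return {t: sigmoid((v - mean) / std) for t, v in scores.items()}


def fuse(ca, ha, ka, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Convex combination of the normalized CA, HA, and KA scores."""
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    w_ca, w_ha, w_ka = weights
    ca_n, ha_n, ka_n = normalize(ca), normalize(ha), normalize(ka)
    return {t: w_ca * ca_n[t] + w_ha * ha_n[t] + w_ka * ka_n[t] for t in ca}


# Toy raw scores for the mention "Northwest"; all numbers are invented.
# The three modules live on different scales, which is why each one is
# normalized before the weighted sum.
types = ["/organization/corporation", "/organization/government", "/person/actor"]
ca = dict(zip(types, [2.0, 0.5, -1.0]))    # e.g. language-model consistency scores
ha = dict(zip(types, [0.7, 0.6, 0.1]))     # e.g. probabilities
ka = dict(zip(types, [-0.2, -0.9, -1.5]))  # e.g. negated distances

fused = fuse(ca, ha, ka, weights=(0.4, 0.3, 0.3))
print(max(fused, key=fused.get))  # /organization/corporation
```

Because the weights are non-negative and sum to 1, the fused score stays in (0, 1) and each weight can be read as the share of the decision given to that information source.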