On Cooperative Self-Explaining NLP Models
2023/06/17, Jun Wang

Summary
- Interpretability
- Cooperative self-explaining framework (RNP) and spurious correlations
- Our insights on cooperative games and solutions for spurious correlations in RNP
- Future work

Interpretability
- Growing concern about model interpretability in various critical fields.

Interpretability in the LLM Era (GPT-3 & Beyond, Christopher Potts, Stanford, 2023/01)
- LLMs can generate post-hoc explanations that seem highly plausible.
- LLMs remain a huge black box, which may pose a problem for scenarios that require an interpretable underlying mechanism to ensure trustworthiness.
- The processing cost and speed of LLMs when handling vast amounts of data, such as user reviews on large-scale websites, also pose a challenge.

Expectations for Interpretability
- Both faithful (reflecting the model's actual behavior) and plausible (aligning with human understanding).

Various Methods for Interpretability
- Post-hoc methods require additional surrogate models to explain the existing models being interpreted; it is difficult to ensure faithfulness, especially for black-box models.
- Ante-hoc (self-explaining) models incorporate interpretability into the model design and ensure faithfulness: model predictions are based on the informative explanations generated by the model itself.
(Lei et al., Rationalizing Neural Predictions, EMNLP-2016)

Cooperative Self-Explaining Framework: RNP and Spurious Correlations

Cooperative Self-Explaining Framework: RNP
- Rationalizing Neural Predictions (RNP) sets up a cooperative game between an explainer (or generator) and a predictor: the explainer identifies a human-interpretable subset of the input (referred to as the rationale) and passes it to the subsequent predictor for making predictions.
- Significant advantage: certification of exclusion. Any unselected part of the input is guaranteed to make no contribution to the prediction, which ensures that faithfulness is maintained.
(Lei et al., Rationalizing Neural Predictions, EMNLP-2016; Liu et al., FR: Folded Rationalization with a Unified Encoder, NeurIPS-2022)

Objective of RNP
- Cooperative prediction loss plus a regularizer on the rationale: sparsity and coherency.
- Rationale selection: Gumbel-Softmax or reinforcement learning. A minimal sketch of this objective follows below.
(Lei et al., Rationalizing Neural Predictions, EMNLP-2016; Liu et al., Decoupled Rationalization with Asymmetric Learning Rates: A Flexible Lipschitz Restraint, KDD-2023)
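The following is a minimal PyTorch sketch of how such an objective can be instantiated, assuming a GRU-based generator and predictor, Gumbel-Softmax mask sampling, and illustrative hyper-parameters (sparsity_target, lam_s, lam_c); it is a reconstruction for illustration, not the authors' released code.

```python
# Minimal sketch of the RNP objective (illustrative; names and sizes are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    """Scores each token; Gumbel-Softmax turns the scores into a (soft) binary mask."""
    def __init__(self, vocab_size, emb_dim=100, hid_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hid_dim, 2)          # per-token logits: {drop, keep}

    def forward(self, x, tau=1.0):
        h, _ = self.rnn(self.emb(x))                    # (B, T, 2*hid)
        logits = self.score(h)                          # (B, T, 2)
        # Differentiable (straight-through) sampling of a binary mask per token.
        mask = F.gumbel_softmax(logits, tau=tau, hard=True)[..., 1]   # (B, T)
        return mask

class Predictor(nn.Module):
    """Predicts the label from the masked (rationale-only) input."""
    def __init__(self, vocab_size, emb_dim=100, hid_dim=128, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hid_dim, n_classes)

    def forward(self, x, mask):
        e = self.emb(x) * mask.unsqueeze(-1)            # zero out unselected tokens
        h, _ = self.rnn(e)
        return self.out(h.mean(dim=1))                  # mean-pool then classify

def rnp_loss(logits, y, mask, sparsity_target=0.2, lam_s=1.0, lam_c=1.0):
    """Cooperative prediction loss + sparsity and coherency regularizers."""
    pred_loss = F.cross_entropy(logits, y)
    sparsity = (mask.mean() - sparsity_target).abs()            # keep ~20% of tokens
    coherency = (mask[:, 1:] - mask[:, :-1]).abs().mean()       # encourage contiguous spans
    return pred_loss + lam_s * sparsity + lam_c * coherency

# Usage: one cooperative training step on a toy batch.
gen, pred = Generator(vocab_size=5000), Predictor(vocab_size=5000)
opt = torch.optim.Adam(list(gen.parameters()) + list(pred.parameters()), lr=1e-4)
x = torch.randint(0, 5000, (8, 40))                     # batch of token ids
y = torch.randint(0, 2, (8,))
mask = gen(x)
loss = rnp_loss(pred(x, mask), y, mask)
opt.zero_grad(); loss.backward(); opt.step()
```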
RNP Is Domain-Agnostic
- Rationales for texts, rationales for graphs, rationales for images.
(Liu et al., Decoupled Rationalization with Asymmetric Learning Rates: A Flexible Lipschitz Restraint, KDD-2023; Luo et al., Parameterized Explainer for Graph Neural Network, NeurIPS-2020; Yuan et al., Interpreting Image Classifiers by Generating Discrete Masks, TPAMI, 2022)

Spurious Correlations in RNP
- Feature correlation: comes from the generation process of the original dataset. Wolves always appear together with snow! A favorable taste often correlates with an appealing aroma!
- Degeneration (mask correlation): stems from the rationale (mask) selection in the cooperative game. The predictor overfits to meaningless but distinguishable selections!
(Ribeiro et al., "Why Should I Trust You?": Explaining the Predictions of Any Classifier, KDD-2016; Liu et al., MGR: Multi-generator Based Rationalization, ACL-2023)

Previous Methods to Alleviate Degeneration
- Use additional regularization modules to make use of the full text, such that the predictor does not rely entirely on the rationale provided by the generator.
(Liu et al., FR: Folded Rationalization with a Unified Encoder, NeurIPS-2022)

Our Insights on Cooperative Games and Solutions for Spurious Correlations in RNP
- Solution 1: Folded Rationalization (FR) for degeneration
- Solution 2: Decoupled Rationalization (DR) for degeneration
- Solution 3: Multi-Generator Rationalization (MGR) for spurious correlations
Solution 1: Folded Rationalization (FR) for Degeneration

Our Observations on RNP
- Uncoordinated learning paces: the rationale quality gets better when the learning rate of the predictor is smaller than that of the generator.
- Learning of the generator is harder!
(Liu et al., FR: Folded Rationalization with a Unified Encoder, NeurIPS-2022)

Folded Rationalization (FR)
- A frustratingly simple but effective method without additional modules: it folds the two phases of current rationalization methods into one using a unified encoding mechanism (RNP -> FR: fold).
- Mutual reinforcement between the generator and predictor: the predictor with the unified encoder has a global view of all the rationale candidates through direct access to the input text, and the predictor is enforced to keep the same learning pace as the generator (a minimal sketch follows below).
(Liu et al., FR: Folded Rationalization with a Unified Encoder, NeurIPS-2022)
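A minimal sketch of the folded design, in the same illustrative style as the RNP sketch above: one shared encoder is used both to score tokens on the full text and to encode the selected rationale for prediction. The class and parameter names are assumptions, not the released FR implementation.

```python
# Illustrative sketch of a "folded" rationalizer with one shared (unified) encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FoldedRationalizer(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hid_dim=128, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        self.token_score = nn.Linear(2 * hid_dim, 2)    # per-token {drop, keep} logits
        self.classifier = nn.Linear(2 * hid_dim, n_classes)

    def forward(self, x, tau=1.0):
        # Pass 1: encode the full text with the shared encoder and select the rationale.
        h_full, _ = self.encoder(self.emb(x))
        mask = F.gumbel_softmax(self.token_score(h_full), tau=tau, hard=True)[..., 1]
        # Pass 2: re-encode only the selected tokens with the *same* encoder and predict.
        h_rat, _ = self.encoder(self.emb(x) * mask.unsqueeze(-1))
        logits = self.classifier(h_rat.mean(dim=1))
        return logits, mask

# Both passes share one set of encoder parameters, so every gradient step updates the
# "generator" and "predictor" views together, keeping their learning paces aligned.
model = FoldedRationalizer(vocab_size=5000)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
```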
Results of FR
- FR improves the F1 score by up to 10.3% (decorrelated dataset).
(Liu et al., FR: Folded Rationalization with a Unified Encoder, NeurIPS-2022)
Solution 2: Decoupled Rationalization (DR) for Degeneration

Further Observations on RNP
- Let λ denote the ratio between the learning rate of the predictor and the learning rate of the generator. (The values in the table cells are F1 scores for different λ.)
(Liu et al., Decoupled Rationalization with Asymmetric Learning Rates: A Flexible Lipschitz Restraint, KDD-2023)

Decoupled Rationalization (DR)
- Asymmetric learning rates for the cooperative game of RNP: directly make the learning rate of the predictor lower than the learning rate of the generator, as sketched below.
- No modification to the basic framework of RNP.
- Opposite to adversarial games: they speed up the critic, while we slow down the predictor.
- A simple heuristic but empirically effective method.
(Liu et al., Decoupled Rationalization with Asymmetric Learning Rates: A Flexible Lipschitz Restraint, KDD-2023)
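A minimal sketch of the asymmetric-learning-rate idea, reusing the illustrative Generator, Predictor, and rnp_loss from the RNP sketch above; the base learning rate and the ratio lam are assumed values, not the paper's tuned settings.

```python
# DR as a training recipe: same RNP objective, but the predictor gets a smaller step size.
import torch

gen, pred = Generator(vocab_size=5000), Predictor(vocab_size=5000)
lr_gen = 1e-4
lam = 0.1                                            # lambda = lr_pred / lr_gen < 1 (assumed value)
opt_gen = torch.optim.Adam(gen.parameters(), lr=lr_gen)
opt_pred = torch.optim.Adam(pred.parameters(), lr=lam * lr_gen)   # slower predictor

def train_step(x, y):
    mask = gen(x)
    loss = rnp_loss(pred(x, mask), y, mask)          # same cooperative objective as before
    opt_gen.zero_grad(); opt_pred.zero_grad()
    loss.backward()
    opt_gen.step(); opt_pred.step()                  # only the step sizes differ
    return loss.item()
```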
Lipschitz Continuity
- A useful indicator of model stability and robustness for various tasks: robustness against adversarial examples; convergence stability of the discriminator in GANs (adversarial games); stability of closed-loop systems with reinforcement learning controllers.
- Reflects the surface smoothness of the functions corresponding to prediction models. For unstable models in optimization, the function surfaces usually have non-smooth patterns such as steep steps or spikes, where the model output may change a lot when the input changes by only a small amount.
- Measured by the Lipschitz constant: a smaller Lipschitz constant represents better Lipschitz continuity, as recalled below.
(Liu et al., Decoupled Rationalization with Asymmetric Learning Rates: A Flexible Lipschitz Restraint, KDD-2023)

Lipschitz Constant
(Liu et al., Decoupled Rationalization with Asymmetric Learning Rates: A Flexible Lipschitz Restraint, KDD-2023)
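For reference, the Lipschitz constant of a predictor f with respect to a distance d_s can be written as below (a standard definition used here for illustration; the slide's own formulation may differ in notation):

```latex
L_c \;=\; \sup_{Z_a \ne Z_b} \frac{\lvert f(Z_a) - f(Z_b)\rvert}{d_s(Z_a, Z_b)},
\qquad\text{i.e., the smallest } L \text{ such that }
\lvert f(Z_a) - f(Z_b)\rvert \le L \cdot d_s(Z_a, Z_b)\ \text{for all } Z_a, Z_b.
```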
Intuitions and Observations on Rationale Candidates
- Let Z_a and Z_b be rationale candidates selected from two reviews with clear sentiment tendencies opposite to each other.
- If Z_a and Z_b are uninformative candidates, then their semantic distance d_s(Z_a, Z_b) is generally small.
- If Z_a and Z_b are informative candidates, then their semantic distance d_s(Z_a, Z_b) is relatively larger.
(Liu et al., Decoupled Rationalization with Asymmetric Learning Rates: A Flexible Lipschitz Restraint, KDD-2023)

Correlation Between Degeneration and the Predictor's Lipschitz Continuity
- A small Lipschitz constant leads to a high likelihood of informative rationale candidates.
- Given any two rationale candidates Z_a and Z_b selected from input texts X_a and X_b with labels y_a = 0 and y_b = 1: if the predictor gives high-confidence predictions close to the true labels, the prediction errors ℓ_a and ℓ_b are very small, so 1 - ℓ_a and 1 - ℓ_b approach 1.
- If the Lipschitz constant L_c becomes sufficiently low, then d_s(Z_a, Z_b) must become sufficiently large. Only informative candidates can obtain a large d_s(Z_a, Z_b), so requiring a large d_s(Z_a, Z_b) inevitably forces the generator to select these informative candidates as rationales.
(Liu et al., Decoupled Rationalization with Asymmetric Learning Rates: A Flexible Lipschitz Restraint, KDD-2023)

Lower Bound of d_s(Z_a, Z_b)
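A minimal sketch of how such a bound can be obtained directly from the definition of the Lipschitz constant, assuming ℓ_a and ℓ_b denote prediction errors bounded in [0, 1]; the exact bound stated in the KDD-2023 paper may differ in form:

```latex
% By the definition of the Lipschitz constant L_c of the predictor f:
%   |f(Z_a) - f(Z_b)| <= L_c * d_s(Z_a, Z_b).
% With y_a = 0, y_b = 1 and confident predictions, f(Z_a) <= \ell_a and f(Z_b) >= 1 - \ell_b, so
\lvert f(Z_a) - f(Z_b)\rvert \;\ge\; (1-\ell_a) + (1-\ell_b) - 1
\quad\Longrightarrow\quad
d_s(Z_a, Z_b) \;\ge\; \frac{(1-\ell_a) + (1-\ell_b) - 1}{L_c}.
% A small L_c together with confident predictions therefore forces d_s(Z_a, Z_b) to be large,
% which only informative rationale candidates can achieve.
```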
Spectral Normalization: A Rigid Method
- Spectral normalization can restrict the Lipschitz constant with some manually selected cutoff values (see the sketch below).
- The rationale quality is improved, but the prediction performance is impaired.
(Liu et al., Decoupled Rationalization with Asymmetric Learning Rates: A Flexible Lipschitz Restraint, KDD-2023)
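As a concrete illustration of this rigid approach, the sketch below uses PyTorch's standard torch.nn.utils.spectral_norm to cap the Lipschitz constant of a single linear layer; the cutoff value is a hypothetical hyper-parameter, not the paper's setting.

```python
# torch.nn.utils.spectral_norm constrains the weight's spectral norm to 1; an explicit
# cutoff c can then be imposed by scaling the layer's output by c.
import torch
import torch.nn as nn

class SNLinear(nn.Module):
    """Linear layer whose Lipschitz constant (w.r.t. the L2 norm) is at most `cutoff`."""
    def __init__(self, in_dim, out_dim, cutoff=1.0):
        super().__init__()
        self.linear = nn.utils.spectral_norm(nn.Linear(in_dim, out_dim))
        self.cutoff = cutoff

    def forward(self, x):
        return self.cutoff * self.linear(x)

# E.g., swapping the predictor's output layer for a spectrally normalized one:
layer = SNLinear(256, 2, cutoff=5.0)   # cutoff chosen for illustration only
out = layer(torch.randn(8, 256))
```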
Correlation Between Learning Rates and the Lipschitz Constant
- The Lipschitz constant of the predictor exhibits explosive growth when λ ≥ 1, and is constrained to much smaller values when λ < 1 (with the learning rate of the generator fixed at 0.0001).
- DR obtains much smaller Lipschitz constants than RNP (DR: λ < 1; RNP: λ = 1).
(Liu et al., Decoupled Rationalization with Asymmetric Learning Rates: A Flexible Lipschitz Restraint, KDD-2023)

Lower Bound
- Optimizing the predictor's parameters would increase 1 - ℓ, and optimizing the generator's parameters would also increase d_s(Z_a, Z_b).
- To keep L_c small, d_s(Z_a, Z_b) must increase faster relative to the increase in 1 - ℓ. So we slow down the predictor and speed up the generator (i.e., λ < 1).

Results of DR
- Compared with FR, there is a further improvement.
(Liu et al., Decoupled Rationalization with Asymmetric Learning Rates: A Flexible Lipschitz Restraint, KDD-2023)

Analyses of Time Efficiency and Overfitting
- Lowering the learning rate of the predictor does not slow down the convergence of the training process.
- The training accuracy of RNP grows fast in the beginning, when only limited rationales have been sampled by the generator, which reflects that the predictor is trying to overfit to these randomly sampled rationales.
- Although RNP reaches a very high accuracy on the training dataset, it does not reach higher accuracy than our method on the development dataset, which also indicates overfitting. The prediction loss reflects a similar phenomenon.
Solution 3: Multi-Generator Rationalization (MGR) for Spurious Correlations

Feature Correlations and Degeneration
- While previous methods might be adept at addressing either feature correlations or degeneration, they are typically developed independently, failing to consider both issues simultaneously. We seek to solve the two problems simultaneously.

Multi-Generator Rationalization (MGR)
- The first method to simultaneously solve the feature-correlation and degeneration problems.
- Facilitates a broader view of the rationale candidates for the predictor by using multiple generators.
- Keeps only the first generator during inference, which is efficient in terms of both time and resource consumption.
(Liu et al., MGR: Multi-generator Based Rationalization, ACL-2023)

Diverse Training with Separate Learning Rates
- Goal: improve the diversity of rationales while guaranteeing the convergence of the rationalization model.
- Generators should be different from each other, to guarantee that the predictor continuously learns from diverse rationale candidates; yet the different generators should be able to reach the same convergence result.
- Proposal: set the learning rates of the generators separately. With base learning rate η and N generators, the learning rate of the i-th generator is i·η and the learning rate of the predictor is η/N. A minimal sketch follows below.
(Liu et al., MGR: Multi-generator Based Rationalization, ACL-2023)
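A minimal sketch of this training scheme, again reusing the illustrative Generator, Predictor, and rnp_loss from the RNP sketch above; N and the base learning rate eta are assumed values.

```python
# MGR-style training: N generators with separate learning rates, one shared predictor.
import torch

N, eta = 3, 1e-4
gens = [Generator(vocab_size=5000) for _ in range(N)]
pred = Predictor(vocab_size=5000)

# Separate learning rates: the i-th generator gets i * eta, the predictor gets eta / N.
gen_opts = [torch.optim.Adam(g.parameters(), lr=(i + 1) * eta) for i, g in enumerate(gens)]
pred_opt = torch.optim.Adam(pred.parameters(), lr=eta / N)

def train_step(x, y):
    # The predictor sees a rationale candidate from every generator in each step.
    losses = [rnp_loss(pred(x, m), y, m) for m in (g(x) for g in gens)]
    loss = torch.stack(losses).mean()
    for opt in gen_opts + [pred_opt]:
        opt.zero_grad()
    loss.backward()
    for opt in gen_opts + [pred_opt]:
        opt.step()
    return loss.item()

# At inference time, only the first generator is kept:
def predict(x):
    return pred(x, gens[0](x))
```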
- Separate learning rates improve performance.
- Keeping only one generator hardly influences the performance, indicating that the different generators finally converge to the same outputs and only one generator is required at inference time.
- Diversity evolution of rationale candidates (Appearance, Aroma, Palate aspects): the different generators achieve converged results!
(Liu et al., MGR: Multi-generator Based Rationalization, ACL-2023)

Results of MGR
- Correlated BeerAdvocate: MGR achieves an improvement of up to 20.9% over state-of-the-art rationalization methods in terms of F1 score.
- Decorrelated BeerAdvocate: MGR achieves performance comparable to the state-of-the-art methods DR and FR in terms of F1 score.
(Liu et al., MGR: Multi-generator Based Rationalization, ACL-2023)

Future Work
- Tackling both types of spurious correlations within a unified framework, founded on causal inference.
- Expanding insights from cooperative games to other domains, such as GNNs.

References
- W. Liu, J. Wang, H. Wang, R. Li, Y. Qiu, Y. K. Zhang, J. Han, Y. Zou. Decoupled Rationalization with Asymmetric Learning Rates: A Flexible Lipschitz Restraint. KDD 2023.
- W. Liu, H. Wang, J. Wang, R. Li, X. Li, Y. K. Zhang, Y. Qiu. MGR: Multi-generator Based Rationalization. ACL 2023.
- W. Liu, H. Wang, J. Wang, R. Li, C. Yue, Y. Zhang. FR: Folded Rationalization with a Unified Encoder. NeurIPS 2022.