Advances in Multi-Speaker Diarization Technology and Applications
Qingyang Hong (洪青陽); collaborators: 余洪涌, 姜躍猛, 李朝陽, 王捷, 李琳
Xiamen University Intelligent Speech Lab, March 2024

Outline
1. Research Background
2. Industrial Modular System
3. Improvements
4. Deployed Applications

1. Research Background

Multi-speaker diarization (speaker diarization): given a recording in which several people take turns speaking, the system must decide who is speaking during each time span. System input: audio; output: time-segmented speaker labels (a minimal sketch of this output follows below).

Application scenarios: meeting minutes, multi-speaker transcription, intelligent customer service, call-recording quality inspection, etc.
Terminal devices: smartphones, personal computers, voice recorders.
Vendors with such products: iFLYTEK (smart office notebook), Huawei (AI meeting minutes), 聲雲 (Shengyun, speech transcription).

Research trends: from modular architectures toward end-to-end architectures, and from simple toward complex scenarios.
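To make the task definition concrete, here is a minimal sketch of what a diarization system outputs: a list of (start, end, speaker) segments, printed in the common RTTM exchange format. All times and labels below are invented for illustration.

```python
# Minimal illustration of speaker-diarization output:
# for each time span of the input audio, who is speaking.
# Times and labels are invented for illustration.
segments = [
    (0.00, 3.20, "spk1"),   # seconds: start, end, speaker label
    (3.20, 7.10, "spk2"),
    (6.40, 7.10, "spk1"),   # overlapped speech: two active speakers
    (7.10, 12.40, "spk1"),
]

for start, end, spk in segments:
    # RTTM-style line: type, file id, channel, onset, duration, speaker
    print(f"SPEAKER meeting1 1 {start:.2f} {end - start:.2f} <NA> <NA> {spk} <NA> <NA>")
```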
[Timeline figure, 2000-2023. Competitions and datasets: Rich Transcription (RT), CALLHOME, AMI, MIXER6, DIHARD (I, II, III), CHiME-6, VoxSRC (2020-2023), M2MeT / AISHELL-4, AliMeeting, M2MeT 2.0, CHiME-7, set against the architecture shift from modular to end-to-end.]

Challenges: noise and interference, unknown number of speakers, overlapped speech, etc.
Application trends: offline to online, single microphone to microphone arrays, adaptation to new scenarios.

1. Research Background: Modular Systems

Clustering methods: AHC [1], SC [2,3], VB/VBx [4,5], UIS-RNN [6], DNC [7]. (A minimal spectral-clustering sketch follows the references below.)

[1] K. C. Gowda and G. Krishna, "Agglomerative Clustering Using the Concept of Mutual Nearest Neighbourhood," Pattern Recognition, vol. 10, pp. 105-112, 1978.
[2] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, pp. 395-416, 2007.
[3] T. Park, K. J. Han, M. Kumar, and S. S. Narayanan, "Auto-tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap," IEEE Signal Processing Letters, vol. 27, pp. 381-385, 2020.
[4] M. Diez, L. Burget, S. Wang, J. Rohdin, and H. Cernocky, "Bayesian HMM based x-vector Clustering for Speaker Diarization," Interspeech, 2019, pp. 346-350.
[5] M. Diez, L. Burget, F. Landini, and J. Cernocky, "Analysis of Speaker Diarization based on Bayesian HMM with Eigenvoice Priors," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 355-368, 2020.
[6] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang, "Fully Supervised Speaker Diarization," ICASSP, 2019.
[7] Q. J. Li, F. L. Kreyssig, C. Zhang, and P. C. Woodland, "Discriminative Neural Clustering for Speaker Diarisation," IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 2021.
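As an illustration of one clustering method from the list, here is a minimal spectral-clustering (SC) sketch over segment embeddings, in the spirit of [2,3]. The toy embeddings and the fixed cluster count are assumptions for the example; the NME-based auto-tuning of [3] is not implemented here.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Toy segment-level speaker embeddings (one row per segment);
# in practice these would be x-vectors extracted from speech.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 1, (5, 16)) + 3,   # pseudo speaker A
                 rng.normal(0, 1, (5, 16)) - 3])  # pseudo speaker B

# Cosine-similarity affinity matrix, shifted into [0, 1].
norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
affinity = (norm @ norm.T + 1.0) / 2.0

# Spectral clustering on the precomputed affinity. The number of
# speakers (2) is fixed here, whereas [3] estimates it from the
# normalized maximum eigengap (NME).
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
print(labels)  # two clusters -> two speakers
```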
1. Research Background: End-to-End Systems

EEND [1]: end-to-end model based on BiLSTM.
SA-EEND [2]: end-to-end model based on a Transformer encoder.
EDA-EEND [3]: EEND variant that can predict the number of speakers.
TS-VAD [4]: target-speaker voice activity detection model.
(A sketch of EEND's permutation-free training objective follows the references below.)

[1] Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watanabe, "End-to-end Neural Speaker Diarization with Permutation-free Objectives," Interspeech, 2019, pp. 4300-4304.
[2] Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, and S. Watanabe, "End-to-End Neural Speaker Diarization with Self-Attention," IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 2019, pp. 296-303.
[3] S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, and K. Nagamatsu, "End-to-end Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors," Interspeech, 2020, pp. 269-273.
[4] I. Medennikov, M. Korenevsky, et al., "Target-speaker Voice Activity Detection: a Novel Approach for Multi-speaker Diarization in a Dinner Party Scenario," arXiv:2005.07272, 2020.
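A minimal sketch of the permutation-free objective behind EEND [1]: the frame-level speaker activities are scored against every permutation of the reference speakers, and the lowest binary cross-entropy is taken, so the model is not penalized for emitting speakers in a different order. The tensor shapes and values are toy assumptions.

```python
from itertools import permutations

import torch
import torch.nn.functional as F

def pit_bce_loss(pred, ref):
    """Permutation-invariant BCE in the spirit of EEND's objective.
    pred: (T, S) frame-level speaker activity probabilities in (0, 1).
    ref:  (T, S) binary reference activities."""
    num_speakers = pred.shape[1]
    losses = [F.binary_cross_entropy(pred[:, list(p)], ref)
              for p in permutations(range(num_speakers))]
    return torch.stack(losses).min()

# Toy example: 4 frames, 2 speakers; the prediction has the speakers
# swapped relative to the reference, which the permutation search absorbs.
ref = torch.tensor([[1., 0.], [1., 0.], [0., 1.], [1., 1.]])  # last frame overlaps
pred = torch.tensor([[.1, .9], [.2, .8], [.9, .1], [.8, .9]])  # swapped order
print(pit_bce_loss(pred, ref))
```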
1. Research Background: Clustering Algorithm Summary

Algorithm | Training     | Input feature     | Overlap detection | Speaker-count estimation
----------|--------------|-------------------|-------------------|--------------------------
AHC       | unsupervised | x-vector          | not supported     | distance threshold
VB        | unsupervised | i-vector          | not supported     | initialization tuning
VBx       | unsupervised | x-vector          | not supported     | initialization tuning
SC        | unsupervised | x-vector          | not supported     | threshold / NME
UIS-RNN   | supervised   | d-vector          | not supported     | best suited to 2 speakers
DNC       | supervised   | d-vector          | supported         | output nodes
EEND      | supervised   | acoustic features | supported         | output nodes
TS-VAD    | supervised   | i-vector          | supported         | output nodes
Online version: research concentrates on the EEND [1,2] and UIS-RNN [3,4] frameworks.
Microphone-array version: multi-channel TS-VAD [5], or joint optimization of front-end and back-end.
Specific scenarios: different strategies for different scenarios [6].
(A naive online-assignment sketch follows the references below.)

[1] Y. Xue, S. Horiguchi, Y. Fujita, S. Watanabe, P. Garcia, and K. Nagamatsu, "Online End-to-end Neural Diarization with Speaker-tracing Buffer," IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 841-848.
[2] E. Han, C. Lee, and A. Stolcke, "BW-EDA-EEND: Streaming End-to-end Neural Speaker Diarization for a Variable Number of Speakers," ICASSP, 2021.
[3] E. Fini and A. Brutti, "Supervised Online Diarization with Sample Mean Loss for Multi-domain Data," ICASSP, 2020, pp. 7134-7138.
[4] X. Wan, K. Liu, and H. Zhou, "Online Speaker Diarization Equipped with Discriminative Modeling and Guided Inference," Interspeech, 2021.
[5] I. Medennikov, M. Korenevsky, et al., "Target-speaker Voice Activity Detection: a Novel Approach for Multi-speaker Diarization in a Dinner Party Scenario," arXiv:2005.07272, 2020.
[6] Y.-X. Wang, J. Du, M.-K. He, S.-T. Niu, L. Sun, and C.-H. Lee, "Scenario-dependent Speaker Diarization for DIHARD-III Challenge," Interspeech, 2021.
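The cited online systems are beyond a short example, but the core difficulty they share, labeling each new segment without seeing future audio, can be shown with a naive greedy scheme (a generic toy baseline, not the methods of [1-4]): assign each incoming embedding to the nearest running speaker centroid, or open a new speaker when the similarity falls below a threshold.

```python
import numpy as np

def online_assign(embedding, centroids, threshold=0.6):
    """Naive online speaker assignment: nearest centroid by cosine
    similarity, or a new speaker if nothing is similar enough.
    A toy baseline, not the cited EEND/UIS-RNN approaches."""
    e = embedding / np.linalg.norm(embedding)
    if centroids:
        sims = [float(e @ (c / np.linalg.norm(c))) for c in centroids]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            centroids[best] = centroids[best] + e   # running update
            return best
    centroids.append(e.copy())                      # open a new speaker
    return len(centroids) - 1

centroids = []
rng = np.random.default_rng(1)
stream = [rng.normal(3, 1, 8), rng.normal(3, 1, 8), rng.normal(-3, 1, 8)]
print([online_assign(x, centroids) for x in stream])  # [0, 0, 1] with this seed
```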
2. Industrial Modular System

2.1 Audio segmentation. Function: turn diarization into a clustering problem.
2.2 Speaker representation extraction. Function: extract segment-level speaker representations (i-vector, d-vector, x-vector).
2.3 Clustering. Function: group segments belonging to the same speaker, using agglomerative hierarchical clustering (AHC) [K. C. Gowda and G. Krishna, "Agglomerative Clustering Using the Concept of Mutual Nearest Neighbourhood," Pattern Recognition, vol. 10, pp. 105-112, 1978]. (A minimal AHC sketch follows.)
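A minimal AHC sketch over segment embeddings using SciPy, in the spirit of step 2.3. The toy embeddings, the cosine metric, and the distance threshold are assumptions for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy segment-level embeddings; in the real system these would be
# x-vectors from step 2.2 (possibly after PCA).
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, .5, (4, 16)) + 2,
                 rng.normal(0, .5, (4, 16)) - 2])

# Agglomerative hierarchical clustering on cosine distances; cutting
# the dendrogram at a distance threshold determines the number of
# speakers (the "threshold" entry in the summary table above).
dist = pdist(emb, metric="cosine")
Z = linkage(dist, method="average")
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)  # two clusters -> two speakers
```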
2. Industrial Modular System: First-Generation Product (integrated with ASV-Subtools*)

Pipeline modules: raw audio -> voice activity detection (VAD) -> speaker diarization (SD) -> automatic speech recognition (ASR) -> recognition post-processing, yielding a transcript attributed to speaker 1 / speaker 2 / speaker 3 / speaker 4.

Algorithm flow: VAD -> uniform segmentation -> x-vector extraction with Subtools -> PCA dimensionality reduction -> cosine scoring -> AHC clustering.

* ASV-Subtools: https:/

3. Improvements

Motivation: when extracting x-vectors, overlapped speech should be excluded and speech from the same speaker merged. Approach: split the audio into chunks, and let a neural network decide which speakers are active in each chunk, at most 3 speakers per chunk.

Full pipeline: Pyannote neural-network segmentation [1] on 5 s chunks -> x-vector extraction with Subtools -> AHC clustering -> diarization result. Each speaker's audio is then assembled according to the clustering result, including the handling of overlapped speech. The segmentation network (LSTM -> feed-forward -> classifier) is exported and run as an ONNX model [2]. (An inference sketch follows the footnotes.)

[1] "pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe," Interspeech 2023. https:/
[2] https:/
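A minimal sketch of running the exported segmentation model with ONNX Runtime. The model path, input layout, and output shape are assumptions for the example; the real model is the LSTM -> feed-forward -> classifier network described above.

```python
import numpy as np
import onnxruntime as ort

# Hypothetical path to the exported segmentation model.
sess = ort.InferenceSession("segmentation.onnx",
                            providers=["CPUExecutionProvider"])

# Query the model's actual input name rather than hard-coding it.
inp = sess.get_inputs()[0]
print(inp.name, inp.shape)

# Toy 5 s mono chunk at 16 kHz, shaped (batch, channel, samples);
# the expected layout depends on how the model was exported.
chunk = np.zeros((1, 1, 16000 * 5), dtype=np.float32)
outputs = sess.run(None, {inp.name: chunk})

# First output: frame-level speaker activities,
# e.g. (batch, frames, speakers <= 3).
print(outputs[0].shape)
```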
3. Improvements: C Rewrite

Problem: the Python implementation is too slow for practical use, with a real-time factor (RTF, processing time divided by audio duration) far above 1.
Solution: reimplement the whole pipeline in C, bringing the RTF down to 0.03-0.06.
Runtime environment: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20 GHz.

4. Deployed Applications

聲雲 (Shengyun) speech transcription, downloadable from the major app stores.
New version in production: roughly 330-350 hours of audio transcribed per day, with zero complaints.
Advantages of the in-house engine: Mandarin transcription with speaker separation; transcribing 1 hour of audio takes about 4 minutes (a competitor takes 5 minutes, and its service is often congested at night); offline audio tasks longer than 5 hours are supported (the competitor caps uploads at 5 hours).

End of report; comments and corrections are welcome!
Xiamen University Intelligent Speech Lab