GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, Jie Tang
Zhipu.AI, Tsinghua University
*Equal contribution. Work was done when ML and LZ interned at Zhipu.AI.

Abstract

We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken chatbot. It supports both Chinese and English, engages in real-time voice conversations, and varies vocal nuances such as emotion, intonation, speech rate, and dialect according to user instructions. GLM-4-Voice uses an ultra-low-bitrate (175 bps), single-codebook speech tokenizer with a 12.5 Hz frame rate derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. To efficiently transfer knowledge from text to speech modalities, we synthesize speech-text interleaved data from existing text pre-training corpora using a text-to-token model. We continue pre-training from the pre-trained text language model GLM-4-9B with a combination of unsupervised speech data, interleaved speech-text data, and supervised speech-text data, scaling up to 1 trillion tokens, achieving state-of-the-art performance in both speech language modeling and spoken question answering. We then fine-tune the pre-trained model with high-quality conversational speech data, achieving superior performance compared to existing baselines in both conversational ability and speech quality. The open models can be accessed through https://huggingface.co/THUDM/glm-4-voice-9b.

1 Introduction
The success of large language models (LLMs) has driven significant advancements in conversational AI, enabling the development of text-based chatbots and digital assistants. However, LLMs are primarily designed to process text input and generate text output, focusing on semantic and logical communication. In contrast, human communication extends beyond semantics, often conveying emotions and subtle nuances. Voice-based interaction therefore provides a more natural and intuitive medium for human-computer interaction, offering richer and more engaging user experiences.

Traditional spoken chatbots typically rely on a pipeline combining Automatic Speech Recognition (ASR), LLM processing, and Text-to-Speech (TTS) synthesis. While functional, this approach is often hindered by high latency, compounded errors introduced during the ASR and TTS stages, and a limited capacity to capture and express emotional nuances. Speech-language models (SpeechLMs), which process both speech input and output in an end-to-end manner, offer a promising approach for building spoken chatbots. Efforts such as [24, 17] have explored pre-training on speech data in a manner similar to large language models (LLMs). Similarly, Défossez et al. [12] scaled speech data to 7 million hours for model training. However, these approaches face a significant limitation: the relative scarcity of speech data compared to the extensive text corpora available online. This data imbalance makes it challenging to fully leverage the capabilities of text-based LLMs, ultimately constraining the intelligence of SpeechLMs. Other methods aim to align speech and text modalities [15, 42] by integrating a speech encoder and a text-to-speech module into existing LLMs and fine-tuning them on spoken dialogue datasets. While this approach provides a straightforward way to develop speech-to-speech models from LLMs, it lacks the ability to deliver truly human-like speech output due to the absence of dedicated speech pre-training. This limitation hinders these models from capturing the rich nuances and expressiveness inherent in human speech.

In this paper, we introduce GLM-4-Voice, an intelligent and human-like spoken chatbot. We use a single-codebook supervised speech tokenizer with a 12.5 Hz frame rate to efficiently represent speech. A flow-matching-based speech decoder is employed to convert speech tokens into natural-sounding speech. To bridge the gap between text and speech modalities, we conduct large-scale speech-text pre-training using 1 trillion tokens. This includes synthetic interleaved speech-text corpora derived from text pre-training data, as well as unsupervised speech data and supervised speech-text datasets (e.g., ASR and TTS). The resulting base model demonstrates strong performance across various tasks, including speech language modeling, spoken question answering, ASR, and TTS. To further enhance the chatbot's conversational capabilities, we fine-tune the base model on high-quality conversational datasets using a streaming thoughts template. This template alternates between outputting text and speech tokens, improving the model's ability to generate seamless, low-latency responses while maintaining high-quality performance.

2 Related Work

2.1 Speech Tokenization
Speech tokenizers, which transform an audio clip into discrete tokens, can be categorized into two directions. Neural acoustic codecs [44, 11, 23, 20] aim at reconstructing high-quality audio at low bitrates. Semantic tokens [19, 10] are extracted from speech representations learned with self-supervised learning on speech data. Recently, SpeechTokenizer [48] and Mimi [12] unify semantic and acoustic tokens as different residual vector quantization (RVQ) layers, but they also suffer from multiple tokens at the same position, leading to either parallel prediction of semantic and acoustic tokens, or degradation to semantic tokenizers for language models. CosyVoice [14] proposes a supervised semantic tokenizer derived from a speech recognition model and successfully applies the tokenizer to text-to-speech synthesis. The application of the tokenizer to speech language modeling is not explored.

2.2 Speech Language Modeling
Speech language models are autoregressive models pretrained on unsupervised speech data. Lakhotia et al. [24] first propose generative spoken language modeling (GSLM), which trains the next-token-prediction objective on discrete semantic tokens produced by self-supervised learning. AudioLM [5] proposes a hybrid tokenization scheme that combines these semantic tokens with acoustic tokens from a neural audio codec [44]. TWIST [17] trains the speech language model using a warm start from the pretrained text language model OPT [47]. Moshi [12] scales up the size of natural speech data in TWIST to 7 million hours. Spirit-LM [32] further extends TWIST by adding speech-text interleaving data curated from speech-text parallel corpora. However, the scarcity of speech-text parallel corpora restricts the scale of interleaving data.

2.3 End-to-End Spoken Chatbots

Early works on speech-to-speech models mainly focus on processing tasks like speech translation [8, 2].
Since the success of ChatGPT in text-based chatbots, many works have explored methods to develop speech-based chatbots that can understand and respond in speech. SpeechGPT [46] proposes to combine existing large language models (LLMs) with discrete speech representations to obtain speech conversational abilities. Moshi [12] proposes a full-duplex spoken dialogue framework based on their pretrained speech language model. Qwen-Audio [9] adapts pre-trained textual language models for speech understanding by aligning speech representations of the Whisper [36] encoder. The model can understand speech, but not generate speech. Llama-Omni [15] and Freeze-Omni [41] extend the method by adding a text-to-speech model after the language model to transform the text output into speech output. In this way language models can only control the content of speech, but not the styles and prosodies. Mini-Omni [42] directly fine-tunes language models to generate text and speech responses simultaneously with only instruction datasets. Without speech pre-training, the quality of both text and speech responses is severely limited, as we will show in the experiments.

3 Architecture
In this section, we introduce the architecture of GLM-4-Voice. Our goal is to build a human-like, end-to-end spoken chatbot with high intelligence. To achieve this, the model must 1) comprehend the user's speech and provide a semantically accurate response, and 2) follow the user's spoken instructions, generating speech with paralinguistic features that meet the user's expectations. Inspired by the successful pre-training and fine-tuning paradigm used in LLMs, we believe that these capabilities for spoken chatbots can be best developed through extensive pre-training on diverse speech corpora, rather than simply fine-tuning existing LLMs with speech question-answering data, as in recent spoken chatbot approaches [15, 42].

To achieve this goal, GLM-4-Voice is designed with minimal modifications to the autoregressive transformer architecture. For speech tokenization, we utilize a supervised speech tokenizer, which effectively captures semantic information at an ultra-low bitrate (175 bps) while maintaining high-quality speech reconstruction. Additionally, we adopt a single-codebook approach for speech tokenization, avoiding the complex architectural adjustments often required for multi-layer speech token generation [12, 42]. This approach helps preserve the model's text processing capabilities while enabling efficient speech modeling. Furthermore, the model employs a unified speech representation for both input and output, enabling next-token prediction for speech data and facilitating efficient pre-training on unsupervised speech corpora.

We use the same speech tokenizer and speech decoder as described in Zeng et al. [45]. To enable low-latency interaction, we adapt the speech decoder to support streaming inference and design a streaming thoughts template capable of alternating between text and speech tokens during the supervised fine-tuning stage, as detailed in Section 3.3 and Section 3.2.

3.1 Speech Tokenization

[Figure 1: Architecture of the Speech Tokenizer and Speech Decoder for GLM-4-Voice.]

The speech tokenizer converts continuous waveforms into discrete speech tokens, which preserve semantic information and a part of the acoustic information. Previous methods can be categorized into two directions.
Acoustic tokenizers are trained with reconstruction/adversarial objectives on the speech waveform. Acoustic tokens preserve enough information to reconstruct the original audio, but to represent the additional information they rely on either a high sampling rate (i.e., number of tokens per second) or residual vector quantization [44] (i.e., multiple stacked codebooks). Semantic tokens are extracted from self-supervised representations learned on automatically discovered speech units [19]. Semantic tokens discard additional information that is unnecessary to represent the semantic meaning of speech, but also result in low-quality speech synthesis and a loss of acoustic details [31]. The ideal speech tokenizer for speech-text language modeling should have several key features: 1) a low sampling rate with a single codebook to support autoregressive generation; 2) alignment with text to transfer knowledge from pretrained language models; 3) support for high-quality speech synthesis.

We adopt the 12.5 Hz speech tokenizer variant described in Zeng et al. [45]. To make the paper self-contained, we briefly describe the architecture of the speech tokenizer. Inspired by the supervised semantic tokenizer in text-to-speech synthesis [14], we finetune a pretrained automatic speech recognition model (we use whisper-large-v3 in the Whisper family [36]) with an additional pooling layer and a vector quantization layer [40] in the middle of the encoder. The codebook vectors are learned with exponential moving average (EMA), and we reset vectors whose mean usage falls below a certain threshold with randomly selected continuous representations before quantization to overcome codebook collapse, following Dhariwal et al. [13].
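The tokenizer training code is not reproduced in this report, so the following is only a minimal PyTorch sketch of the idea described above: a pooling layer followed by a single-codebook vector quantizer inserted between the two halves of an ASR encoder, with EMA codebook updates. The EMA decay (0.99) and commitment coefficient (10.0) follow the training details below, while the codebook size, feature dimension, pooling factor, and module wiring are illustrative assumptions rather than the actual GLM-4-Voice implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EMAVectorQuantizer(nn.Module):
    """Single-codebook VQ with EMA codebook updates (van den Oord et al. [40]).
    Codebook size and feature dimension are illustrative assumptions."""
    def __init__(self, codebook_size=16384, dim=1280, decay=0.99, commitment=10.0):
        super().__init__()
        self.decay, self.commitment = decay, commitment
        self.register_buffer("codebook", torch.randn(codebook_size, dim))
        self.register_buffer("embed_avg", self.codebook.clone())
        self.register_buffer("cluster_size", torch.zeros(codebook_size))

    def forward(self, h):                                    # h: (B, T, D) features
        flat = h.reshape(-1, h.size(-1))
        codes = torch.cdist(flat, self.codebook).argmin(-1)  # nearest code per frame
        quantized = self.codebook[codes].view_as(h)
        commit_loss = self.commitment * F.mse_loss(h, quantized.detach())
        if self.training:
            with torch.no_grad():                            # EMA codebook update
                one_hot = F.one_hot(codes, self.codebook.size(0)).type_as(flat)
                self.cluster_size.mul_(self.decay).add_(one_hot.sum(0), alpha=1 - self.decay)
                self.embed_avg.mul_(self.decay).add_(one_hot.t() @ flat, alpha=1 - self.decay)
                self.codebook.copy_(self.embed_avg / self.cluster_size.clamp(min=1e-5).unsqueeze(1))
                # Resetting rarely used codes to random encoder outputs would go here.
        quantized = h + (quantized - h).detach()             # straight-through estimator
        return quantized, codes.view(h.shape[:-1]), commit_loss

class BottleneckedEncoder(nn.Module):
    """Hypothetical wiring: pooling + VQ between the two halves of an ASR encoder;
    the ASR decoder is then trained on the quantized features."""
    def __init__(self, front_half, back_half, pool_stride=4):   # e.g. 50 Hz -> 12.5 Hz
        super().__init__()
        self.front, self.back = front_half, back_half
        self.pool = nn.AvgPool1d(kernel_size=pool_stride, stride=pool_stride)
        self.vq = EMAVectorQuantizer()

    def forward(self, mel):
        h = self.front(mel)                                  # (B, T, D)
        h = self.pool(h.transpose(1, 2)).transpose(1, 2)     # reduce the frame rate
        q, codes, commit_loss = self.vq(h)
        return self.back(q), codes, commit_loss              # logits for the ASR loss
```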
Causality for Streaming Inference. To enable streaming encoding of input speech during inference, we adapt the architecture of the Whisper encoder to introduce causality [45]. Specifically, we replace the convolution layer before the encoder Transformer with causal convolution [39]. We also replace the bidirectional attention in the encoder with block causal attention.
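As a concrete illustration of block causal attention, the sketch below builds a boolean mask in which every frame may attend to all frames in its own block and in earlier blocks, but not to later ones; the block length used in the example is an assumed value, not one reported in the paper.

```python
import torch

def block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean mask of shape (seq_len, seq_len): entry (i, j) is True when query
    frame i may attend to key frame j, i.e. j lies in the same or an earlier block."""
    block_ids = torch.arange(seq_len) // block_size
    return block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)

# Example: with block_size=4, frame 5 (block 1) may attend to frames 0-7 but not 8+.
mask = block_causal_mask(seq_len=12, block_size=4)
# The mask can be passed to torch.nn.functional.scaled_dot_product_attention via attn_mask=mask.
```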
Training Details. We fine-tune the vector-quantized Whisper model with a collection of ASR datasets, including LibriSpeech [34], GigaSpeech [7], MLS-Eng [35], Wenet [43], CommonVoice [3], AISHELL-1 [6], and a proprietary Chinese ASR dataset of 10k hours. We also include 700k hours of unsupervised speech data with pseudo labels generated by whisper-large-v3 [36] for English and paraformer-large [1] for Chinese. All of our speech tokenizers are fine-tuned from whisper-large-v3 for 2 epochs with batch size 4096 and learning rate 1e-5. The ratio of supervised samples to pseudo-labeled samples is 1:3. The codebook vectors are updated with exponential moving average with decay coefficient 0.99, and the commitment loss coefficient is 10.0. To reduce the information loss of average pooling, we increase the codebook size as the sampling rate decreases.

Evaluation. We measure the preservation of semantic information in the speech tokens by the accuracy of the finetuned ASR model. The results on LibriSpeech [34] and AISHELL-1 [6] are shown in Table 1, with whisper-large-v3 [36] and SenseVoice-Large [1] as baselines. Overall, all the tokenizers preserve enough semantic information to achieve accurate ASR performance. Considering the reconstruction results in the following section, we select the 12.5 Hz tokenizer for GLM-4-Voice.

Table 1: Evaluation results of speech tokenizers and decoders. LS stands for LibriSpeech. Evaluation on LibriSpeech (English) is measured using word error rate (WER), while AISHELL-1 (Chinese) is evaluated using character error rate (CER). We fine-tuned the ASR model whisper-large-v3 with vector quantization and various pooling layers to create tokenizers with different sampling rates. For further development of GLM-4-Voice, we selected the 12.5 Hz variant.

                        Frame Rate  Bitrate (bps)  ASR: LS-clean  LS-other  AISHELL-1  Recon.: WER  ViSQOL  MOSNet
SpeechTokenizer         50 Hz       1.50K          -              -         -          9.97         1.53    2.67
SpeechTokenizer         50 Hz       4.00K          -              -         -          6.32         3.07    3.10
Moshi (Mimi)            12.5 Hz     1.10K          -              -         -          8.36         2.82    2.89
whisper-large-v3        50 Hz       -              2.50           4.53      9.31       -            -       -
SenseVoice-Large        50 Hz       -              2.57           4.28      2.09       -            -       -
GLM-4-Voice-Tokenizer   12.5 Hz     175            2.10           4.90      3.02       8.43         2.52    3.39
GLM-4-Voice-Tokenizer   50 Hz       600            1.85           3.78      2.70       6.24         2.67    3.38
GLM-4-Voice-Tokenizer   25 Hz       300            1.94           4.16      2.86       6.80         2.60    3.33
GLM-4-Voice-Tokenizer   6.25 Hz     100                                                14.41        2.34    3.24

3.2 Speech Decoder
The speech decoder synthesizes speech waveforms from discrete speech tokens and is crucial for ensuring the quality and expressiveness of generated speech. To minimize latency during speech interaction, the decoder must also support streaming inference. As in Zeng et al. [45], we adopt the decoder architecture of CosyVoice [14], which comprises a speech token encoder, a conditional flow matching model [28], and a HiFi-GAN vocoder [22].

Training Details. We train the speech token encoder and the flow matching model from scratch, with a two-stage training paradigm to fully utilize the abundant speech data of varied quality. During the pre-training stage, we use all the speech samples in the unsupervised speech data, covering various speakers and quality levels. During the fine-tuning stage, we use high-quality speech samples from a single speaker.

[Figure 2: Left: Data construction of the two training stages of GLM-4-Voice. Right: Model architecture of GLM-4-Voice.]
Support for Streaming Inference. To enable streaming inference and reduce latency, we incorporate truncated audio samples (i.e., the first n·b seconds of the audio, where n = 1, 2, 3, ..., and b is the block size) during the fine-tuning stage. This prepares the model to handle streaming scenarios effectively. During inference, the decoder processes speech tokens corresponding to the first n·b seconds of audio. It uses the speech from the initial (n−1)·b seconds as the prompt and predicts the speech content from (n−1)·b to n·b seconds. This approach allows the model to generate speech with a minimum delay of b seconds. Based on empirical studies, we set b = 0.8 for GLM-4-Voice, which implies that at least 10 speech tokens are required to generate the initial speech output.
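To make the chunked decoding scheme concrete, here is a minimal sketch of the inference loop it implies: at step n the decoder receives the tokens covering the first n·b seconds, conditions on the audio already synthesized for the first (n−1)·b seconds, and emits only the newest b-second chunk. The decoder.synthesize interface is a hypothetical placeholder, not the CosyVoice API.

```python
FRAME_RATE = 12.5                                   # speech tokens per second
BLOCK_SECONDS = 0.8                                 # b: one 0.8 s block = 10 tokens
TOKENS_PER_BLOCK = int(FRAME_RATE * BLOCK_SECONDS)

def stream_decode(decoder, speech_tokens):
    """Yield b-second waveform chunks as speech tokens arrive.

    At step n the decoder sees the tokens for the first n*b seconds, uses the
    audio already produced for the first (n-1)*b seconds as its prompt, and
    returns only the newest chunk.
    """
    produced_chunks = []
    num_blocks = len(speech_tokens) // TOKENS_PER_BLOCK
    for n in range(1, num_blocks + 1):
        tokens_so_far = speech_tokens[: n * TOKENS_PER_BLOCK]
        prompt_audio = produced_chunks                 # audio covering (n-1)*b seconds
        chunk = decoder.synthesize(tokens_so_far, prompt_audio)
        produced_chunks.append(chunk)
        yield chunk
```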
Evaluation. We take the reconstruction results from Zeng et al. [45] to demonstrate the performance of our speech decoder with low-bitrate speech tokens. We evaluate our speech decoder on speech reconstruction of LibriSpeech [34] and compare our tokenizer with SpeechTokenizer [48] and Mimi [12]. Following Défossez et al. [12], we also evaluate a variant of SpeechTokenizer that only keeps the first 3 RVQ layers to obtain a 1.5 kbps bitrate. Table 1 shows that our speech decoder performs well across various sampling rates, with the 12.5 Hz variant offering an optimal balance between efficiency and quality. It maintains high quality scores (MOSNet 3.39) and content preservation (WER 8.43) while significantly reducing the bitrate (175 bps).

3.3 Inference
Decoupling the Speech-to-Speech Task. An ideal speech language model would operate solely on speech tokens for direct speech-to-speech tasks. However, given the success of large language models and the assumption that text represents the semantic content of most speech, we decouple the speech-to-speech task into two sub-tasks: speech-to-text and speech-and-text-to-speech. Given the user's speech input Qs, the corresponding text response At, and the speech output As, these tasks are defined as follows:

- Speech-to-Text: The model generates a text response, At, based on the user's speech input, Qs.
- Speech-and-Text-to-Speech: Leveraging both Qs and At, the model generates spoken output, As, with adaptive tone and prosody to ensure conversational coherence.

We adopt the decoupling strategy for the inference process. First, the model generates the text answer At based on the user input Qs, and then generates As using both Qs and At. In this way the generation of the speech response As is guided by the text response At to improve performance. However, this approach results in a high initial token delay, as it requires waiting for the complete generation of At before starting on As. To address this, we apply a template called Streaming Thoughts. As illustrated in Figure 2, given Qs, the model alternates between outputting text and speech tokens at a specified ratio, which are then concatenated to form At and As, respectively. Specifically, based on our 12.5 Hz tokenizer, we alternate between generating 13 text tokens and 26 speech tokens. This 1:2 ratio is chosen to ensure that text generation is consistently faster than speech; otherwise, the generated speech tokens would lack the necessary context from the text tokens. The choice of 26 speech tokens is based on empirical observations, allowing the model to produce a coherent portion of content before synthesizing it to ensure accuracy in the synthesized speech.
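A minimal sketch of how a streaming-thoughts output can be de-interleaved back into the text answer At and the speech answer As is shown below. The 13:26 ratio comes from the paper; the integer placeholders in the example and the assumption that every chunk is completely filled are simplifications for illustration (in practice text and speech tokens occupy disjoint vocabulary ranges and the text stream may end early).

```python
TEXT_CHUNK, SPEECH_CHUNK = 13, 26     # tokens emitted alternately (1:2 ratio)

def split_streaming_thoughts(tokens):
    """De-interleave an output of the form [13 text, 26 speech, 13 text, ...]
    into the text answer A_t and the speech answer A_s."""
    text_tokens, speech_tokens = [], []
    i, taking_text = 0, True
    while i < len(tokens):
        size = TEXT_CHUNK if taking_text else SPEECH_CHUNK
        (text_tokens if taking_text else speech_tokens).extend(tokens[i:i + size])
        i += size
        taking_text = not taking_text
    return text_tokens, speech_tokens

# Example with integer placeholders: 0-12 and 13-25 stand for text, 100-125 for speech.
out = list(range(13)) + list(range(100, 126)) + list(range(13, 26))
a_t, a_s = split_streaming_thoughts(out)
assert a_t == list(range(26)) and a_s == list(range(100, 126))
```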
Overall Latency. The overall response latency for generating the first speech waveform can be calculated as follows:

- Speech Tokenization: The user's speech input is processed in a streaming manner by the speech tokenizer, which operates on blocks of fixed size t_block. Thanks to the streaming design, the tokenizer begins processing immediately and only requires the time to handle the current block, regardless of the total speech duration. Thus, the tokenization latency is T_speech_tokenize = f_speech_tokenize(t_block).
- LLM Prefilling: The number of speech tokens, N_speech_tokens, generated by the tokenizer is based on the length of the user's speech T_user_speech and the frame rate fr = 12.5 tokens per second. The prefill latency for the LLM is given by T_llm_prefill = f_llm_prefill(fr · T_user_speech).
- LLM Decoding: For the initial audio response, the LLM generates 13 text tokens and 10 speech tokens, resulting in a total of N_first_speech = 13 + 10 = 23 tokens. The decoding latency for this step is T_llm_decode = f_llm_decode(N_first_speech).
- Speech Decoding: The N_speech = 10 audio tokens are processed by the speech decoder to generate the first audio chunk. The latency for this step is T_speech_decode = f_speech_decode(N_speech).

The total response latency is then T_total = T_speech_tokenize + T_llm_prefill + T_llm_decode + T_speech_decode.
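As an illustration, the sketch below instantiates the latency decomposition with placeholder timing functions. Only the structure (one block of tokenization, prefill over fr · T_user_speech tokens, 23 first-chunk LLM tokens, 10 decoder tokens) follows the paper; the constants inside the f_* functions are made-up assumptions.

```python
# Worked example of the latency decomposition; the per-stage timing functions use
# made-up constants, only the decomposition itself follows the paper.
FRAME_RATE = 12.5                        # speech tokens per second

def f_speech_tokenize(block_seconds):    # streaming tokenizer: one block of audio
    return 0.05                          # assumed seconds per block (illustrative)

def f_llm_prefill(num_tokens):           # prefill scales with prompt length
    return 0.2 + 0.0005 * num_tokens

def f_llm_decode(num_tokens):            # autoregressive decoding of the first chunk
    return 0.02 * num_tokens

def f_speech_decode(num_tokens):         # flow-matching decoder + vocoder
    return 0.15

def first_chunk_latency(user_speech_seconds, block_seconds=0.8):
    n_first_speech = 13 + 10             # 13 text tokens + 10 speech tokens
    return (f_speech_tokenize(block_seconds)
            + f_llm_prefill(FRAME_RATE * user_speech_seconds)
            + f_llm_decode(n_first_speech)
            + f_speech_decode(10))

print(round(first_chunk_latency(user_speech_seconds=5.0), 3))  # 0.891 under these assumptions
```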
4 Training Procedure

4.1 Stage 1: Joint Speech-Text Pre-training

We adopt the same pre-training data and procedure as in Zeng et al. [45]. The primary objective of this stage is to extend speech modeling ability to the LLM through large-scale speech pre-training. We utilize three types of speech data:

- Interleaved speech-text data: Synthesized from text pre-training data as described in Zeng et al. [45], these datasets facilitate cross-modal knowledge transfer between text and speech.
- Unsupervised speech data: Comprising 700k hours of speech data, this dataset encourages the model to learn from real-world speech.
- Supervised speech-text data: Including both ASR and TTS data, this dataset improves the model's capabilities in basic speech tasks.

We also mix in text pre-training datasets to maintain text performance. The statistics of the training data are shown in Table 2.

Table 2: Statistics of training data.

              #Tokens (Speech)  #Tokens (Text)  Epochs
Speech-Text   455B              279B            0.90
Speech-Only   31B               -               2.10
ASR + TTS     11B               3.5B            2.07
Text-only     -                 10T             0.03

4.1.1 Hyper-parameters

We initialize GLM-4-Voice from GLM-4-9B-Base [16] and expand its vocabulary to include speech tokens. We perform pre-training on 1 trillion tokens, with a fixed sampling ratio of 30% text data, one epoch each of unsupervised speech and supervised speech-text data, and the remainder composed of interleaved speech-text data. The composition of the training corpora is detailed in Table 2. We use the AdamW [27] optimizer with β₁ = 0.9 and β₂ = 0.95. The model is trained with a sequence length of 8192 and a learning rate that linearly decays from 6 × 10⁻⁵ to 6 × 10⁻⁶.
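The following sketch shows one way such a fixed sampling ratio can be realized by drawing each training sequence from a weighted set of sources. The 30% text share is taken from the paper; the split of the remaining probability mass across the speech sources is an illustrative assumption, not the exact schedule used for GLM-4-Voice.

```python
import random

# Illustrative per-source sampling weights: 30% text is from the paper; the split of
# the remaining 70% across speech sources is an assumption for this sketch.
MIXTURE = {
    "text_only":   0.30,
    "interleaved": 0.56,
    "speech_only": 0.07,
    "asr_tts":     0.07,
}

def sample_source(rng=random):
    """Pick the data source for the next training sequence according to MIXTURE."""
    r, acc = rng.random(), 0.0
    for name, weight in MIXTURE.items():
        acc += weight
        if r < acc:
            return name
    return name  # floating-point edge case: fall back to the last source

counts = {k: 0 for k in MIXTURE}
for _ in range(10_000):
    counts[sample_source()] += 1
print(counts)  # roughly proportional to MIXTURE
```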
4.2 Stage 2: Supervised Fine-tuning

4.2.1 Data Construction

To create a human-like spoken chatbot, we utilize the following two types of data:

- Multi-turn conversational spoken dialogues: These dialogues are primarily derived from text-based data, carefully filtered to ensure quality. Code- and math-related content is excluded to focus on conversational material suitable for spoken interactions. Responses are refined by shortening lengthy texts and avoiding outputs unsuitable for verbal delivery. Corresponding speech outputs are synthesized to align with the refined dialogues. To enhance speech input diversity in real-world voice chat scenarios, annotators read and record a variety of speech inputs.
- Speech style-controlled spoken dialogues: This category contains high-quality multi-turn spoken dialogues tailored to specific speech style requirements, such as speed, emotion, or dialect.

4.2.2 Training Details

As described in Section 3.3, we decouple the speech-to-speech task into two subtasks and employ the streaming thoughts template to reduce latency. Each conversational turn consists of a user speech input Qs, the corresponding text input Qt, a text output At, and the corresponding speech output As. We observed differing learning curves for the two subtasks. Specifically, given a user speech input Qs, the model learns the text output At more quickly than the speech output As. To address this discrepancy, we split each training sample into two components: one focuses on learning the text output from the speech input by masking the loss for the speech output, while the other focuses on learning the speech output from both the speech input and text output by masking the loss for the text output.
The model is fine-tuned for 20 epochs on speech output and 4 epochs on text output. The learning rate is gradually reduced from 1 × 10⁻⁵ to 1 × 10⁻⁶. To mitigate overfitting, we apply a weight decay of 0.1, set a dropout rate of 0.5 for hidden layers, and clip gradients to a maximum value of 1.0.

5 Evaluation

5.1 Base Model Evaluation

We evaluate the base model with two speech-text tasks: speech language modeling [5] and spoken question answering [30]. For both tasks we consider two different settings: from speech context to speech generation (denoted as S→S), and from speech context to text generation (denoted as S→T).
For all the tasks we synthesize the contexts and continuations with the multi-speaker TTS API provided by VolcEngine.

Speech Language Modeling. This task evaluates the pretrained model's ability to model interleaved speech and text. The model is given a context and required to select the correct continuation according to the predicted likelihood. We use two datasets proposed by Hassid et al. [17], spoken StoryCloze and spoken Topic-StoryCloze. Both datasets are transformed from the StoryCloze textual benchmark [29]. The spoken Topic-StoryCloze is easier than spoken StoryCloze. The baseline results are taken from Défossez et al. [12].

Table 3: Speech Language Modeling results. Results for Spirit-LM are taken from Nguyen et al. [32] and other results are from Défossez et al. [12].

              Modality  #Params  Topic-StoryCloze  StoryCloze
TWIST         S→S       7B       66.6              53.3
Spirit-LM     S→S       7B       82.9              61.0
Spirit-LM     S→T       7B       88.6              64.6
Moshi         S→S       7B       83.0              60.8
GLM-4-Voice   S→T       9B       93.6              76.3
GLM-4-Voice   S→S       9B       82.9              62.4
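For clarity, the continuation-selection protocol described above can be sketched as follows: score each candidate continuation by the log-likelihood the model assigns to it given the spoken context, and pick the higher-scoring one. The helper below is written against a generic Hugging Face-style causal LM interface and is an assumption for illustration, not the authors' evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def continuation_logprob(model, context_ids, continuation_ids):
    """Sum of log-probabilities the model assigns to the continuation tokens."""
    input_ids = torch.cat([context_ids, continuation_ids]).unsqueeze(0)
    logits = model(input_ids).logits[0]                  # (T, vocab)
    # Logits at position t predict token t+1; keep only the continuation part.
    start = context_ids.numel() - 1
    preds = logits[start : start + continuation_ids.numel()]
    logp = F.log_softmax(preds, dim=-1)
    return logp.gather(1, continuation_ids.unsqueeze(1)).sum().item()

def pick_continuation(model, context_ids, candidates):
    """Return the index of the candidate with the highest likelihood (StoryCloze-style)."""
    scores = [continuation_logprob(model, context_ids, c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)
```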
Spoken Question Answering. Similar to closed-book question answering in NLP, spoken question answering requires the speech language model to answer spoken questions about broad factual knowledge without access to an external knowledge base. We evaluate our model on 3 datasets used in Défossez et al. [12]: Web Questions [4], Llama Questions [30], and TriviaQA [21]. The baseline results are taken from Défossez et al. [12].

Table 4: Spoken Question Answering results. Results for baselines are taken from Défossez et al. [12].

              Modality  #Params  Web Questions  Llama Questions  TriviaQA
TWIST         S→S       7B       1.5            4.0              -
SpeechGPT     S→T       7B       6.5            21.6             14.8
Spectron      S→T       1B       6.1            21.9             -
Moshi         S→T       7B       26.6           62.3             22.8
Moshi         S→S       7B       9.2            21.0             7.3
GLM-4-Voice   S→T       9B       32.2           64.7             39.1
GLM-4-Voice   S→S       9B       15.9           50.7             26.5

Results. The results for speech language modeling are shown in Table 3 and those for spoken question answering are shown in Table 4. We can observe that GLM-4-Voice outperforms baselines on all the evaluated tasks in both S→S and S→T settings, except Topic-StoryCloze in the S→S setting. Compared with Moshi [12], which also supports both speech and text modalities, our model excels in spoken question answering, whether the answers are textual or spoken. Another observation is that the accuracy in the S→T setting is always better than that in the S→S setting, especially for spoken question answering. Therefore textual guidance is still necessary for intelligent speech chatbots. However, our method significantly reduces the gap between spoken answers and textual answers on spoken question answering, especially on Llama Questions, with the potential to develop direct speech-to-speech chatbots.
ASR/TTS. We prompt the base model with the same prompt format used for the ASR/TTS task in pre-training. Whisper-Large-V3 [36] and Paraformer-Large [38] are employed to generate the text prediction for English and Chinese recognition in the TTS task, respectively. Before computing the error rate, the text prediction is normalized with the tokenizer of whisper-large-v3 for the ASR task and with the CosyVoice [14] pipeline for the TTS task. The results are summarized in Table 5. GLM-4-Voice achieves similar ASR and TTS ability compared with the whisper-large-v3 [36] and CosyVoice [14] baselines.

Table 5: ASR and TTS results. LibriSpeech (English) is measured with word error rate (WER) and AISHELL-1 (Chinese) is measured with character error rate (CER). The TTS tasks are measured with WER. We use - to indicate tasks and modalities not supported by the model.

                   LibriSpeech              AISHELL-1  LibriTTS    Seed-TTS
                   test-clean  test-other   test       test-clean  test-en  test-zh
CosyVoice          -           -            -          3.17        3.39     3.10
whisper-large-v3   2.50        4.53         9.31       -           -        -
GLM-4-Voice        2.82        7.66         2.46       5.64        2.91     2.10

5.2 Chat Model Evaluation
ChatGPT Score. To evaluate the question answering ability and knowledge memorization of the fine-tuned chat model, we use GPT-4o [33], specifically gpt-4o-2024-05-13, to evaluate the quality or correctness of the model response. For the General QA task, we adopt the questions from the helpful_base and vicuna subsets of AlpacaEval [25] with math-related questions removed, which follows the chat evaluation dataset of Llama-Omni [15]. We ask GPT-4o to evaluate response quality and score the response in a range from 1 to 10 following the evaluation method of MT-Bench [49]. For the Knowledge task, we select 100 questions from Web Questions, Llama Questions, and TriviaQA. We provide GPT-4o with the ground-truth answer and ask it to judge whether the response of the model is correct. The score reported in Table 6 is the answer accuracy normalized to a scale of 0 (0%) to 10 (100%). All texts used for judging are audio transcriptions produced by Whisper-Large-V3 [36], and the prompts used for scoring are included in Appendix A.1.

Table 6: Chat model evaluation results. The baseline results are taken from Zeng et al. [45].

                 ChatGPT Score: General QA  Knowledge  UTMOS  ASR-WER
SpeechGPT [46]   1.40                       2.20       3.86   66.57
Mini-Omni [42]   2.44                       1.10       3.17   25.28
Llama-Omni [15]  3.50                       3.90       3.92   9.18
Moshi [12]       2.42                       3.60       3.90   7.95
GLM-4-Voice      5.40                       5.20       4.45   5.74

Speech Quality. We use the UTMOS [37] model to predict the mean opinion score (MOS) to evaluate the naturalness of the generated speech.

Speech-Text Alignment. To evaluate the correspondence between the generated text responses and speech responses, we transcribe the speech responses for the General QA task into text with whisper-large-v3 [36]. Then, the word error rate (WER) is calculated between the transcription and the text response, which is referred to as ASR-WER (%) in Table 6. GLM-4-Voice is a bilingual model and sometimes answers an English query with a Chinese response, whose WER cannot be calculated directly. For a fair comparison with the English-only baseline models, we restrict the output of GLM-4-Voice to English tokens when evaluating the tasks reported in Table 6.
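A minimal sketch of the ASR-WER computation is shown below. The transcribe() callable is a hypothetical wrapper around whisper-large-v3 (audio in, text out); the WER itself is computed with the jiwer package, and the text normalization steps are an assumption for illustration rather than the authors' exact pipeline.

```python
import jiwer  # pip install jiwer

def asr_wer(text_responses, speech_responses, transcribe):
    """Average word error rate between each text response and the ASR transcript
    of the corresponding speech response. `transcribe` is a hypothetical callable
    wrapping whisper-large-v3 (audio -> text)."""
    normalize = jiwer.Compose([
        jiwer.ToLowerCase(),
        jiwer.RemovePunctuation(),
        jiwer.RemoveMultipleSpaces(),
        jiwer.Strip(),
    ])
    references = [normalize(t) for t in text_responses]
    hypotheses = [normalize(transcribe(a)) for a in speech_responses]
    return jiwer.wer(references, hypotheses)
```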
6 Conclusion

In this paper, we introduced GLM-4-Voice, an end-to-end spoken chatbot designed for natural and expressive voice interactions. By integrating a 12.5 Hz supervised speech tokenizer, a flow-matching-based speech decoder, and large-scale pre-training on 1 trillion tokens of speech-text data, GLM-4-Voice effectively bridges text and speech modalities. It achieves strong performance across tasks like speech language modeling, ASR, TTS, and spoken question answering. Fine-tuning with high-quality conversational datasets further enhances its ability to generate fluent, low-latency, and nuanced responses. The open availability of GLM-4-Voice encourages further exploration in building practical and accessible spoken AI systems.

References
[1] Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang, Zhangyu Xiao, Zhijie Yan, Yexin Yang, Bin Zhang, Qinglin Zhang, Shiliang Zhang, Nan Zhao, and Siqi Zheng. FunAudioLLM: Voice understanding and generation foundation models for natural interaction between humans and LLMs. CoRR, abs/2407.04051, 2024. URL https://doi.org/10.48550/arXiv.2407.04051.
[2] Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, et al. SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5723-5738, 2022.
[3] Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. Common Voice: A massively-multilingual speech corpus. In Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020, pages 4218-4222. European Language Resources Association, 2020.
[4] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1533-1544. ACL, 2013.
[5] Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matthew Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. AudioLM: A language modeling approach to audio generation. IEEE/ACM Trans. Audio Speech Lang. Process., 31:2523-2533, 2023.
[6] Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng. AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline. In 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment, O-COCOSDA 2017, Seoul, South Korea, November 1-3, 2017, pages 1-5. IEEE, 2017.
[7] Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Zhao You, and Zhiyong Yan. GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. In 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30 - September 3, 2021, pages 3670-3674. ISCA, 2021.
[8] Yi-Chen Chen, Po-Han Chi, Shu-wen Yang, Kai-Wei Chang, Jheng-hao Lin, Sung-Feng Huang, Da-Rong Liu, Chi-Liang Liu, Cheng-Kuang Lee, and Hung-yi Lee. SpeechNet: A universal modularized model for speech processing tasks. arXiv preprint arXiv:2105.03070, 2021.
[9] Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models. CoRR, abs/2311.07919, 2023.
[10] Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. w2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021, Cartagena, Colombia, December 13-17, 2021, pages 244-250. IEEE, 2021.
[11] Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. Trans. Mach. Learn. Res., 2023, 2023.
[12] Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue. Technical report, Kyutai, September 2024. URL http://kyutai.org/Moshi.pdf.
[13] Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Jukebox: A generative model for music. CoRR, abs/2005.00341, 2020.
[14] Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, and Zhijie Yan. CosyVoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens, 2024. URL https://arxiv.org/abs/2407.05407.
[15] Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, and Yang Feng. LLaMA-Omni: Seamless speech interaction with large language models, 2024. URL https://arxiv.org/abs/2409.06666.
[16] Team GLM: Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Jingyu Sun, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, Shuxun Yang, Weng Lam Tam, Wenyi Zhao, Xiao Liu, Xiao Xia, Xiaohan Zhang, Xiaotao Gu, Xin Lv, Xinghan Liu, Xinyi Liu, Xinyue Yang, Xixuan Song, Xunkai Zhang, Yifan An, Yifan Xu, Yilin Niu, Yuantao Yang, Yueyan Li, Yushi Bai, Yuxiao Dong, Zehan Qi, Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, and Zihan Wang. ChatGLM: A family of large language models from GLM-130B to GLM-4 All Tools, 2024. URL https://arxiv.org/abs/2406.12793.
[17] Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Défossez, Gabriel Synnaeve, Emmanuel Dupoux, Roy Schwartz, and Yossi Adi. Textually pretrained speech language models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023, 2023.
[18] Andrew Hines, Jan Skoglund, Anil Kokaram, and Naomi Harte. ViSQOL: an objective speech quality model. EURASIP Journal on Audio, Speech, and Music Processing, 2015(13):1-18, 2015.
[19] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process., 29:3451-3460, 2021.
[20] Shengpeng Ji, Ziyue Jiang, Xize Cheng, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Yang, Ruiqi Li, Ziang Zhang, Xiaoda Yang, Rongjie Huang, Yidi Jiang, Qian Chen, Siqi Zheng, Wen Wang, and Zhou Zhao. WavTokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling. CoRR, abs/2408.16532, 2024.
[21] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1601-1611. Association for Computational Linguistics, 2017.
[22] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 17022-17033. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/c5d736809766d46260d816d8dbc9eb44-Paper.pdf.
[23] Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, and Kundan Kumar. High-fidelity audio compression with improved RVQGAN. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023, 2023.
[24] Kushal Lakhotia, Eugene Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, and Emmanuel Dupoux. On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9:1336-1354, 2021.
[25] Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaEval: An automatic evaluator of instruction-following models, 2023.
[26] Chen-Chou Lo, Szu-Wei Fu, Wen-Chin Huang, Xin Wang, Junichi Yamagishi, Yu Tsao, and Hsin-Min Wang. MOSNet: Deep learning-based objective assessment for voice conversion. In Gernot Kubin and Zdravko Kacic, editors, 20th Annual Conference of the International Speech Communication Association, Interspeech 2019, Graz, Austria, September 15-19, 2019, pages 1541-1545. ISCA, 2019. doi: 10.21437/INTERSPEECH.2019-2003. URL https://doi.org/10.21437/Interspeech.2019-2003.
[27] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URL https://arxiv.org/abs/1711.05101.
[28] Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, and Gustav Eje Henter. Matcha-TTS: A fast TTS architecture with conditional flow matching. In Proc. ICASSP, 2024.
[29] Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James F. Allen. A corpus and evaluation framework for deeper understanding of commonsense stories. CoRR, abs/1604.01696, 2016.
[30] Eliya Nachmani, Alon Levkovitch, Roy Hirsch, Julian Salazar, Chulayuth Asawaroengchai, Soroosh Mariooryad, Ehud Rivlin, R. J. Skerry-Ryan, and Michelle Tadmor Ramanovich. Spoken question answering and speech continuation using spectrogram-powered LLM. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
[31] Tu Anh Nguyen, Wei-Ning Hsu, Antony D'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarandi, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, and Emmanuel Dupoux. Expresso: A benchmark and analysis of discrete expressive speech resynthesis. In Naomi Harte, Julie Carson-Berndsen, and Gareth Jones, editors, 24th Annual Conference of the International Speech Communication Association, Interspeech 2023, Dublin, Ireland, August 20-24, 2023, pages 4823-4827. ISCA, 2023.
[32] Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussà, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Gabriel Synnaeve, Juan Pino, Benoît Sagot, and Emmanuel Dupoux. Spirit-LM: Interleaved spoken and written language model, 2024. URL https://arxiv.org/abs/2402.05755.
[33] OpenAI. Hello GPT-4o, 2024.
[34] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206-5210, 2015. doi: 10.1109/ICASSP.2015.7178964.
[35] Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. MLS: A large-scale multilingual dataset for speech research. In 21st Annual Conference of the International Speech Communication Association, Interspeech 2020, Virtual Event, Shanghai, China, October 25-29, 2020, pages 2757-2761. ISCA, 2020.
[36] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 28492-28518. PMLR, 2023.
[37] Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, and Hiroshi Saruwatari. UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022. Interspeech 2022, 2022.
[38] Xian Shi, Yexin Yang, Zerui Li, and Shiliang Zhang. SeACo-Paraformer: A non-autoregressive ASR system with flexible and effective hotword customization ability. arXiv preprint arXiv:2308.03266 (accepted by ICASSP 2024), 2023.
[39] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. In The 9th ISCA Speech Synthesis Workshop, SSW 2016, Sunnyvale, CA, USA, September 13-15, 2016, page 125. ISCA, 2016.
[40] Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 6306-6315, 2017.
[41] Xiong Wang, Yangze Li, Chaoyou Fu, Lei Xie, Ke Li, Xing Sun, and Long Ma. Freeze-Omni: A smart and low latency speech-to-speech dialogue model with frozen LLM, 2024. URL https://arxiv.org/abs/2411.00774.
[42] Zhifei Xie and Changqiao Wu. Mini-Omni: Language models can hear, talk while thinking in streaming, 2024. URL https://arxiv.org/abs/2408.16725.
[43] Zhuoyuan Yao, Di Wu, Xiong Wang, Binbin Zhang, Fan Yu, Chao Yang, Zhendong Peng, Xiaoyu Chen, Lei Xie, and Xin Lei. WeNet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. In 22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30 - September 3, 2021, pages 4054-4058. ISCA, 2021.
[44] Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. SoundStream: An end-to-end neural audio codec. IEEE/ACM Trans. Audio Speech Lang. Process., 30:495-507, 2022. doi: 10.1109/TASLP.2021.3129994. URL https://doi.org/10.1109/TASLP.2021.3129994.
[45] Aohan Zeng, Zhengxiao Du, Mingdao Liu, Lei Zhang, Shengmin Jiang, Yuxiao Dong, and Jie Tang. Scaling speech-text pre-training with synthetic interleaved data, 2024. URL https://arxiv.org/abs/2411.17607.
[46] Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities, 2023. URL https://arxiv.org/abs/2305.11000.
[47] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona T. Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open pre-trained transformer language models. CoRR, abs/2205.01068, 2022.
[48] Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, and Xipeng Qiu. SpeechTokenizer: Unified speech tokenizer for speech language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
[49] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023. URL https://arxiv.org/abs/2306.05685.

A Appendix

A.1 Prompt for Evaluating Spoken Chatbots

General QA
Instruction:
Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".

[Question]
{instruction}

[The Start of Assistant's Answer]
{response}
[The End of Assistant's Answer]

Knowledge

You will be given a question, the reference answers to that question, and an answer to be judged. Your task is to judge whether the answer to be judged is correct, given the question and reference answers. An answer is considered correct if it expresses or contains the same meaning as at least *one of* the reference answers. The format and the tone of the response do not matter.

You should respond in JSON format. First provide a one-sentence concise analysis for the judgement in field "analysis", then your judgment in field "judgment". For example: {"analysis": ..., "judgment": ...}

# Question
{instruction}

# Reference Answer
{targets}

# Answer To Be Judged
{answer_to_be_judged}