Speech Language Models

5 minute read


With the popularity of language modeling, there have been many advances in speech language models leveraing their in-context learning capability in speech synthesis.

The complexity of modeling speech signls lies in the manipulation and computation of continous-valued waveforms or features. Recent works tackle it with pretrained speech tokenizers, and convert speech into discrete tokens (also called units), and here is a blog about popular speech tokenizers. Techniques used in text language modeling can be easily applied to unit language modeling for speech. Below we introduce some recent works on speech language modelings and their applications. In particular, we list the speech units and training approaches adopted by each work.

Generative Spoken Language Modeling (GSLM)[1]

Speech units: HuBERT units.


GSLM is a multi-stream language model which takes in HuBERT units, duration and pitch features, and predict next token of each stream.


Speech units: EnCodec units.

Training: it uses a decoder-only model to predict target acoustic tokens conditioned on source phoneme sequence, target phoneme sequence and source acoustic tokens. Note that the EnCodec units are from 8 codebooks, VALL-E uses autoregressive decoder to predict the first-codebook tokens, and non-autoregressive deocder to predict the remaining codebooks.

VALL-E X is a multilingual extension of VALLE, which adds source and target language tags to the input and enables voice transfer across different languages.

Application: text-to-speech synthesis with voice preservation.


Speech units:

(1) Semantic units by w2v-BERT features clustered by kmeans model;

(2) Acoustic units from multiple codebooks by SoundStream.

Three-stage training:

(1) Autogressive modeling of semantic units;

(2) Coarse acoustic token prediction. Acoustic tokens are from first few codebooks, and these tokens are flattened to reflect the hierarchical structure. The coarse acoustic token prediction is conditioned on semantic tokens;

(3) Fine acoustic token prediction. Acoustic tokens from the remaining codebooks are predicted conditioned on the coarse acoustic tokens.

Applicaiton: Speech and music continuation.


Speech units: EnCodec units with multiple codebooks.

Training of diffusion model:

(1) Text is firstly encoded by phoneme encoder, and then used for duration and pitch predictor. The outputs are taken as the condition of diffusion model.

(2) Speech is processed by EnCodec model. A diffusion model is trained to predict speech vectors conditioned on the textual information in (1).

(3) Predicted vectors are quantized into acoustic units by the EnCodec quantizer.

Application: text-to-speech synthesis.


Speech Units:

(1) Semantic units from HuBERT model;

(2) Acoustic tokens from SoundStream.

PolyVoice uses multiple language models, one is a translation language model, one is duration LM, and the remaining one is speech synthesis LM which is a variant of VALL-E.


(1) Translation LM. A decoder-only LM is trained to generate target semantic units conditioned on source semantic units.

(2) Duration LM. Its input is the concatenation of source semantic units, target semantic units and source duration values, and the output is predicted target duration values.

(3) Speech synthesis LM. It’s similar to VALL-E in its combination of autoregressive decoder and a non-autoregressive decoder to decode acoustic tokens from multiple codebooks. The main difference is that VALL-E uses phoenemes and PolyVoice replaces phonemes with semantic units.


Similar to VALL-E, VioLA adopts a single autoregressive Transformer decoder-only model and EnCodec units as speech representations. The main difference resides in its multi-task learning invovling different types of data: speech-to-text, text-to-text and text-to-speech.

VioLA model consists of prenet (embedding module for text phonemes and acoustic tokens), decoder-only LM and prediction module (phoneme and acoustic unit prediction).

(1) Prenet: texts are converted to phonemes and encoded with a phoneme encoder. Speech is firstly converted to acoustic tokens by EnCodec’s multi-codebook, and speech embedding is the average of all-layer acoustic embeddings. Note that phoeneme and speech embeddings are further processed by LSTM.

(2) LM: its input contains language id and task id besides semantic embeddings of phonemes and speech. The output of LM could be phoneme sequence and acoustic tokens depending on the task. To enable acoustic token prediction in multiple codebooks, it uses VALL-E-style autoregressive and non-autoregressive prediction.

Application: speech-to-text, text-to-text, text-to-speech, speech-to-speech (cascaded approach built upon its ASR, MT and TTS task).


[1] Nguyen TA, Kharitonov E, Copet J, Adi Y, Hsu WN, Elkahky A, Tomasello P, Algayres R, Sagot B, Mohamed A, Dupoux E. Generative spoken dialogue language modeling. Transactions of the Association for Computational Linguistics. 2023 Mar 27;11:250-66.

[2] Wang C, Chen S, Wu Y, Zhang Z, Zhou L, Liu S, Chen Z, Liu Y, Wang H, Li J, He L. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111. 2023 Jan 5.

[3] Borsos Z, Marinier R, Vincent D, Kharitonov E, Pietquin O, Sharifi M, Roblek D, Teboul O, Grangier D, Tagliasacchi M, Zeghidour N. Audiolm: a language modeling approach to audio generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2023 Jun 21.

[4] Shen K, Ju Z, Tan X, Liu Y, Leng Y, He L, Qin T, Zhao S, Bian J. Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers. arXiv preprint arXiv:2304.09116. 2023 Apr 18.

[5] Dong Q, Huang Z, Xu C, Zhao Y, Wang K, Cheng X, Ko T, Tian Q, Li T, Yue F, Bai Y. PolyVoice: Language Models for Speech to Speech Translation. arXiv preprint arXiv:2306.02982. 2023 Jun 5.

[6] Wang T, Zhou L, Zhang Z, Wu Y, Liu S, Gaur Y, Chen Z, Li J, Wei F. VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation. arXiv preprint arXiv:2305.16107. 2023 May 25.