Speech Tokenizers


A speech tokenizer, as the name suggests, converts a continuous speech waveform into discrete tokens (usually called units). It bridges the gap between speech and text representations and simplifies the manipulation of speech signals.

Semantic Tokenizer

A semantic tokenizer such as HuBERT [1] extracts units that carry semantic information from speech. More generally, pre-trained speech encoders such as w2v-BERT have been found to learn semantic features from input speech. Recent work such as AudioLM [2] applies k-means clustering to w2v-BERT features and reports that the resulting semantic units are also of good quality.
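To make the clustering step concrete, here is a minimal sketch of turning frame-level encoder features into discrete units with k-means. It is not the AudioLM or HuBERT pipeline: the features are random stand-ins for real encoder outputs, and the names `kmeans_fit` and `assign_units`, the feature dimension, and the number of units are all assumptions chosen for illustration.

```python
import numpy as np


def assign_units(features, centroids):
    """Map each frame (row) to the index of its nearest centroid."""
    # Squared Euclidean distances, shape (num_frames, num_units).
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def kmeans_fit(features, num_units, num_iters=20, seed=0):
    """Plain Lloyd's k-means; returns centroids of shape (num_units, dim)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen frames.
    init_idx = rng.choice(len(features), size=num_units, replace=False)
    centroids = features[init_idx].copy()
    for _ in range(num_iters):
        units = assign_units(features, centroids)
        for k in range(num_units):
            members = features[units == k]
            if len(members) > 0:  # leave empty clusters unchanged
                centroids[k] = members.mean(axis=0)
    return centroids


# Stand-in for encoder features: 200 frames, 16 dimensions (hypothetical sizes).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 16))

centroids = kmeans_fit(features, num_units=8)
units = assign_units(features, centroids)  # one integer unit per frame
```

In a real setup the centroids would be fit once on features from a large speech corpus, and `assign_units` would then tokenize any new utterance frame by frame.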

Acoustic Tokenizer

Audio codec models such as EnCodec [3] and SoundStream [4] preserve acoustic information in speech well, including speaker and prosody characteristics. The learned acoustic tokens usually come from multiple codebooks. These models typically use an encoder-decoder architecture consisting of an audio encoder, a residual vector quantizer, and an audio decoder.
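The reason acoustic tokens come from multiple codebooks is easiest to see in the residual vector quantizer itself: each stage quantizes whatever the previous stages failed to capture. The sketch below illustrates that idea only. The codebooks are random rather than learned, the encoder output is a random stand-in, and `rvq_encode`, `rvq_decode`, and all sizes are hypothetical names and values, not the API of EnCodec or SoundStream.

```python
import numpy as np


def rvq_encode(latents, codebooks):
    """Quantize latents of shape (T, dim) with a stack of codebooks.

    Returns integer tokens of shape (num_codebooks, T).
    """
    residual = latents.copy()
    tokens = []
    for codebook in codebooks:
        # Nearest codeword to the current residual, per frame.
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        tokens.append(idx)
        # Pass what is still unexplained on to the next codebook.
        residual = residual - codebook[idx]
    return np.stack(tokens, axis=0)


def rvq_decode(tokens, codebooks):
    """Reconstruct latents by summing the selected codeword from each stage."""
    return sum(codebook[idx] for codebook, idx in zip(codebooks, tokens))


rng = np.random.default_rng(0)
latents = rng.normal(size=(50, 8))  # stand-in for audio encoder output
# Four random codebooks of 32 entries each, with shrinking scale (illustrative).
codebooks = [rng.normal(size=(32, 8)) * (0.5 ** i) for i in range(4)]

tokens = rvq_encode(latents, codebooks)  # shape (4, 50): one row per codebook
recon = rvq_decode(tokens, codebooks)    # shape (50, 8): input to the decoder
```

Each frame is therefore described by one token per codebook, which is why language models over acoustic tokens have to handle several parallel token streams rather than a single sequence.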


[1] Hsu WN, Bolte B, Tsai YH, Lakhotia K, Salakhutdinov R, Mohamed A. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2021 Oct 26;29:3451-60.

[2] Borsos Z, Marinier R, Vincent D, Kharitonov E, Pietquin O, Sharifi M, Roblek D, Teboul O, Grangier D, Tagliasacchi M, Zeghidour N. AudioLM: A language modeling approach to audio generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2023 Jun 21.

[3] Défossez A, Copet J, Synnaeve G, Adi Y. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438. 2022 Oct 24.

[4] Zeghidour N, Luebs A, Omran A, Skoglund J, Tagliasacchi M. SoundStream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2021 Nov 23;30:495-507.