Research
Audiovisual Modeling: Developing audiovisual models in the context of human communication and interaction. My research explores how to integrate speech and visual modalities to improve both the understanding of human interactions and the generation of natural facial expressions and body motions.
- Embodied AI Agents: Modeling the World
- Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset
- AV-Flow: Transforming Text to Audio-Visual Human-like Interactions
Multimodal Language Model: This line of research centers on language models that integrate text, speech and visual signals to advance multimodal understanding and generation. My projects cover modality fusion approaches and multi-task training on large-scale data, with the goal of enabling spoken dialogue generation and cross-modal translation.
- AV-Dialog: Spoken Dialogue Models with Audio-Visual Input
- SeamlessExpressiveLM: Speech Language Model for Expressive Speech-to-Speech Translation with Chain-of-Thought
- Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents
- Investigating Decoder-only Large Language Models for Speech-to-text Translation
- MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation
Cross-modal Translation: This work addresses core challenges of translation across speech and text modalities, with a focus on aligned data mining, massive multilinguality and multi-task training.
- Joint Speech and Text Machine Translation for Up To 100 Languages
- SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations
- Pre-training for Speech Translation: CTC Meets Optimal Transport
- Multilingual Speech-to-Speech Translation into Multiple Target Languages
- T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation
- Textless Speech-to-Speech Translation on Real Data
- Unified Speech-text Pre-training for Speech Translation and Recognition
Multilingual Modeling: This work emphasizes building adaptive model architectures, robust representations and scalable training recipes across diverse languages and domains.
