Research

Audiovisual Modeling: Developing audiovisual models in the context of human communication and interaction. My research explores how to integrate speech and visual modalities to improve the understanding of human interactions and the generation of natural facial expressions and body motions.

Multimodal Language Models: This line of research centers on language models that integrate text, speech, and visual signals to advance multimodal understanding and generation. My projects cover modality fusion approaches and multi-task training on large-scale data, aiming to equip models with capabilities such as spoken dialogue generation and cross-modal translation.

Cross-modal Translation: This line of work addresses core challenges in translation across the speech and text modalities, with deep dives into aligned data mining, massive multilinguality, and multi-task training.

Multilingual Modeling: This work emphasizes building adaptive model architectures, robust representations, and scalable training recipes across diverse languages and domains.