EASL

Multi-Emotion Guided Semantic Disentanglement for Expressive Sign Language Generation

ICASSP 2026

A novel framework for generating emotionally expressive sign language videos from text using multi-emotion guidance and semantic disentanglement.

EASL Overview - Demonstrating multi-emotion sign language generation with confidence scores

Abstract

Large language models have revolutionized sign language generation by automatically transforming text into high-quality sign language videos, providing accessible communication for the Deaf community. However, existing LLM-based approaches prioritize semantic accuracy while overlooking emotional expressions, resulting in outputs that lack naturalness and expressiveness.

We propose EASL (Emotion-Aware Sign Language), a multi-emotion-guided generation architecture for fine-grained emotional integration. We introduce emotion-semantic disentanglement modules with progressive training to separately extract semantic and affective features. During pose decoding, the emotional representations guide semantic interaction to generate sign poses with 7-class emotion confidence scores, enabling emotional expression recognition.

Experimental results demonstrate that EASL achieves pose accuracy superior to all compared baselines by integrating multi-emotion information and effectively adapts to diffusion models to generate expressive sign language videos.

Key Features

🎭 Multi-Emotion Guidance

Generates sign language with 7 emotion classes: Happy, Sad, Angry, Fear, Disgust, Surprise, and Neutral

🔀 Semantic Disentanglement

Separates semantic and emotional representations for better control and expressiveness

📈 Superior Performance

+5.26 BLEU-4 points on PHOENIX14T and +5.63 BLEU-4 points on Prompt2Sign over the strongest compared baselines

🎨 Diffusion Model Integration

Effectively adapts to Stable Diffusion for high-quality video generation
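
One assumed way to realize this step is to render the generated pose keypoints as skeleton images and feed them to an off-the-shelf pose-conditioned Stable Diffusion pipeline such as ControlNet. The sketch below is illustrative only; the document states that EASL adapts to Stable Diffusion but does not specify the rendering pipeline, so the checkpoints, prompt, and file names are assumptions:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline, UniPCMultistepScheduler
from diffusers.utils import load_image

# Pose-conditioned Stable Diffusion (ControlNet) as an assumed rendering backend.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

# One frame: a skeleton image rendered from a generated pose frame (path is hypothetical).
pose_frame = load_image("skeleton_frame_000.png")
frame = pipe(
    "a signer performing sign language with a happy expression",  # prompt can encode the emotion
    image=pose_frame,
    num_inference_steps=20,
).images[0]
frame.save("frame_000.png")
```

In such a setup each generated pose frame would be denoised this way and the resulting images assembled into the output video.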

Methodology

Architecture Overview

EASL is built around two core components:

Figure 1: EASL architecture showing the DESE and EGSID modules with the three-phase progressive training strategy

DESE (Disentangled Emotion-Semantic Encoder)

Uses gated attention mechanisms to separate semantic and emotional representations from input text, extracting:

  • Semantic representations (H): Frame-wise features for sign pose generation
  • Emotional representations (E): Emotion-specific features for expression control

EGSID (Emotion-Guided Semantic Interaction Decoder)

Leverages emotional features to guide semantic decoding through emotion-aware multi-head attention, generating:

  • Pose sequences (P): High-quality keypoint sequences for sign language
  • Emotion confidence scores (Ec): 7-class emotion probabilities per frame
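
The interplay of the two modules can be summarized in a minimal PyTorch sketch. Layer sizes, the gating formulation, and the pose dimensionality are illustrative assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn

NUM_EMOTIONS = 7       # Happy, Sad, Angry, Fear, Disgust, Surprise, Neutral
NUM_KEYPOINTS = 137    # hypothetical keypoint count; not specified above


class DESE(nn.Module):
    """Disentangled Emotion-Semantic Encoder (illustrative sketch).

    A sigmoid gate splits the shared text encoding into a semantic stream H
    and an emotional stream E, mimicking the gated-attention disentanglement
    described above.
    """

    def __init__(self, d_model=512, nhead=8, num_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers)
        self.gate = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())
        self.sem_proj = nn.Linear(d_model, d_model)
        self.emo_proj = nn.Linear(d_model, d_model)

    def forward(self, text_emb):                  # text_emb: (B, T, d_model)
        shared = self.text_encoder(text_emb)
        g = self.gate(shared)                     # gating weights in [0, 1]
        H = self.sem_proj(g * shared)             # semantic representations H
        E = self.emo_proj((1.0 - g) * shared)     # emotional representations E
        return H, E


class EGSID(nn.Module):
    """Emotion-Guided Semantic Interaction Decoder (illustrative sketch).

    Emotion features query the semantic stream through multi-head attention;
    the fused features are decoded into pose sequences P and per-frame
    7-class emotion confidences Ec.
    """

    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.emotion_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.pose_head = nn.Linear(d_model, NUM_KEYPOINTS * 2)   # (x, y) per keypoint
        self.emotion_head = nn.Linear(d_model, NUM_EMOTIONS)

    def forward(self, H, E):
        fused, _ = self.emotion_attn(query=E, key=H, value=H)    # emotion-aware attention
        fused = fused + H                                        # residual semantic path
        P = self.pose_head(fused)                                # pose keypoint sequence P
        Ec = self.emotion_head(fused).softmax(dim=-1)            # per-frame emotion confidences Ec
        return P, Ec
```

In this sketch both streams keep the text-token length; in practice a length regressor or upsampling step would map them to the pose frame rate.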

Three-Phase Progressive Training

1. Semantic Foundation: learn the semantic text-to-pose mapping.

2. Emotion Tone: extract emotional representations while the semantic parameters are frozen.

3. Joint Refinement: fine-tune the decoder for integrated emotion-semantic interaction.
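
A minimal sketch of this freezing schedule, assuming a container module with the dese and egsid attributes from the architecture sketch above and simple MSE / NLL objectives (the paper's exact losses and schedule may differ):

```python
import torch
import torch.nn.functional as F


def set_phase(model, phase):
    """Apply the progressive freezing schedule (assumed interpretation of the
    three phases above)."""
    for p in model.parameters():
        p.requires_grad = True
    if phase == 2:                                  # freeze semantic parameters
        for module in (model.dese.text_encoder, model.dese.sem_proj):
            for p in module.parameters():
                p.requires_grad = False
    elif phase == 3:                                # freeze all of DESE, tune the decoder
        for p in model.dese.parameters():
            p.requires_grad = False


def train_three_phases(model, loader, epochs_per_phase=10, lr=1e-4):
    for phase in (1, 2, 3):
        set_phase(model, phase)
        opt = torch.optim.Adam(
            (p for p in model.parameters() if p.requires_grad), lr=lr)
        for _ in range(epochs_per_phase):
            for text_emb, pose_gt, emo_gt in loader:    # assumed batch layout
                H, E = model.dese(text_emb)
                P, Ec = model.egsid(H, E)
                loss = F.mse_loss(P, pose_gt)           # pose (semantic) objective
                if phase >= 2:                          # add the emotion objective
                    log_Ec = (Ec + 1e-8).log()          # Ec are per-frame probabilities
                    loss = loss + F.nll_loss(log_Ec.transpose(1, 2), emo_gt)
                opt.zero_grad()
                loss.backward()
                opt.step()
```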

Experimental Results

Performance Comparison

PHOENIX14T Dataset

| Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L |
|---|---|---|---|---|---|
| PT-base | 9.47 | 3.37 | 1.47 | 0.59 | 8.88 |
| PT-FP&GN | 13.35 | 7.29 | 5.33 | 4.31 | 13.17 |
| DET | 17.18 | 10.39 | 7.39 | 5.76 | 17.64 |
| GEN-OBT | 23.08 | 14.91 | 10.48 | 8.01 | 23.49 |
| EASL (Ours) | 37.46 | 20.13 | 14.97 | 13.27 | 33.00 |

Prompt2Sign Dataset

| Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L |
|---|---|---|---|---|---|
| NSLP-G | 17.55 | 11.62 | 8.21 | 5.75 | 31.98 |
| Fast-SLP | 39.46 | 23.38 | 17.35 | 12.85 | 46.89 |
| NSA | 41.31 | 25.44 | 18.25 | 13.12 | 47.55 |
| SignLLM-M | 43.40 | 25.72 | 19.08 | 14.13 | 51.57 |
| EASL (Ours) | 50.00 | 32.61 | 24.51 | 19.76 | 47.00 |
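
For reference, BLEU and ROUGE-L are n-gram overlap metrics; in sign language production they are typically computed on text back-translated from the generated poses. A minimal scoring sketch using sacrebleu and rouge_score, which are assumed tooling rather than the papers' exact evaluation scripts:

```python
import sacrebleu
from rouge_score import rouge_scorer

# Hypothetical back-translated outputs and reference transcripts.
hyps = ["morgen regnet es im norden", "am abend wird es kalt"]
refs = ["morgen regnet es im norden stark", "am abend wird es sehr kalt"]

bleu = sacrebleu.corpus_bleu(hyps, [refs])        # corpus-level BLEU (4-gram by default)
print(f"BLEU-4: {bleu.score:.2f}")

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = sum(scorer.score(ref, hyp)["rougeL"].fmeasure
              for ref, hyp in zip(refs, hyps)) / len(refs)
print(f"ROUGE-L: {100 * rouge_l:.2f}")
```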

Ablation Study

Validation of each component's effectiveness on mixed test sets:

| Configuration | BLEU-4 ↑ | ROUGE-L ↑ | MAE ↓ |
|---|---|---|---|
| Full Model | 16.15 | 38.69 | 5.05 |
| w/o Three-Phase Training | 12.60 (-3.55) | 33.63 (-5.06) | 8.49 (+3.44) |
| w/o E_DESE | 14.81 (-1.34) | 35.78 (-2.91) | 6.01 (+0.96) |
| w/o E_EGSID | 14.77 (-1.38) | 36.49 (-2.20) | 6.47 (+1.42) |
| w/o E_DESE & E_EGSID | 13.56 (-2.59) | 34.57 (-4.12) | 7.52 (+2.47) |

Visual Comparison

Figure 2: Visual comparison between original and EASL-generated sign language, showing EASL's ability to produce emotionally expressive poses

Training Analysis

We evaluate emotion-semantic disentanglement effectiveness using cosine similarity between learned representations and BERT embeddings. The figure below illustrates the progressive training process across three phases.

Figure 3: Emotion similarity convergence across training phases, demonstrating effective disentanglement of semantic and emotional representations

During Phase 1, DESE's semantic representation H aligns with BERT semantic embeddings. In Phase 2, the emotion representation E converges toward emotion-specific BERT embeddings. After freezing DESE parameters in Phase 3, the divergence between H and E stabilizes, confirming emotion-semantic disentanglement.
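
A minimal sketch of this similarity probe, assuming a BERT checkpoint, mean pooling, and a learned projection to BERT's hidden width (none of which are specified above):

```python
import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

# Probe for the Figure 3 analysis: compare pooled H / E streams against BERT
# sentence embeddings via cosine similarity.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()
proj = torch.nn.Linear(512, 768)    # maps the 512-d model streams to BERT's 768-d space


def bert_sentence_embedding(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state      # (1, T, 768)
    return hidden.mean(dim=1)                          # mean-pooled sentence embedding


def stream_similarity(stream, reference_emb):
    """Cosine similarity between a pooled representation stream (H or E) and a
    BERT reference embedding."""
    pooled = proj(stream.mean(dim=1))                  # (B, 768)
    return F.cosine_similarity(pooled, reference_emb).mean().item()
```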

Main Contributions

  1. A multi-emotion guided semantic disentanglement framework that decouples semantic and emotional features, effectively adapting to diffusion models for emotionally expressive sign language videos
  2. An emotion-semantic disentanglement architecture with DESE for separating representations and EGSID for emotion-guided semantic interaction
  3. A three-phase progressive training strategy with parameter freezing that achieves integration of emotional expression with sign language generation
  4. State-of-the-art pose accuracy on benchmark datasets with effective multi-emotion integration

Code & Resources

The source code for EASL is available on GitHub:

View on GitHub

Authors

Yanchao Zhao

University of Health and Rehabilitation Sciences

Jihao Zhu

The University of Aberdeen

Yu Liu

Institute of Information Engineering, CAS
University of Chinese Academy of Sciences
(Corresponding Author)

Weizhuo Chen

Institute of Information Engineering, CAS
University of Chinese Academy of Sciences

Yuling Yang

Institute of Information Engineering, CAS
University of Chinese Academy of Sciences

Kun Peng

Institute of Information Engineering, CAS
University of Chinese Academy of Sciences

Acknowledgements

This research is supported by the National Key R&D Program of China (Grant No. 2021YFF0901502).

Citation

@inproceedings{zhao2026easl,
  title={EASL: Multi-Emotion Guided Semantic Disentanglement for Expressive Sign Language Generation},
  author={Zhao, Yanchao and Zhu, Jihao and Liu, Yu and Chen, Weizhuo and Yang, Yuling and Peng, Kun},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2026}
}