A novel framework for generating emotionally expressive sign language videos from text using multi-emotion guidance and semantic disentanglement.
Abstract
Large language models have revolutionized sign language generation by automatically transforming text into high-quality sign language videos, providing accessible communication for the Deaf community. However, existing LLM-based approaches prioritize semantic accuracy while overlooking emotional expressions, resulting in outputs that lack naturalness and expressiveness.
We propose EASL (Emotion-Aware Sign Language), a multi-emotion-guided generation architecture for fine-grained emotional integration. We introduce emotion-semantic disentanglement modules with progressive training to separately extract semantic and affective features. During pose decoding, the emotional representations guide semantic interaction to generate sign poses with 7-class emotion confidence scores, enabling emotional expression recognition.
Experimental results demonstrate that EASL achieves pose accuracy superior to all compared baselines by integrating multi-emotion information and effectively adapts to diffusion models to generate expressive sign language videos.
Key Features
Multi-Emotion Guidance
Generates sign language with 7 emotion classes: Happy, Sad, Angry, Fear, Disgust, Surprise, and Neutral
Semantic Disentanglement
Separates semantic and emotional representations for better control and expressiveness
Superior Performance
BLEU-4 gains of 5.26 points on PHOENIX14T and 5.63 points on Prompt2Sign over the strongest baselines
Diffusion Model Integration
Effectively adapts to Stable Diffusion for high-quality video generation
Methodology
Architecture Overview
EASL is built around two core components:
Figure 1: EASL architecture showing DESE and EGSID modules with three-phase progressive training
DESE (Disentangled Emotion-Semantic Encoder)
Uses gated attention mechanisms to separate semantic and emotional representations from input text, extracting:
- Semantic representations (H): Frame-wise features for sign pose generation
- Emotional representations (E): Emotion-specific features for expression control
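The gated separation can be pictured with a minimal PyTorch sketch. The class and layer names below (`GatedDisentangler`, `sem_gate`, `emo_gate`) are illustrative assumptions rather than the paper's exact implementation:

```python
import torch
import torch.nn as nn

class GatedDisentangler(nn.Module):
    """Sketch of gated feature separation: route encoded text features into a
    semantic stream (H) and an emotional stream (E) via learned sigmoid gates.
    Names and dimensions are assumptions for illustration only."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        self.sem_gate = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())
        self.emo_gate = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())
        self.sem_proj = nn.Linear(d_model, d_model)
        self.emo_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, d_model) text features from the encoder
        h = self.sem_proj(self.sem_gate(x) * x)  # semantic representation H
        e = self.emo_proj(self.emo_gate(x) * x)  # emotional representation E
        return h, e
```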
EGSID (Emotion-Guided Semantic Interaction Decoder)
Leverages emotional features to guide semantic decoding through emotion-aware multi-head attention, generating:
- Pose sequences (P): High-quality keypoint sequences for sign language
- Emotion confidence scores (Ec): 7-class emotion probabilities per frame
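A simplified view of emotion-guided decoding using standard multi-head cross-attention is sketched below; `EmotionGuidedDecoderLayer`, `n_keypoints`, and the choice of query/key roles are assumptions, not the exact EGSID design:

```python
import torch
import torch.nn as nn

class EmotionGuidedDecoderLayer(nn.Module):
    """Sketch: emotional features guide semantic features via cross-attention,
    then two heads predict pose keypoints (P) and 7-class emotion scores (Ec)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8,
                 n_keypoints: int = 150, n_emotions: int = 7):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.pose_head = nn.Linear(d_model, n_keypoints * 2)  # (x, y) per keypoint
        self.emotion_head = nn.Linear(d_model, n_emotions)    # per-frame emotion logits

    def forward(self, h: torch.Tensor, e: torch.Tensor):
        # h, e: (batch, T, d_model) semantic and emotional features
        fused, _ = self.cross_attn(query=h, key=e, value=e)    # emotion guides decoding
        poses = self.pose_head(fused)                           # pose sequence P
        emo_scores = self.emotion_head(fused).softmax(dim=-1)   # emotion confidences Ec
        return poses, emo_scores
```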
Three-Phase Progressive Training
Semantic Foundation
Learn semantic representations from the text-to-pose mapping
Emotion Tone
Extract emotional representations while freezing semantic parameters
Joint Refinement
Fine-tune decoder for integrated emotion-semantic interaction
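One way such a freezing schedule could look in PyTorch is sketched below; the module attributes (`model.dese`, `model.dese.emotion_branch`, `model.egsid`) and loss functions are hypothetical placeholders for the paper's actual components:

```python
import torch

def set_requires_grad(module, flag: bool):
    for p in module.parameters():
        p.requires_grad = flag

def run_phase(model, loader, loss_fn, trainable, epochs, lr=1e-4):
    # Freeze everything, then unfreeze only the modules trained in this phase.
    set_requires_grad(model, False)
    for m in trainable:
        set_requires_grad(m, True)
    optim = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    for _ in range(epochs):
        for batch in loader:
            loss = loss_fn(model, batch)
            optim.zero_grad()
            loss.backward()
            optim.step()

# Phase 1: semantic foundation -- train DESE + EGSID on a pose reconstruction loss
# run_phase(model, loader, pose_loss, trainable=[model.dese, model.egsid], epochs=30)
# Phase 2: emotion tone -- train the emotion branch, semantic parameters frozen
# run_phase(model, loader, emotion_loss, trainable=[model.dese.emotion_branch], epochs=10)
# Phase 3: joint refinement -- freeze DESE, fine-tune the EGSID decoder
# run_phase(model, loader, joint_loss, trainable=[model.egsid], epochs=10)
```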
Experimental Results
Performance Comparison
PHOENIX14T Dataset
| Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L |
|---|---|---|---|---|---|
| PT-base | 9.47 | 3.37 | 1.47 | 0.59 | 8.88 |
| PT-FP&GN | 13.35 | 7.29 | 5.33 | 4.31 | 13.17 |
| DET | 17.18 | 10.39 | 7.39 | 5.76 | 17.64 |
| GEN-OBT | 23.08 | 14.91 | 10.48 | 8.01 | 23.49 |
| EASL (Ours) | 37.46 | 20.13 | 14.97 | 13.27 | 33.00 |
Prompt2Sign Dataset
| Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-L |
|---|---|---|---|---|---|
| NSLP-G | 17.55 | 11.62 | 8.21 | 5.75 | 31.98 |
| Fast-SLP | 39.46 | 23.38 | 17.35 | 12.85 | 46.89 |
| NSA | 41.31 | 25.44 | 18.25 | 13.12 | 47.55 |
| SignLLM-M | 43.40 | 25.72 | 19.08 | 14.13 | 51.57 |
| EASL (Ours) | 50.00 | 32.61 | 24.51 | 19.76 | 47.00 |
Ablation Study
Validation of each component's effectiveness on mixed test sets:
| Configuration | BLEU-4 | ROUGE-L | MAE |
|---|---|---|---|
| Full Model | 16.15 | 38.69 | 5.05 |
| w/o three-phase training | 12.60 (-3.55) | 33.63 (-5.06) | 8.49 (+3.44) |
| w/o E_DESE | 14.81 (-1.34) | 35.78 (-2.91) | 6.01 (+0.96) |
| w/o E_EGSID | 14.77 (-1.38) | 36.49 (-2.20) | 6.47 (+1.42) |
| w/o E_DESE & E_EGSID | 13.56 (-2.59) | 34.57 (-4.12) | 7.52 (+2.47) |
Visual Comparison
Figure 2: Comparison showing EASL's ability to generate emotionally expressive sign language poses
Training Analysis
We evaluate the effectiveness of emotion-semantic disentanglement by measuring the cosine similarity between the learned representations and BERT embeddings. The figure below illustrates the progressive training process across the three phases.
Figure 3: Emotion similarity convergence across training phases, demonstrating effective disentanglement of semantic and emotional representations
During Phase 1, DESE's semantic representation H aligns with BERT semantic embeddings. In Phase 2, the emotion representation E converges toward emotion-specific BERT embeddings. After freezing DESE parameters in Phase 3, the divergence between H and E stabilizes, confirming emotion-semantic disentanglement.
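The similarity curves can be produced by tracking a mean cosine similarity per training epoch; a small sketch, assuming both representations are pooled to vectors of equal dimension (our simplification), is:

```python
import torch
import torch.nn.functional as F

def representation_similarity(rep: torch.Tensor, bert_emb: torch.Tensor) -> float:
    """Mean cosine similarity between learned representations and reference
    BERT embeddings; both tensors are (num_samples, d)."""
    rep = F.normalize(rep, dim=-1)
    bert_emb = F.normalize(bert_emb, dim=-1)
    return (rep * bert_emb).sum(dim=-1).mean().item()

# Logging this value each epoch for H (vs. semantic BERT embeddings) and for E
# (vs. emotion-specific BERT embeddings) yields curves like those in Figure 3.
```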
Main Contributions
- A multi-emotion-guided semantic disentanglement framework that decouples semantic and emotional features and adapts effectively to diffusion models to produce emotionally expressive sign language videos
- An emotion-semantic disentanglement architecture with DESE for separating representations and EGSID for emotion-guided semantic interaction
- A three-phase progressive training strategy with parameter freezing that integrates emotional expression into sign language generation
- State-of-the-art pose accuracy on benchmark datasets with effective multi-emotion integration
Code & Resources
The source code for EASL is available on GitHub:
View on GitHub

Acknowledgements
This research is supported by the National Key R&D Program of China (Grant No. 2021YFF0901502).
Citation
@inproceedings{zhao2026easl,
title={EASL: Multi-Emotion Guided Semantic Disentanglement for Expressive Sign Language Generation},
author={Zhao, Yanchao and Zhu, Jihao and Liu, Yu and Chen, Weizhuo and Yang, Yuling and Peng, Kun},
booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year={2026}
}