Voice Activity Detection (VAD): Challenges and Modern Approaches
Introduction: The Critical Role of VAD
Voice Activity Detection (VAD) is an often underestimated but absolutely critical component of conversational voice agent systems. Its role is to distinguish, in real time, segments that contain speech from segments of silence or noise. This distinction determines:
- **STT efficiency**: avoid transcribing silence or noise
- **Conversational fluidity**: detect when the user has finished speaking
- **Resource economy**: process only relevant segments

This article explores the technical challenges and modern VAD approaches in the context of real-time voice agents.
1. Fundamentals and Metrics
1.1 Formal Definition
VAD is a frame-by-frame binary classification problem:
- **Class 1**: speech
- **Class 0**: non-speech (silence, noise, music)

Typical temporal granularity is 10-30 ms per frame.
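The frame-by-frame formulation above can be made concrete by slicing the signal into fixed-size frames, the unit on which a VAD emits each binary decision. A minimal sketch, assuming 16 kHz mono audio and 20 ms frames (the function and constant names are illustrative, not from any library):

```python
SAMPLE_RATE = 16_000        # samples per second (illustrative assumption)
FRAME_MS = 20               # one decision every 20 ms (within the typical 10-30 ms)
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000   # 320 samples per frame

def frame_signal(samples, frame_len=FRAME_LEN):
    """Split a list of samples into non-overlapping frames (incomplete tail dropped)."""
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]

# One second of audio -> 50 frames of 320 samples, i.e. 50 binary decisions.
frames = frame_signal([0.0] * SAMPLE_RATE)
```

A classifier then maps each frame to class 0 or 1; overlapping frames with a hop smaller than the frame length are also common in practice.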
1.2 Evaluation Metrics
Precision and Recall
- **Precision**: proportion of frames classified as "speech" that are actually speech
- **Recall**: proportion of speech frames correctly detected

Speech Hit Rate (SHR) and False Alarm Rate (FAR)
- **SHR**: correct speech detection rate
- **FAR**: false alarm rate (noise classified as speech)

2. Classic Feature-Based Approaches
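Before looking at detection approaches themselves, here is a quick recap of how the frame-level evaluation metrics above (precision, recall, FAR) fall out of aligned label sequences. A minimal sketch; the function name and the toy labels are illustrative:

```python
def vad_metrics(reference, predicted):
    """Return (precision, recall, false_alarm_rate) over aligned frame labels (1=speech, 0=non-speech)."""
    tp = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 1)
    fp = sum(1 for r, p in zip(reference, predicted) if r == 0 and p == 1)
    fn = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 0)
    tn = sum(1 for r, p in zip(reference, predicted) if r == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0   # recall == speech hit rate
    far = fp / (fp + tn) if fp + tn else 0.0      # non-speech flagged as speech
    return precision, recall, far

ref  = [1, 1, 1, 0, 0, 0]   # ground-truth frame labels (toy example)
pred = [1, 1, 0, 1, 0, 0]   # VAD output
p, r, f = vad_metrics(ref, pred)   # p = 2/3, r = 2/3, far = 1/3
```

Note that recall and SHR measure the same quantity at frame level; FAR is its counterpart on the non-speech frames.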
2.1 Energy Methods
Short-Time Energy (STE)
- Adaptive thresholding on energy
- Advantage: very low latency, minimal computational cost
- Limitation: noise-sensitive, ineffective in noisy environments

Zero-Crossing Rate (ZCR)
- Counts signal sign changes
- Complementary to energy (noise vs voiced speech)
- Limitation: not very discriminative alone

3. Deep Learning Approaches
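Before moving on to learned models, the two classic features above (STE and ZCR) can be combined into a toy detector. A minimal sketch with fixed, illustrative thresholds; as noted above, real systems adapt the energy threshold to an estimate of the noise floor:

```python
import math

def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def classic_vad(frame, energy_threshold=0.01, zcr_threshold=0.5):
    """Flag a frame as speech when energy is high and ZCR is moderate.

    Thresholds are illustrative placeholders, not tuned values.
    """
    return (short_time_energy(frame) > energy_threshold
            and zero_crossing_rate(frame) < zcr_threshold)

silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(320)]  # voiced-like 200 Hz tone
# classic_vad(silence) -> False; classic_vad(tone) -> True
```

This toy version illustrates the limitation listed above: any broadband noise with enough energy and a low ZCR will be flagged as speech.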
3.1 Recurrent Neural Networks
LSTM-based VAD
- Architecture: bidirectional LSTM + dense layer
- Input: feature sequence (MFCC, log-mel spectrogram)
- Output: speech probability per frame

Conclusion
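To sum up the deep learning recipe in code: the sketch below runs a single-direction LSTM cell over per-frame features and emits a speech probability per frame. The weights are random and untrained, and it is unidirectional for brevity (the architecture described above is bidirectional), so this only illustrates the data flow, not a usable model:

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM, HIDDEN = 40, 16   # e.g. 40 log-mel bands per frame (illustrative sizes)

W = rng.normal(scale=0.1, size=(4 * HIDDEN, FEAT_DIM + HIDDEN))  # i, f, o, g gate weights
b = np.zeros(4 * HIDDEN)
w_out = rng.normal(scale=0.1, size=HIDDEN)                       # dense output layer
b_out = 0.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_vad(features):
    """features: (T, FEAT_DIM) array -> (T,) per-frame speech probabilities."""
    h = np.zeros(HIDDEN)   # hidden state
    c = np.zeros(HIDDEN)   # cell state
    probs = []
    for x in features:
        z = W @ np.concatenate([x, h]) + b
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # gated cell-state update
        h = sigmoid(o) * np.tanh(c)
        probs.append(sigmoid(w_out @ h + b_out))       # dense layer + sigmoid
    return np.array(probs)

probs = lstm_vad(rng.normal(size=(50, FEAT_DIM)))  # 50 frames -> 50 probabilities
```

In production, the per-frame probabilities are typically smoothed (hangover/hysteresis) before being turned into binary speech/non-speech decisions.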
Voice Activity Detection, though seemingly simple, remains a major challenge for production conversational voice agents. Modern deep learning approaches have considerably improved robustness and accuracy, but open challenges persist:
- **Ultra-low latency**: < 50 ms for natural conversation
- **Extreme robustness**: function in all acoustic environments
- **Contextual adaptation**: adjust behavior according to conversational context
- **Computational efficiency**: large-scale deployment