Safeguards for synthetic speech: Ethical, technical and legal perspectives
Why is this Special Session so needed?
Speech synthesis has advanced rapidly in the last five years (e.g. X. Chen et al., 2024; Hayashi et al., 2019; Valle, Li, et al., 2020; Valle, Shih, et al., 2020). Some speech synthesis models, such as VALL-E, require merely seconds of speech data to produce a synthesised voice (S. Chen et al., 2024). This new-found accessibility has driven wide adoption, creating opportunities for personalised communication, accessibility applications and creative content generation.
However, these new capabilities have raised significant ethical, legal and social concerns surrounding consent, identity protection, copyright, misuse and authenticity (Burgess et al., 2025). Recent developments have centred the need to establish social, legal and ethical norms for the use of synthesised speech: the successful court case brought by Manfred Lehmann, the German voice actor for Bruce Willis, whose voice was cloned without consent by a YouTuber (Reinholz & Schmidt, 2025); the Ensuring Likeness Voice and Image Security (ELVIS) Act legislated by the US state of Tennessee (Kirkwood, 2025); and industry efforts such as Hugging Face's voice consent gate initiative (Mitchell & Kaffe, 2025).
This special session addresses the critical gap between the capabilities of synthetic speech systems and the ethical and legal frameworks needed to govern and steer them, and it seeks to help establish social norms for their use. The field urgently needs interdisciplinary approaches to consent management, deepfake prevention, watermarking, and other safety mechanisms and safeguarding protocols.
What will this Special Session cover?
The session will explore multiple dimensions of this challenge:
- Technical approaches to consent verification, watermarking, and authentication in TTS systems
- Legal frameworks for personality rights, data protection, and liability in voice synthesis
- Ethical considerations around vulnerable populations, posthumous voice rights, and cultural sensitivities
- Industry perspectives on implementing consent management at scale
- User studies on public perception, trust, and acceptance of voice cloning technologies
The objectives are to: (a) establish shared understanding of current challenges and regulatory landscape; (b) present state-of-the-art technical solutions for consent and safety; (c) foster collaboration between technical, legal, and ethical experts; (d) develop recommendations for best practices and standards; and (e) identify critical research gaps requiring community attention.
Session Format
The special session will combine:
- Keynote presentation on legal, social or ethical implications of voice cloning (30 minutes)
- Oral presentations of peer-reviewed papers (8-10 papers, 12 minutes each)
- Panel discussion with industry representatives, legal experts, and researchers (30 minutes)
How do I submit to Safeguards for synthetic speech: Ethical, technical and legal perspectives?
If the topics in this Special Session are of interest to you, please submit via the general Interspeech Call for Papers process.
Visit the Interspeech 2026 Call for Papers
References
Burgess, J., Carlon, D., & Doyuran, E. B. (2025). Voice AI and authenticity: Current issues and emerging challenges. ARC Centre of Excellence for Automated Decision-Making and Society. https://apo.org.au/node/331920
Chen, S., Liu, S., Zhou, L., Liu, Y., Tan, X., Li, J., Zhao, S., Qian, Y., & Wei, F. (2024). VALL-E 2: Neural codec language models are human parity zero-shot text to speech synthesizers. https://arxiv.org/abs/2406.05370
Chen, X., Wang, X., Zhang, S., He, L., Wu, Z., Wu, X., & Meng, H. (2024). StyleSpeech: Self-supervised style enhancing with VQ-VAE-based pre-training for expressive audiobook speech synthesis. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 12316–12320.
Hayashi, T., Yamamoto, R., Inoue, K., Yoshimura, T., Watanabe, S., Toda, T., Takeda, K., Zhang, Y., & Tan, X. (2019). ESPnet-TTS: Unified, Reproducible, and Integratable Open Source End-to-End Text-to-Speech Toolkit. arXiv Preprint arXiv:1910.10909.
Kirkwood, J. (2025, March 24). Why Tennessee’s ELVIS Act Is the King of Artificial Intelligence Protections. Vanderbilt Law School. https://law.vanderbilt.edu/why-tennessees-elvis-act-is-the-king-of-artificial-intelligence-protections
Mitchell, M., & Kaffe, L.-A. (2025, October 31). Voice Cloning with Consent. https://huggingface.co/blog/voice-consent-gate
Reinholz, F., & Schmidt, R. (2025, October 14). Voice clones by AI in court—Dubbing artist wins at Berlin Regional Court. HÄRTING Rechtsanwälte. https://haerting.de/en/insights/voice-clones-by-ai-in-court-dubbing-artist-wins-at-berlin-regional-court/
Valle, R., Li, J., Prenger, R., & Catanzaro, B. (2020). Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens. ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6189–6193.
Valle, R., Shih, K. J., Prenger, R., & Catanzaro, B. (2020). Flowtron: An autoregressive flow-based generative network for text-to-speech synthesis. International Conference on Learning Representations. https://iclr.cc/virtual/2021/poster/3204