SDML Demo · Deepfake Source Verification

Disentangling Speaker Traits for Deepfake Source Verification via Chebyshev Polynomial and Riemannian Metric Learning

Xi Xuan¹,² Wenxin Zhang³ Zhiyu Li⁴ Jennifer Williams⁵ Ville Hautamäki¹ Tomi Kinnunen¹

¹University of Eastern Finland ²City University of Hong Kong ³University of Chinese Academy of Sciences
⁴University of Science and Technology of China ⁵University of Southampton


§ 1

The Speech Deepfake Source Verification Challenge

Utterance A vs. Utterance B

Similarity score: 0.83 (threshold τ = 0.50; scale from 0 = different to 1 = same)

Prediction: Same Source

Ground truth: Same Source ✓

Both utterances from the same TTS system, same target speaker.

Scores from RiemanSD-AAM (ResNet34). Audio from MLAAD v8.
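The decision rule in this demo, a similarity score compared against τ = 0.50, can be sketched as follows. This is a minimal illustration, not the authors' scoring code: it assumes the score is a cosine similarity between two fixed-length embeddings, and the function names `cosine_score` and `verify_source` are hypothetical. Treating a score exactly equal to τ as "Same Source" is also an assumption.

```python
import numpy as np


def cosine_score(emb_a, emb_b):
    """Cosine similarity between two embedding vectors (assumed scoring function)."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def verify_source(emb_a, emb_b, tau=0.50):
    """Return (score, decision) for a pair of utterance embeddings.

    Scores at or above tau are labelled "Same Source" (tie handling is an assumption).
    """
    score = cosine_score(emb_a, emb_b)
    decision = "Same Source" if score >= tau else "Different Source"
    return score, decision


if __name__ == "__main__":
    # Toy embeddings for illustration only; real embeddings would come from the model.
    score, decision = verify_source([1.0, 0.2, 0.0], [0.9, 0.3, 0.1])
    print(f"{score:.2f} -> {decision}")
```

With this rule, the displayed score of 0.83 falls above τ and yields "Same Source", matching the ground truth shown above.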

§ 2

Model Comparison on Hard Case

Same speaker, different TTS systems

AAM-Softmax

Baseline

0.67 (above threshold): predicts "Same Source"

✗ Wrong

ChebySD-AAM

Ours

0.43 (below threshold): predicts "Different Source"

✓ Correct

EER ↓ 1.39% (P-III)

RiemanSD-AAM

Ours · Best

0.31 (clearly below threshold): predicts "Different Source"

✓ Correct

EER ↓ 3.16% (P-III)

§ 3

Embedding Visualization

Tacotron2-DDC

VITS-neon

suno/bark

XTTS-v2

MeloTTS

FastPitch

Distinct, compact clusters per TTS system confirm effective source separation.

§ 4

Evaluation Protocols

The four protocols cross two axes: source visibility (seen/unseen) and speaker condition (same/different). Each protocol has 27,530 utterances, balanced 1:1. EER and AUC are reported for ResNet34 + RiemanSD-AAM (Table 2).
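For readers unfamiliar with the metric, the equal error rate (EER) is the operating point where the false-rejection rate on same-source trials equals the false-acceptance rate on different-source trials. The sketch below is a generic threshold-sweep approximation, not the evaluation script behind the reported numbers; the function name `compute_eer` and the toy scores are illustrative assumptions.

```python
import numpy as np


def compute_eer(scores, labels):
    """Approximate the EER by sweeping every observed score as a threshold.

    scores: similarity scores, higher means "more likely same source".
    labels: 1 for same-source trials, 0 for different-source trials.
    Returns (eer, threshold) at the point where FNR and FPR are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    target = scores[labels == 1]
    nontarget = scores[labels == 0]

    # Candidate thresholds: each unique score, plus one above the maximum
    # so that the "reject everything" operating point is included.
    thresholds = np.unique(scores)
    thresholds = np.concatenate([thresholds, [thresholds[-1] + 1.0]])

    # A trial is accepted as "same source" when score >= threshold.
    fnr = np.array([(target < t).mean() for t in thresholds])
    fpr = np.array([(nontarget >= t).mean() for t in thresholds])

    idx = int(np.argmin(np.abs(fnr - fpr)))
    return float((fnr[idx] + fpr[idx]) / 2.0), float(thresholds[idx])


if __name__ == "__main__":
    # Toy trial list for illustration only.
    toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
    toy_labels = [1, 1, 1, 0, 0, 0]
    eer, thr = compute_eer(toy_scores, toy_labels)
    print(f"EER = {eer:.3f} at threshold {thr:.2f}")
```

Because the sweep only visits observed scores, the result is an approximation on small trial lists; interpolating between adjacent operating points gives a smoother estimate.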

P-I

Seen Source — Same Speaker

EER: 0.68% · AUC: 0.998

Sample pair

tacotron2-DDC_ph

Same speaker

tacotron2-DDC_ph

Same speaker

P-II

Seen Source — Different Speaker

EER: 1.21% · AUC: 0.996

Sample pair

VITS-neon

Speaker A

VITS-neon

Speaker B

P-III · Hard

Unseen Source — Same Speaker

EER: 4.08% · AUC: 0.988

Hard pair — diff source, same voice

overflow

Same speaker

VITS

Same speaker

P-IV · Hard

Unseen Source — Different Speaker

EER: 7.13% · AUC: 0.972

Hard pair — same source, diff voices

parler_tts_mini_v1

Speaker A

parler_tts_mini_v1

Speaker B

§ 5

Source Generator Audio Samples

12 representative TTS systems from the MLAAD v8 benchmark, grouped by architectural family.

[Each system is shown on the page with waveform and mel spectrogram panels.]

suno/bark
Foundation ~900M
Auto-regressive codec LM · multilingual

xtts_v2
Foundation ~500M
Zero-shot voice cloning · 17 languages

parler_tts_mini_v1
Foundation ~120M
Description-guided · controllable style

Metavoice-1B
Foundation 1B
High-fidelity · fine-tunable voice cloning

WhisperSpeech
Foundation ~350M
Open-source · Whisper encoder backbone

vits--neon (en/ljspeech)
VITS ~28M
End-to-end flow + GAN · English LJSpeech

MeloTTS
VITS-derived ~30M
Multi-accent · multilingual · fast inference

vixTTS
VITS variant ~30M
Expressive VITS-based synthesis

tacotron2-DDC (en/ljspeech)
Tacotron2 ~28M
Seq2seq attention + HiFi-GAN vocoder

fast_pitch (en/ljspeech)
FastPitch ~45M
Non-autoregressive · explicit pitch control

griffin_lim
Traditional vocoder
Phase reconstruction from mel · no neural vocoder

facebook/mms-tts-eng
Meta MMS ~100M
Massively multilingual · 1,000+ languages · VITS-based

§ 6

Citation

@inproceedings{xuan2026sdml,
  title  = {Disentangling Speaker Traits for Deepfake Source Verification via Chebyshev Polynomial and Riemannian Metric Learning},
  author = {Xuan, Xi and Zhang, Wenxin and Li, Zhiyu and Williams, Jennifer and Hautam{\"a}ki, Ville and Kinnunen, Tomi},
  year   = {2026},
  url    = {https://github.com/xxuan-acoustics/RiemannSD-Net},
}