The Core Problem: Why Ears Fail, Why AI Doesn't
Human voice recognition is pattern matching against a stored mental model. Your brain remembers what someone sounds like and compares new audio against that memory. The problem is that AI voice cloning replicates the exact features your brain uses to recognize people — pitch, timbre, speech cadence, accent.
But there are features of genuine human speech that AI voice cloning does not replicate — features that are imperceptible to humans but mathematically detectable. This is why a deepfake phone call can fool your ears but not a properly designed verification system. This is the foundation of AI voice clone detection: finding signals in the audio that distinguish real human speech from synthesized or converted speech.
Vicall's on-device synthetic-audio detection model is trained to find and act on exactly these signals — producing a confidence score in real time that tells you, before any damage is done, whether the voice on the call is real or an AI clone.
Component 1: Synthetic-Audio Detection vs. Human Voice Recognition
It's important to distinguish between two related but different problems:
| Task | Question Asked | Use Case |
|---|---|---|
| Speaker Recognition | "Who is speaking?" | Identifying an unknown speaker from a database of known voices |
| Synthetic-Audio Detection | "Is this voice human or machine-generated?" | Flagging AI-generated speech in live calls — what Vicall does |
Vicall performs synthetic-audio detection, not caller identity matching. The system does not need to know who is calling; it determines whether incoming speech contains machine-generated characteristics during the live call.
Component 2: Signal Features and Liveness
Vicall analyzes a high-dimensional set of acoustic and temporal features that separate human speech from synthesized speech under real phone-call conditions.
The detection model captures signals across multiple levels:
- Spectral features — the frequency distribution of the voice; the formant structure that gives a voice its characteristic sound
- Prosodic features — the rhythm, stress patterns, and intonation contours unique to the speaker
- Cepstral features — Mel-frequency cepstral coefficients (MFCCs) and other representations that capture vocal tract characteristics
- Temporal features — speaking rate, pause patterns, breath placement
- Deep embeddings — learned features from neural networks that capture aspects of voice not easily described analytically
These signals are analyzed continuously during each call on-device. No audio is transmitted to external servers.
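To make the frame-level structure concrete, here is a minimal Swift sketch of per-frame feature extraction over call audio. The function name, frame sizes, and the two toy features (RMS energy and zero-crossing rate) are illustrative assumptions, not Vicall's API; a production pipeline would compute MFCCs, spectral features, and deep embeddings over the same frame-and-hop grid.

```swift
import Foundation

/// A minimal sketch of per-frame acoustic feature extraction.
/// Names and parameter values are illustrative, not Vicall's API.
struct FrameFeatures {
    let rmsEnergy: Float        // loudness proxy
    let zeroCrossingRate: Float // crude spectral brightness proxy
}

func extractFeatures(samples: [Float],
                     sampleRate: Int = 8_000,  // narrowband phone audio
                     frameMs: Int = 25,
                     hopMs: Int = 10) -> [FrameFeatures] {
    let frameLen = sampleRate * frameMs / 1_000
    let hopLen = sampleRate * hopMs / 1_000
    var features: [FrameFeatures] = []
    var start = 0
    while start + frameLen <= samples.count {
        let frame = samples[start ..< start + frameLen]
        // RMS energy over the frame.
        let rms = (frame.reduce(0) { $0 + $1 * $1 } / Float(frameLen)).squareRoot()
        // Zero-crossing rate: fraction of adjacent sample pairs that change sign.
        var crossings = 0
        for i in (start + 1) ..< (start + frameLen)
            where (samples[i - 1] >= 0) != (samples[i] >= 0) {
            crossings += 1
        }
        features.append(FrameFeatures(
            rmsEnergy: rms,
            zeroCrossingRate: Float(crossings) / Float(frameLen - 1)))
        start += hopLen
    }
    return features
}
```

Each frame's feature vector then feeds the detection model described in Components 3 and 4.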
Component 3: Why Voice Clones Get Flagged
An AI voice clone may replicate the features your ears use to recognize a person. But it still leaves synthetic markers that detection models can identify:
Signal Divergence
Even high-quality voice clones differ from human speech in feature space. These divergences are often imperceptible to people but measurable by detection models.
Liveness Detection
Real human speech contains liveness signals that AI-generated audio does not reproduce:
- Micro-variations in breath pressure during speech
- Glottal pulse irregularities (the natural imperfections in how the vocal cords vibrate; a standard measure, jitter, is sketched after this list)
- Microphone interaction patterns from a real person speaking in a real room
- Natural co-articulation — the way sounds blend into each other differently when spoken by a real person vs. synthesized
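As one concrete example of a liveness signal, the sketch below computes local jitter, the classic cycle-to-cycle pitch-period perturbation measure from voice science. The function and its inputs are illustrative assumptions: natural voices exhibit small but nonzero jitter, while overly regular glottal periods can indicate synthesis.

```swift
import Foundation

/// Illustrative only: local jitter as a percentage.
/// Input is a sequence of estimated glottal (pitch) periods in milliseconds.
/// Natural speech shows small nonzero jitter; implausibly regular periods
/// are one signal a liveness model can pick up on.
func localJitterPercent(periodsMs: [Double]) -> Double {
    guard periodsMs.count > 1 else { return 0 }
    // Mean absolute difference between consecutive periods.
    var diffSum = 0.0
    for i in 1 ..< periodsMs.count {
        diffSum += abs(periodsMs[i] - periodsMs[i - 1])
    }
    let meanDiff = diffSum / Double(periodsMs.count - 1)
    let meanPeriod = periodsMs.reduce(0, +) / Double(periodsMs.count)
    return 100 * meanDiff / meanPeriod
}
```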
Double Encoding Artifacts
Real-time voice conversion introduces a double-encoding signature: the audio is first encoded and re-synthesized by the voice cloning model, then compressed a second time by the telephony codec (G.711, Opus, etc.) in transit, producing an artifact pattern that differs from genuine speech sent over the same channel.
Temporal Inconsistencies
Real-time voice conversion operates on a processing buffer, typically 20–200 ms, which introduces subtle artifacts in speech timing. Detection models can identify this conversion signature.
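As a rough intuition for what a timing signature could look like, the hedged sketch below scores how strongly pause durations cluster at multiples of a hypothesized conversion block size, using a circular concentration statistic. This hand-coded heuristic is purely illustrative; the actual detection model learns temporal patterns like this from data rather than from a fixed formula.

```swift
import Foundation

/// Illustrative heuristic only. If a voice converter processes audio in
/// fixed blocks (e.g. 20 ms), segment boundaries tend to quantize toward
/// multiples of that block, which shows up as an unnaturally tight
/// distribution of (duration mod block) across pauses.
func bufferQuantizationScore(pauseDurationsMs: [Double],
                             candidateBlockMs: Double = 20) -> Double {
    guard pauseDurationsMs.count > 1 else { return 0 }
    // Phase of each pause duration within the candidate block, in [0, 1).
    let phases = pauseDurationsMs.map {
        $0.truncatingRemainder(dividingBy: candidateBlockMs) / candidateBlockMs
    }
    // Circular resultant length: 1.0 = perfectly quantized, near 0 = natural spread.
    let sumCos = phases.reduce(0.0) { $0 + cos(2 * .pi * $1) }
    let sumSin = phases.reduce(0.0) { $0 + sin(2 * .pi * $1) }
    let n = Double(phases.count)
    return (sumCos * sumCos + sumSin * sumSin).squareRoot() / n
}
```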
Component 4: On-Device Neural Inference
Running synthetic-audio detection in real time on a live phone call imposes two requirements: very low inference latency and complete audio privacy. Both point to on-device processing.
Vicall uses Apple's Neural Engine — the dedicated machine learning accelerator built into every modern iPhone — and Apple's CoreML framework to run the synthetic-audio detection model. The Neural Engine is designed for exactly this type of real-time, low-latency inference on high-dimensional audio data.
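To show what invoking such a model looks like, here is a hedged Swift sketch of scoring one feature vector with Core ML. The model interface (the "features" input, the "humanProbability" output, the shapes) is assumed for illustration only; the Core ML calls themselves are standard framework API.

```swift
import CoreML

/// A minimal sketch of on-device scoring with Core ML. The feature names
/// and output key are hypothetical; Vicall's model interface is not public.
func scoreFrame(features: [Float], model: MLModel) throws -> Double {
    // Wrap the feature vector in an MLMultiArray the model can consume.
    let input = try MLMultiArray(shape: [1, NSNumber(value: features.count)],
                                 dataType: .float32)
    for (i, v) in features.enumerated() {
        input[i] = NSNumber(value: v)
    }
    let provider = try MLDictionaryFeatureProvider(
        dictionary: ["features": MLFeatureValue(multiArray: input)])
    let output = try model.prediction(from: provider)
    // Hypothetical output: probability that the frame is human speech.
    return output.featureValue(for: "humanProbability")?.doubleValue ?? 0
}

// Loading the model with the Neural Engine enabled:
// let config = MLModelConfiguration()
// config.computeUnits = .all  // lets Core ML schedule work on the Neural Engine
// let model = try MLModel(contentsOf: compiledModelURL, configuration: config)
```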
The benefits of on-device processing:
- Sub-second latency — no network round-trip required; inference happens instantly
- Complete audio privacy — audio is never transmitted to any server; your calls remain private
- No connectivity requirement — detection works even without a data connection
- Resilience — no central server to be attacked, compromised, or go offline
Component 5: Continuous and Adaptive Monitoring
Vicall does not perform a single check at the start of the call and stop; it monitors continuously for the entire duration of the call. This matters because some attacks open with a real human voice and switch to a clone mid-conversation, and high-stakes schemes like the grandparent voice cloning scam depend on sustaining the deception for the whole call. Continuous evaluation lets the model catch the moment synthetic speech appears.
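Below is a minimal Swift sketch of one way mid-call switch detection could work: compare a short recent window of per-frame scores against the call's running baseline and alert on a sharp drop. The window size and threshold are illustrative values, not Vicall's tuning.

```swift
/// A minimal sketch of continuous monitoring. Per-frame scores stream in
/// (1.0 = confidently human, 0.0 = confidently synthetic); we compare a
/// short recent window against the call's running baseline and flag a
/// sharp drop. All parameters are illustrative assumptions.
final class MidCallSwitchDetector {
    private var baselineSum = 0.0
    private var baselineCount = 0.0
    private var recent: [Double] = []
    private let windowSize: Int
    private let dropThreshold: Double

    init(windowSize: Int = 50, dropThreshold: Double = 0.3) {
        self.windowSize = windowSize
        self.dropThreshold = dropThreshold
    }

    /// Feed one frame score; returns true if a mid-call switch is suspected.
    func ingest(score: Double) -> Bool {
        recent.append(score)
        if recent.count > windowSize { recent.removeFirst() }
        // Fold every score seen so far into the call's baseline.
        baselineSum += score
        baselineCount += 1
        guard recent.count == windowSize,
              baselineCount > Double(windowSize) else { return false }
        let recentMean = recent.reduce(0, +) / Double(windowSize)
        let baselineMean = baselineSum / baselineCount
        // Alert when the recent window falls far below the call's baseline.
        return baselineMean - recentMean > dropThreshold
    }
}
```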
The Full Detection Pipeline
Audio capture
Incoming call audio is captured on-device. No recording is transmitted externally. The audio stream is processed in real time by the Vicall analysis pipeline.
Feature extraction
The on-device model extracts acoustic features from incoming audio in real time, including MFCCs, spectral features, and deep embeddings.
Synthetic-marker scoring
The extracted features are scored by a neural synthetic-audio detection model running on the Neural Engine to determine whether speech appears human or machine-generated.
Liveness analysis
Simultaneously, liveness signals are analyzed — detecting artifacts consistent with AI voice synthesis, real-time voice conversion, or replay attacks that would indicate the voice is not from a live human speaker.
Confidence score output
In under 1 second, a live confidence score is displayed. The score updates continuously throughout the call, and a significant mid-call drop in confidence triggers an immediate alert.
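For intuition, here is a minimal sketch of how a displayed confidence score could be smoothed and thresholded: an exponentially weighted moving average keeps single noisy frames from whipsawing the display. The smoothing factor and alert threshold are illustrative assumptions, not Vicall's values.

```swift
/// A minimal sketch of the confidence score shown to the user.
/// Per-frame scores (0 = synthetic, 1 = human) are blended into a
/// smoothed display value so one noisy frame cannot flip the UI.
struct LiveConfidence {
    private(set) var value: Double = 1.0  // start optimistic: assume human
    private let alpha: Double             // weight given to each new frame

    init(alpha: Double = 0.05) { self.alpha = alpha }

    /// Blend one new per-frame score into the displayed value.
    mutating func update(frameScore: Double) {
        value = (1 - alpha) * value + alpha * frameScore
    }

    /// Whether the smoothed score has fallen below an alert threshold.
    var shouldAlert: Bool { value < 0.5 }
}
```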
Frequently Asked Questions
How does AI voice clone detection work?
AI voice clone detection is synthetic-audio detection: an on-device model analyzes incoming caller audio for machine-generated markers that human ears cannot perceive. Vicall extracts acoustic features from the live audio stream, scores them with a neural detection model, and flags synthetic speech in under 1 second. No enrollment or stored profile of the caller is required.
What signals does the detection model analyze?
The model scores a high-dimensional set of acoustic features: pitch and formant structure, prosodic and pause patterns, cepstral coefficients, and deep neural embeddings, together with liveness signals such as breath micro-variations and glottal pulse irregularities. All of this analysis happens on-device, and neither the audio nor the derived features ever leave the phone.
Why does all processing happen on-device?
On-device processing keeps all audio and derived data private: no voice data ever leaves your phone. It also enables sub-second inference latency without network round-trips, works without a data connection, and eliminates any central server that could be compromised. Vicall uses Apple's Neural Engine and the Core ML framework for all inference.
Real-Time Detection.
Zero Cloud.
Vicall's on-device AI detects voice clones in under 1 second. Join the private beta and be among the first to use it.
Private beta · No spam · Founding members only