The Core Problem: Why Ears Fail, Why AI Doesn't
Human voice recognition is pattern matching against a stored mental model. Your brain remembers what someone sounds like and compares new audio against that memory. The problem is that AI voice cloning replicates the exact features your brain uses to recognize people — pitch, timbre, speech cadence, accent.
But there are features of genuine human speech that AI voice cloning does not replicate — features that are imperceptible to humans but mathematically detectable. This is why a deepfake phone call can fool your ears but not a properly designed verification system. This is the foundation of AI voice clone detection: finding signals in the audio that distinguish real human speech from synthesized or converted speech.
Vicall's on-device synthetic-audio detection model is trained to find and act on exactly these signals — producing a confidence score in real time that tells you, before any damage is done, whether the voice on the call is real or an AI clone.
Component 1: Synthetic-Audio Detection vs. Human Voice Recognition
It's important to distinguish between two related but different problems:
| Task | Question Asked | Use Case |
|---|---|---|
| Speaker Recognition | "Who is speaking?" | Identifying an unknown speaker from a database of known voices |
| Synthetic-Audio Detection | "Is this voice human or machine-generated?" | Flagging AI-generated speech in live calls — what Vicall does |
Vicall performs synthetic-audio detection, not caller identity matching. The system does not need to know who is calling; it determines whether incoming speech contains machine-generated characteristics during the live call.
Component 2: Signal Features and Liveness
Vicall analyzes a high-dimensional set of acoustic and temporal features that separate human speech from synthesized speech under real phone-call conditions.
The detection model captures signals across multiple levels:
- Spectral features — the frequency distribution of the voice; the formant structure that gives a voice its characteristic sound
- Prosodic features — the rhythm, stress patterns, and intonation contours unique to the speaker
- Cepstral features — Mel-frequency cepstral coefficients (MFCCs) and other representations that capture vocal tract characteristics
- Temporal features — speaking rate, pause patterns, breath placement
- Deep embeddings — learned features from neural networks that capture aspects of voice not easily described analytically
These signals are analyzed continuously during each call on-device. No audio is transmitted to external servers.
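To make the frame-level structure concrete, here is a minimal Swift sketch of per-frame feature extraction over call audio. The function name, frame sizes, and the two toy features (RMS energy and zero-crossing rate) are illustrative assumptions, not Vicall's API; a production pipeline would compute MFCCs, spectral features, and deep embeddings over the same frame-and-hop grid.

```swift
import Foundation

/// A minimal sketch of per-frame acoustic feature extraction.
/// Names and parameter values are illustrative, not Vicall's API.
struct FrameFeatures {
    let rmsEnergy: Float        // loudness proxy
    let zeroCrossingRate: Float // crude spectral brightness proxy
}

func extractFeatures(samples: [Float],
                     sampleRate: Int = 8_000,  // narrowband phone audio
                     frameMs: Int = 25,
                     hopMs: Int = 10) -> [FrameFeatures] {
    let frameLen = sampleRate * frameMs / 1_000
    let hopLen = sampleRate * hopMs / 1_000
    var features: [FrameFeatures] = []
    var start = 0
    while start + frameLen <= samples.count {
        let frame = samples[start ..< start + frameLen]
        // RMS energy over the frame.
        let rms = (frame.reduce(0) { $0 + $1 * $1 } / Float(frameLen)).squareRoot()
        // Zero-crossing rate: fraction of adjacent sample pairs that change sign.
        var crossings = 0
        for i in (start + 1) ..< (start + frameLen)
            where (samples[i - 1] >= 0) != (samples[i] >= 0) {
            crossings += 1
        }
        features.append(FrameFeatures(
            rmsEnergy: rms,
            zeroCrossingRate: Float(crossings) / Float(frameLen - 1)))
        start += hopLen
    }
    return features
}
```

Each frame's feature vector then feeds the detection model described in Components 3 and 4.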
Component 3: Why Voice Clones Get Flagged
An AI voice clone may replicate the features your ears use to recognize a person. But it still leaves synthetic markers that detection models can identify:
Signal Divergence
Even high-quality voice clones differ from human speech in feature space. These divergences are often imperceptible to people but measurable by detection models.
Liveness Detection
Real human speech contains liveness signals that AI-generated audio does not reproduce:
- Micro-variations in breath pressure during speech
- Glottal pulse irregularities (the natural imperfections in how the vocal cords vibrate; a standard measure, jitter, is sketched after this list)
- Microphone interaction patterns from a real person speaking in a real room
- Natural co-articulation — the way sounds blend into each other differently when spoken by a real person vs. synthesized
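As one concrete example of a liveness signal, the sketch below computes local jitter, the classic cycle-to-cycle pitch-period perturbation measure from voice science. The function and its inputs are illustrative assumptions: natural voices exhibit small but nonzero jitter, while overly regular glottal periods can indicate synthesis.

```swift
import Foundation

/// Illustrative only: local jitter as a percentage.
/// Input is a sequence of estimated glottal (pitch) periods in milliseconds.
/// Natural speech shows small nonzero jitter; implausibly regular periods
/// are one signal a liveness model can pick up on.
func localJitterPercent(periodsMs: [Double]) -> Double {
    guard periodsMs.count > 1 else { return 0 }
    // Mean absolute difference between consecutive periods.
    var diffSum = 0.0
    for i in 1 ..< periodsMs.count {
        diffSum += abs(periodsMs[i] - periodsMs[i - 1])
    }
    let meanDiff = diffSum / Double(periodsMs.count - 1)
    let meanPeriod = periodsMs.reduce(0, +) / Double(periodsMs.count)
    return 100 * meanDiff / meanPeriod
}
```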
Double Encoding Artifacts
Real-time voice conversion introduces a double-encoding signature: the audio is first encoded and re-synthesized by the voice cloning model, then compressed a second time by the telephony codec (G.711, Opus, etc.) in transit, producing an artifact pattern that differs from genuine speech sent over the same channel.
Temporal Inconsistencies
Real-time voice conversion operates on a processing buffer, typically 20–200 ms, which introduces subtle artifacts in speech timing. Detection models can identify this conversion signature.
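As a rough intuition for what a timing signature could look like, the hedged sketch below scores how strongly pause durations cluster at multiples of a hypothesized conversion block size, using a circular concentration statistic. This hand-coded heuristic is purely illustrative; the actual detection model learns temporal patterns like this from data rather than from a fixed formula.

```swift
import Foundation

/// Illustrative heuristic only. If a voice converter processes audio in
/// fixed blocks (e.g. 20 ms), segment boundaries tend to quantize toward
/// multiples of that block, which shows up as an unnaturally tight
/// distribution of (duration mod block) across pauses.
func bufferQuantizationScore(pauseDurationsMs: [Double],
                             candidateBlockMs: Double = 20) -> Double {
    guard pauseDurationsMs.count > 1 else { return 0 }
    // Phase of each pause duration within the candidate block, in [0, 1).
    let phases = pauseDurationsMs.map {
        $0.truncatingRemainder(dividingBy: candidateBlockMs) / candidateBlockMs
    }
    // Circular resultant length: 1.0 = perfectly quantized, near 0 = natural spread.
    let sumCos = phases.reduce(0.0) { $0 + cos(2 * .pi * $1) }
    let sumSin = phases.reduce(0.0) { $0 + sin(2 * .pi * $1) }
    let n = Double(phases.count)
    return (sumCos * sumCos + sumSin * sumSin).squareRoot() / n
}
```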
Component 4: On-Device Neural Inference
Running synthetic-audio detection in real time on a live phone call imposes two requirements: very low inference latency and complete audio privacy. Both point to on-device processing.
Vicall uses Apple's Neural Engine — the dedicated machine learning accelerator built into every modern iPhone — and Apple's CoreML framework to run the synthetic-audio detection model. The Neural Engine is designed for exactly this type of real-time, low-latency inference on high-dimensional audio data.
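To show what invoking such a model looks like, here is a hedged Swift sketch of scoring one feature vector with Core ML. The model interface (the "features" input, the "humanProbability" output, the shapes) is assumed for illustration only; the Core ML calls themselves are standard framework API.

```swift
import CoreML

/// A minimal sketch of on-device scoring with Core ML. The feature names
/// and output key are hypothetical; Vicall's model interface is not public.
func scoreFrame(features: [Float], model: MLModel) throws -> Double {
    // Wrap the feature vector in an MLMultiArray the model can consume.
    let input = try MLMultiArray(shape: [1, NSNumber(value: features.count)],
                                 dataType: .float32)
    for (i, v) in features.enumerated() {
        input[i] = NSNumber(value: v)
    }
    let provider = try MLDictionaryFeatureProvider(
        dictionary: ["features": MLFeatureValue(multiArray: input)])
    let output = try model.prediction(from: provider)
    // Hypothetical output: probability that the frame is human speech.
    return output.featureValue(for: "humanProbability")?.doubleValue ?? 0
}

// Loading the model with the Neural Engine enabled:
// let config = MLModelConfiguration()
// config.computeUnits = .all  // lets Core ML schedule work on the Neural Engine
// let model = try MLModel(contentsOf: compiledModelURL, configuration: config)
```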
The benefits of on-device processing:
- Sub-second latency — no network round-trip required; inference happens instantly
- Complete audio privacy — audio is never transmitted to any server; your calls remain private
- No connectivity requirement — detection works even without a data connection
- Resilience — no central server to be attacked, compromised, or go offline
Component 5: Continuous and Adaptive Monitoring
Vicall does not perform a single check at the start of the call and stop; it monitors continuously for the entire duration of the call. This matters because some attacks open with a real human voice and switch to a clone mid-conversation, and high-stakes schemes like the grandparent voice cloning scam depend on sustaining the deception for the whole call. Continuous evaluation lets the model catch the moment synthetic speech appears.
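Below is a minimal Swift sketch of one way mid-call switch detection could work: compare a short recent window of per-frame scores against the call's running baseline and alert on a sharp drop. The window size and threshold are illustrative values, not Vicall's tuning.

```swift
/// A minimal sketch of continuous monitoring. Per-frame scores stream in
/// (1.0 = confidently human, 0.0 = confidently synthetic); we compare a
/// short recent window against the call's running baseline and flag a
/// sharp drop. All parameters are illustrative assumptions.
final class MidCallSwitchDetector {
    private var baselineSum = 0.0
    private var baselineCount = 0.0
    private var recent: [Double] = []
    private let windowSize: Int
    private let dropThreshold: Double

    init(windowSize: Int = 50, dropThreshold: Double = 0.3) {
        self.windowSize = windowSize
        self.dropThreshold = dropThreshold
    }

    /// Feed one frame score; returns true if a mid-call switch is suspected.
    func ingest(score: Double) -> Bool {
        recent.append(score)
        if recent.count > windowSize { recent.removeFirst() }
        // Fold every score seen so far into the call's baseline.
        baselineSum += score
        baselineCount += 1
        guard recent.count == windowSize,
              baselineCount > Double(windowSize) else { return false }
        let recentMean = recent.reduce(0, +) / Double(windowSize)
        let baselineMean = baselineSum / baselineCount
        // Alert when the recent window falls far below the call's baseline.
        return baselineMean - recentMean > dropThreshold
    }
}
```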
The Full Detection Pipeline
Audio capture
Incoming call audio is captured on-device. No recording is transmitted externally. The audio stream is processed in real time by the Vicall analysis pipeline.
Feature extraction
The on-device model extracts acoustic features from incoming audio in real time, including MFCCs, spectral features, and deep embeddings.
Synthetic-marker scoring
The extracted features are scored by a neural synthetic-audio detection model running on the Neural Engine to determine whether speech appears human or machine-generated.
Liveness analysis
Simultaneously, liveness signals are analyzed — detecting artifacts consistent with AI voice synthesis, real-time voice conversion, or replay attacks that would indicate the voice is not from a live human speaker.
Confidence score output
In under 1 second, a live confidence score is displayed. The score updates continuously throughout the call, and a significant mid-call drop in confidence triggers an immediate alert.
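For intuition, here is a minimal sketch of how a displayed confidence score could be smoothed and thresholded: an exponentially weighted moving average keeps single noisy frames from whipsawing the display. The smoothing factor and alert threshold are illustrative assumptions, not Vicall's values.

```swift
/// A minimal sketch of the confidence score shown to the user.
/// Per-frame scores (0 = synthetic, 1 = human) are blended into a
/// smoothed display value so one noisy frame cannot flip the UI.
struct LiveConfidence {
    private(set) var value: Double = 1.0  // start optimistic: assume human
    private let alpha: Double             // weight given to each new frame

    init(alpha: Double = 0.05) { self.alpha = alpha }

    /// Blend one new per-frame score into the displayed value.
    mutating func update(frameScore: Double) {
        value = (1 - alpha) * value + alpha * frameScore
    }

    /// Whether the smoothed score has fallen below an alert threshold.
    var shouldAlert: Bool { value < 0.5 }
}
```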
Frequently Asked Questions
How does AI voice clone detection work?
AI voice clone detection is synthetic-audio detection: an on-device model analyzes incoming caller audio for machine-generated markers that human ears cannot perceive. Vicall extracts acoustic features from the live audio stream, scores them with a neural detection model, and flags synthetic speech in under 1 second. No enrollment or stored profile of the caller is required.
What signals does the detection model analyze?
The model scores a high-dimensional set of acoustic features: pitch and formant structure, prosodic and pause patterns, cepstral coefficients, and deep neural embeddings, together with liveness signals such as breath micro-variations and glottal pulse irregularities. All of this analysis happens on-device, and neither the audio nor the derived features ever leave the phone.
Why does all processing happen on-device?
On-device processing keeps all audio and derived data private: no voice data ever leaves your phone. It also enables sub-second inference latency without network round-trips, works without a data connection, and eliminates any central server that could be compromised. Vicall uses Apple's Neural Engine and the Core ML framework for all inference.
Real-Time Detection.
Zero Cloud.
Vicall's on-device AI detects voice clones in under 1 second. Join the private beta and be among the first to use it.
Private beta · No spam · Founding members only