The Core Problem: Why Ears Fail, Why AI Doesn't

Human voice recognition is pattern matching against a stored mental model. Your brain remembers what someone sounds like and compares new audio against that memory. The problem is that AI voice cloning replicates the exact features your brain uses to recognize people — pitch, timbre, speech cadence, accent.

But there are features of genuine human speech that AI voice cloning does not replicate — features that are imperceptible to humans but mathematically detectable. This is why a deepfake phone call can fool your ears but not a properly designed verification system. This is the foundation of AI voice clone detection: finding signals in the audio that distinguish real human speech from synthesized or converted speech.

Vicall's on-device synthetic-audio detection model is trained to find and act on exactly these signals — producing a confidence score in real time that tells you, before any damage is done, whether the voice on the call is real or an AI clone.

Component 1: Synthetic-Audio Detection vs. Human Voice Recognition

It's important to distinguish between two related but different problems:

| Task | Question Asked | Use Case |
|---|---|---|
| Speaker recognition | "Who is speaking?" | Identifying an unknown speaker from a database of known voices |
| Synthetic-audio detection | "Is this voice human or machine-generated?" | Flagging AI-generated speech in live calls — what Vicall does |

Vicall performs synthetic-audio detection, not caller identity matching. The system does not need to know who is calling; it determines whether incoming speech contains machine-generated characteristics during the live call.
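The contrast between the two tasks can be sketched in a few lines. Everything below is illustrative: the embeddings are random vectors and the scoring functions are placeholders, not Vicall's model.

```python
import numpy as np

# Toy contrast between the two tasks. The embeddings are random
# vectors and the scoring is a placeholder, not Vicall's model.
rng = np.random.default_rng(42)

def speaker_recognition(embedding, enrolled):
    """Who is speaking? Needs a database of known voices."""
    names = list(enrolled)
    dists = [np.linalg.norm(embedding - enrolled[n]) for n in names]
    return names[int(np.argmin(dists))]

def synthetic_audio_detection(embedding, weights, bias=0.0):
    """Human or machine? Needs no identity, just P(synthetic)."""
    return 1.0 / (1.0 + np.exp(-(embedding @ weights + bias)))

enrolled = {"alice": rng.normal(size=8), "bob": rng.normal(size=8)}
caller = enrolled["alice"] + 0.05 * rng.normal(size=8)

print(speaker_recognition(caller, enrolled))                  # "alice"
print(synthetic_audio_detection(caller, rng.normal(size=8)))  # a probability
```

The point of the sketch: speaker recognition needs an enrolled database and answers with an identity, while synthetic-audio detection answers with a probability and never needs to know who is on the line.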

Component 2: Signal Features and Liveness

Vicall analyzes a high-dimensional set of acoustic and temporal features that separate human speech from synthesized speech under real phone-call conditions.

The detection model captures signals at multiple levels, from frame-level acoustic features to longer-range temporal patterns in how the speech unfolds.

These signals are analyzed continuously during each call on-device. No audio is transmitted to external servers.

Component 3: Why Voice Clones Get Flagged

An AI voice clone may replicate the features your ears use to recognize a person. But it still leaves synthetic markers that detection models can identify:

Signal Divergence

Even high-quality voice clones differ from human speech in feature space. Those divergences are often imperceptible to people but measurable to detection models.
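A toy numpy sketch of the idea, with made-up numbers: each individual feature of the "clone" distribution barely moves, yet the shift accumulated across many feature dimensions is clearly measurable.

```python
import numpy as np

# Made-up numbers: 500 "frames" of 16-dim features for genuine speech,
# and for a clone whose distribution is shifted by a tiny 0.08 per
# feature, far below anything a listener could perceive.
rng = np.random.default_rng(0)

human = rng.normal(loc=0.00, scale=1.0, size=(500, 16))
clone = rng.normal(loc=0.08, scale=1.0, size=(500, 16))

per_feature_shift = np.abs(human.mean(axis=0) - clone.mean(axis=0))
aggregate_shift = float(np.linalg.norm(human.mean(axis=0) - clone.mean(axis=0)))

# Each feature barely moves, but the shift accumulated across all 16
# dimensions is clearly measurable, which is what a trained detector
# learns to exploit.
print(float(per_feature_shift.max()))
print(aggregate_shift)
```

A real detector learns this decision boundary from data rather than measuring mean shifts, but the geometry is the same: small per-feature divergences compound across a high-dimensional feature space.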

Liveness Detection

Real human speech contains liveness signals that AI-generated audio does not reproduce: the cues that distinguish a live human speaker from synthesis, real-time conversion, or replayed audio.

Double Encoding Artifacts

Real-time voice conversion introduces a double encoding signature: the original audio is encoded by the voice cloning model, then transmitted over phone compression (G.711, Opus, etc.), creating a distinctive artifact pattern that differs from genuine speech transmitted over the same channel.
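The cascading effect can be illustrated without modeling a neural vocoder. In the sketch below, two mismatched uniform quantizers stand in for the clone's codec followed by the phone codec; the step sizes (1/100 and 1/128) and the test signal are arbitrary choices for illustration.

```python
import numpy as np

# Sketch of the "double encoding" effect. A neural vocoder is not
# modeled here; two mismatched uniform quantizers stand in for
# (clone codec) -> (phone codec). Cascading them leaves more residual
# distortion than the phone codec alone, a measurable artifact pattern.

def quantize(x, step):
    return np.round(x / step) * step

def snr_db(ref, deg):
    return 10 * np.log10(np.mean(ref ** 2) / np.mean((ref - deg) ** 2))

t = np.arange(8000) / 8000.0
speech = 0.8 * np.sin(2 * np.pi * 220 * t)  # stand-in for speech

phone_only = quantize(speech, 1 / 128)                   # genuine call path
double = quantize(quantize(speech, 1 / 100), 1 / 128)    # clone, then phone

print(round(snr_db(speech, phone_only), 1))
print(round(snr_db(speech, double), 1))   # lower SNR: extra artifacts
```

The two quantization errors are roughly independent, so their variances add: audio that has passed through two codecs carries a higher noise floor than audio that passed through the phone codec alone, even when the two recordings sound identical.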

Temporal Inconsistencies

Real-time voice conversion operates with a buffer, typically 20-200ms, that introduces subtle temporal artifacts in speech timing. Detection models can identify this conversion signature.
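A frame-rate artifact of this kind can be sketched directly. The buffer size (160 samples, 20 ms at 8 kHz) and the per-buffer gain drift below are illustrative assumptions, not measurements of any particular conversion system.

```python
import numpy as np

# Sketch of a buffer-rate artifact. Real-time conversion processes audio
# in fixed buffers (here 160 samples = 20 ms at 8 kHz); tiny per-buffer
# inconsistencies create a periodic pattern at the frame rate that a
# detector can measure. Frame size and gain drift are illustrative.
rng = np.random.default_rng(1)

FRAME = 160
t = np.arange(8000) / 8000.0
clean = np.sin(2 * np.pi * 300 * t)

converted = clean.copy()
for i in range(0, len(converted), FRAME):
    # each buffer gets a slightly different gain, as independent
    # per-buffer processing would introduce
    converted[i:i + FRAME] *= 1.0 + rng.normal(scale=0.01)

def frame_energy_spread(x, frame=FRAME):
    """Relative spread of per-frame energies; near zero for a steady tone."""
    e = np.array([np.mean(x[i:i + frame] ** 2) for i in range(0, len(x), frame)])
    return float(np.std(e) / np.mean(e))

print(frame_energy_spread(clean))       # ~0 for untouched audio
print(frame_energy_spread(converted))   # clearly elevated: frame artifact
```

The gain drift here is far too small to hear, yet the per-frame energy series exposes it immediately, which is the general shape of a temporal-consistency check.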

Component 4: On-Device Neural Inference

Running synthetic-audio detection in real time on a live phone call requires two things: very low latency inference and complete audio privacy. Both require on-device processing.

<1s
Vicall delivers a verification confidence score in under one second after a call connects — before any significant conversation has taken place. On-device inference on Apple's Neural Engine makes this latency possible.

Vicall uses Apple's Neural Engine — the dedicated machine learning accelerator built into every modern iPhone — and Apple's CoreML framework to run the synthetic-audio detection model. The Neural Engine is designed for exactly this type of real-time, low-latency inference on high-dimensional audio data.

The benefits of on-device processing:

- Privacy: no voice data ever leaves the phone
- Latency: sub-second inference with no network round-trips
- Reliability: detection works without a data connection
- Security: no central server of voice data that could be compromised

Component 5: Continuous and Adaptive Monitoring

Vicall does not perform a single check at the start of the call and stop. It monitors continuously throughout the call, an important capability because some attacks open with a real human voice and switch to a clone mid-conversation. High-stakes attacks like the grandparent voice-cloning scam depend on sustaining the deception for the entire call; continuous evaluation catches the mid-call switch when synthetic speech appears after a human start.
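One way continuous monitoring can be sketched: smooth per-frame scores into a running confidence and alert the moment it drops below a threshold. The scores, threshold, and smoothing factor below are made up for illustration; real per-frame scores would come from the detection model.

```python
# Sketch of continuous monitoring with mid-call switch detection.
# Per-frame scores (1.0 = confident human, 0.0 = confident synthetic)
# are hard-coded here to simulate a call that starts with a real
# voice and switches to a clone partway through.

def monitor(frame_scores, alpha=0.3, alert_below=0.5):
    """Exponential moving average of per-frame scores. Returns the
    index of the first frame where confidence drops below the alert
    threshold, or None if the call stays above it throughout."""
    conf = frame_scores[0]
    for i, s in enumerate(frame_scores[1:], start=1):
        conf = alpha * s + (1 - alpha) * conf
        if conf < alert_below:
            return i
    return None

human_part = [0.9, 0.95, 0.92, 0.88, 0.93]   # real voice opens the call
cloned_part = [0.4, 0.2, 0.15, 0.1, 0.1]     # clone takes over

print(monitor(human_part + cloned_part))  # -> 7, alert a few frames after the switch
print(monitor(human_part))                # -> None, no alert on a clean call
```

The smoothing trades a few frames of delay for robustness against single-frame noise; a smaller `alpha` alerts more slowly but flickers less.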

The Full Detection Pipeline

01

Audio capture

Incoming call audio is captured on-device. No recording is transmitted externally. The audio stream is processed in real time by the Vicall analysis pipeline.

02

Feature extraction

The on-device model extracts acoustic features from incoming audio, including MFCCs, spectral features, and deep embeddings, then scores them for synthetic markers in real time.
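A minimal numpy-only sketch of MFCC-style extraction, the first feature family step 02 mentions. The frame size, filter count, and mel formulas are standard textbook choices, not Vicall's actual front end.

```python
import numpy as np

# Minimal MFCC-style feature extraction with numpy only: windowed power
# spectrum -> triangular mel filterbank -> log -> DCT-II. A sketch of
# the kind of features step 02 refers to, not Vicall's actual frontend.

def mel(f):      # Hz -> mel
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):  # mel -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr=8000, n_filters=20, n_coeffs=13):
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    # triangular mel-spaced filterbank
    edges = mel_inv(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    fbank = np.zeros(n_filters)
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        up = np.clip((freqs - lo) / (mid - lo), 0, 1)
        down = np.clip((hi - freqs) / (hi - mid), 0, 1)
        fbank[i] = np.sum(spectrum * np.minimum(up, down))
    logmel = np.log(fbank + 1e-10)
    # DCT-II decorrelates the log filterbank energies
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), (n + 0.5)) / n_filters)
    return basis @ logmel

frame = np.sin(2 * np.pi * 440 * np.arange(200) / 8000.0)  # one 25 ms frame
print(mfcc(frame).shape)  # (13,) coefficients per frame
```

Production systems typically feed features like these, alongside learned deep embeddings, into the scoring model; the sketch shows only the classical front half of that pipeline.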

03

Synthetic-marker scoring

The extracted features are scored by a neural synthetic-audio detection model running on the Neural Engine to determine whether speech appears human or machine-generated.

04

Liveness analysis

Simultaneously, liveness signals are analyzed — detecting artifacts consistent with AI voice synthesis, real-time voice conversion, or replay attacks that would indicate the voice is not from a live human speaker.

05

Confidence score output

Within one second of the call connecting, a live confidence score is displayed. The score updates continuously throughout the call; a significant drop in confidence mid-call triggers an immediate alert.

// FAQ

Frequently Asked Questions

AI voice clone detection analyzes live call audio for synthetic markers: acoustic and temporal artifacts that AI-generated or converted speech leaves behind and that genuine human speech does not contain. Vicall runs a synthetic-audio detection model on-device during the live call to determine whether the incoming voice is human or machine-generated. The detection takes under 1 second.

Vicall's detection model is a neural network that evaluates the acoustic features separating human speech from synthesized speech: pitch behavior, formant structure, prosodic patterns, and deep acoustic embeddings. The model is stored and run entirely on-device. No voiceprint or per-contact profile is required, because the system detects synthetic speech rather than matching caller identity.

On-device processing keeps all audio and synthetic-audio data private — no voice data ever leaves your phone. It also enables sub-second inference latency without network round-trips, works without a data connection, and eliminates any central server that could be compromised. Vicall uses Apple's Neural Engine and CoreML for all inference.

// Vicall

Real-Time Detection.
Zero Cloud.

Vicall's on-device AI detects voice clones in under 1 second. Join the private beta and be among the first to use it.

Private beta · No spam · Founding members only