

David “Prz” Przygoda, CEO of OmniSpeech, recently reached out to share his insights on identifying AI-generated voices, a capability now available in the OmniSpeech AI Detect plug-in for Zoom. I asked David several questions about the company’s product and business model. You can read our extensive interview below.
The Serious Insights Interview with David “Prz” Przygoda
The interview has been lightly edited for grammatical consistency.
What specific failure mode in live video meetings pushed OmniSpeech to ship AI Detect as a Zoom app, rather than a standalone service?
We chose to ship AI Detect as a Zoom app first because the most common and high‑impact threat we were seeing wasn’t archived media — it was fraud and impersonation happening live in synchronous communications. Companies, educators, and governments increasingly rely on Zoom for mission‑critical communications. If a deepfake voice can be introduced live, it can authorize decisions, engineer scams, or undermine trust in real time.
By integrating directly into the Zoom ecosystem, we can run near‑instant detection during actual conversations rather than forcing users to export and upload recordings afterward. The marketplace integration dramatically lowers friction for enterprise adoption because it plugs straight into where people are already communicating.
OmniSpeech describes “signal analysis,” ML, and “source-agnostic detection.” What does that mean in practice: what kinds of synthetic voices does it reliably catch, and what kinds still slip through?
“Signal analysis plus machine learning” means we don’t just look at simple waveform patterns or pitch characteristics; we combine acoustic signal features with deep neural models trained to spot the nuances of synthetic speech signatures that humans can’t hear. “Source-agnostic detection” means the detector isn’t tuned to only one vendor’s voice synthesis model; it is designed to generalize across different generators and voices regardless of how they were created. This adds tremendous value for our customers and improves the efficiency and effectiveness of our tools.
In practice, we catch modern voice clones and AI-synthesized voices that exhibit subtle artifacts or statistical patterns distinct from natural human speech (this is our secret sauce). The detector reliably flags voices generated by common and newly developed text-to-speech or cloning services, and we’re really proud of its performance, even on unseen generators. That said, no detector is perfect, especially if an adversary uses ultra-advanced models with adversarial post-processing or blends synthetic and real audio. Short bursts of synthetic speech are also more difficult to detect, so interpretation should always consider context.

How does OmniSpeech measure accuracy for deepfake voice detection (false positives/false negatives, confidence thresholds), and what does “good enough” look like for an enterprise security team?
We measure accuracy in terms of true positives, false positives, and false negatives for synthetic detection. Within the app, you see confidence levels reflected in the red/yellow/green UI:
🔴 Red = likely deepfake (high confidence the voice is synthetic)
🟡 Yellow = maybe deepfake (lower confidence, but enough signal to warrant escalated caution)
🟢 Green = likely human (high confidence the voice is real)
Confidence thresholds are calibrated per use case to minimize false alarms while still catching real threats. For enterprise security teams, “good enough” means:
- Being able to signal suspicious voices with high enough confidence to trigger procedural checks;
- Ensuring false positives are rare so users don’t ignore alerts; and
- Giving security ops context they can act on (not just binary yes/no), enabling escalation workflows.
In real deployments, that balance is measured both quantitatively during testing and operationally against business‑defined risk tolerance — similar to how security teams evaluate spam filters or malware detection systems.
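To make the thresholding concrete, here is a minimal sketch of how a detector score might map to the red/yellow/green labels. The score range and cutoff values are illustrative assumptions for this sketch, not OmniSpeech’s actual calibration.

```python
# Hypothetical illustration of mapping a detector confidence score to the
# red/yellow/green labels described above. The cutoffs (0.8 and 0.5) are
# invented for this sketch; real thresholds are calibrated per use case.

def classify_scan(synthetic_score: float) -> str:
    """Map a model score in [0, 1] (1.0 = confidently synthetic) to a label."""
    if synthetic_score >= 0.8:
        return "red"      # likely deepfake: trigger procedural checks
    if synthetic_score >= 0.5:
        return "yellow"   # ambiguous: escalate caution, keep a human in the loop
    return "green"        # likely human

print(classify_scan(0.92))  # -> "red"
```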
AI Detect runs 10-second scans and recommends at least ~4 seconds of clear speech. Why those numbers, and how sensitive are the results to crosstalk, interruptions, and fast turn-taking?
We’ve optimized the scan window based on Zoom RTMS and model performance: 10 seconds gives the model enough temporal context to analyze multiple acoustic features and conduct multiple scans reliably. Within that window, a solid ~4 seconds of continuous active speech from a participant tends to give the classifier its most confident assessment.
Short bursts limit what the model can infer. If there’s crosstalk, interruptions, or rapid turn-taking, the model may still analyze segments from each speaker who meets the ~4-second active speech threshold, but precision can dip if too many voices overlap. That’s why we typically recommend initiating scans when a single participant is speaking clearly; the model performs best in those conditions. With Zoom RTMS, each user’s audio feed is separate, so we can actually analyze and provide accurate results for multiple users at the same time.
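As a rough illustration of that gating, the following sketch accumulates active speech per participant inside a 10-second window and only classifies speakers who cross the ~4-second threshold. The frame layout, voice-activity check, and classifier are stand-ins invented for this sketch, and it simplifies by accumulating total (not strictly continuous) active speech.

```python
# Hypothetical sketch of the scan-window gating described above: within a
# 10 s window, only speakers with >= ~4 s of active speech get a verdict.
from collections import defaultdict

SCAN_WINDOW_S = 10.0   # total scan window
MIN_SPEECH_S = 4.0     # active speech needed for a confident result
FRAME_S = 0.02         # 20 ms frames, a common audio frame size

def scan(frames, is_speech, classify):
    """frames: iterable of (speaker_id, samples) covering one 10 s window.
    is_speech: voice-activity stand-in, samples -> bool.
    classify: stand-in model, list[samples] -> score in [0, 1]."""
    speech_time = defaultdict(float)
    buffers = defaultdict(list)
    for speaker_id, samples in frames:
        if is_speech(samples):
            speech_time[speaker_id] += FRAME_S
            buffers[speaker_id].append(samples)
    # Speakers below the threshold would surface as "not enough speech
    # to assess" rather than receiving a low-confidence verdict.
    return {spk: classify(buffers[spk])
            for spk, t in speech_time.items() if t >= MIN_SPEECH_S}
```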
What are the real-world limits in compressed Zoom audio: background noise, aggressive noise suppression, bad mics, speakerphones, music beds, or non-English speech?
Compressed and processed audio, which is what you get on Zoom and almost every other mic-enabled platform, certainly needs to be factored into model design and performance. The more the audio is processed, the more prevalent artifacts become; in some cases, detecting those artifacts actually helps us. The biggest practical limits are:
- Aggressive noise suppression or heavy codec artifacts that mask subtle spectral cues the model uses.
- Poor‑quality microphones or speakerphones that distort harmonics and make voice characteristics less distinct.
- High background music or noise beds that interfere with speech clarity.
Overall, we’ve handled all of these challenges extremely well – a testament to our R&D team and background in best-in-class noise suppression.
Our technology is language-agnostic and works well on almost any language or unfamiliar accent, although uncommon phonetic patterns with limited training representation can marginally reduce confidence. We can also augment performance by expanding training data sets to cover problematic use cases.
None of these factors breaks detection outright, but they can increase ambiguity (i.e., more yellow results) and encourage a human-in-the-loop response.
The app flow includes host permissioning and “one scanner at a time.” What meeting governance model does OmniSpeech expect: security ops, host-only, or any participant as a spot-checker?
We support both host‑driven and participant‑initiated scanning — with controls:
- Only one scan at a time prevents conflicting audio access. If someone initiates a scan while another is in progress, they’ll see a notice asking them to start their scan after the current one finishes. This constraint is specific to the Zoom integration, and we can build integrations where multiple users scan audio simultaneously.
- Non‑host participants can request permission to scan, but the host must approve before their scan proceeds.
This governance model works well for security teams who want central control, but also allows operational flexibility — for example, a meeting facilitator or auditor can run checks when needed with host consent.
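A minimal sketch of the “one scanner at a time” rule plus the host-approval gate might look like the following; the class and method names are invented for illustration, not OmniSpeech’s actual implementation.

```python
# Hypothetical sketch of the governance rules described above: one scan at a
# time, and non-host scans require host approval. All names are invented.
import threading

class ScanCoordinator:
    def __init__(self, host_id: str):
        self.host_id = host_id
        self._lock = threading.Lock()
        self.approved = set()   # participants the host has approved to scan

    def approve(self, participant_id: str) -> None:
        """Host grants a participant permission to scan."""
        self.approved.add(participant_id)

    def try_start_scan(self, requester_id: str) -> str:
        if requester_id != self.host_id and requester_id not in self.approved:
            return "denied: ask the host for scan permission"
        if not self._lock.acquire(blocking=False):
            return "busy: another scan is in progress, try again when it finishes"
        return "scanning"       # caller must call finish_scan() afterward

    def finish_scan(self) -> None:
        self._lock.release()
```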
OmniSpeech states meeting audio is processed in real time and discarded. What data (audio vs. metadata) is accessed during scans, and how does OmniSpeech handle consent, retention, and audit verification?
We do not store or record any meeting audio at any time. During a scan:
- Zoom’s Realtime Media Streams (RTMS) securely deliver the audio for the scan in real time.
- OmniSpeech processes the audio frames, makes a detection decision, and immediately discards the audio once the scan completes.
We only surface detection results (red/yellow/green) — no audio content, transcripts, or recordings are retained by OmniSpeech. Consent is explicitly part of the Zoom permissions workflow, and users see system notices when the app is accessing meeting content. That design aligns with enterprise audit expectations and privacy best practices. We adhere to the strictest enterprise and federal government standards for secure communications.
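The process-and-discard flow can be pictured as a streaming loop that never writes audio to disk. Everything below (the stream interface, the detector, and their methods) is a stand-in for illustration, not Zoom’s RTMS API or OmniSpeech’s actual code.

```python
# Hypothetical sketch of the process-and-discard pipeline described above.
# Audio frames are consumed from the stream, scored, and dropped; only the
# red/yellow/green verdict leaves this function. All names are stand-ins.

def run_scan(rtms_stream, detector, window_s: float = 10.0) -> str:
    frames = []
    for frame in rtms_stream.frames(max_seconds=window_s):  # real-time delivery
        frames.append(frame)          # held in memory only for this scan
    verdict = detector.score(frames)  # e.g., "red" / "yellow" / "green"
    del frames                        # audio discarded once the scan completes
    return verdict                    # no audio, transcript, or recording kept
```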
The results present as red/yellow/green “likely deepfake / maybe / likely human.” What operational practices does OmniSpeech recommend after a red or yellow—pause the meeting, verify identity out-of-band, capture evidence, notify IT/security?
Our job is not to create the SOP for an organization. We are simply a monitoring tool designed to inform users that content may be (or likely is) AI-generated. We generally recommend a tiered response practice based on confidence level.
We recommend organizations err on the side of caution for confirmed or highly suspicious indicators while avoiding disruption for low-risk conversations.
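As one illustration of that tiered practice (not an OmniSpeech-prescribed SOP), a team’s playbook might map verdicts to the actions the question lists, along these lines:

```python
# Hypothetical example of a tiered response playbook keyed off the scan
# verdict. The actions are illustrative; each organization defines its own SOP.
PLAYBOOK = {
    "red":    ["pause sensitive decisions",
               "verify the speaker's identity out-of-band",
               "capture evidence and notify IT/security"],
    "yellow": ["continue with caution",
               "ask a challenge question only the real person could answer"],
    "green":  ["proceed normally"],
}

for action in PLAYBOOK["red"]:
    print(action)
```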
Where does OmniSpeech see the highest near-term risk: executive impersonation inside enterprises, consumer scams, public-sector misinformation, or something else—and what patterns show up in real incidents?
The most immediate risk we see is impersonation fraud in live enterprise and institutional contexts — scenarios where an attacker uses an AI‑generated voice to authorize transactions, access confidential information, or mislead operations. This includes C‑suite impersonations, customer support deception, and credential abuse in high‑stakes meetings. These are not theoretical — deepfake voice scams have already been used in financial fraud and executive impersonations elsewhere in the industry. As the attackers scale, the risk moves across sectors — from corporate to public‑sector misinformation to everyday consumer scams — but live trust authentication is the critical frontier today.
I believe the statistic from last year is that over half of enterprises reported an AI-generated fraud attempt, and these attacks are ramping up exponentially. We’re also seeing an increase in AI-generated content used to hack job interviews, leading to unqualified candidates being hired and creating tremendous risk and expense for organizations.
AI Detect supports real-time and “post-processing” analysis, and plans limit scans per meeting. What’s the business model and roadmap: team licensing, SOC integrations, reporting/export, and broader platform partnerships beyond Zoom (Teams, Webex)?
AI Detect’s business model is subscription‑based through the Zoom Marketplace, with personal and enterprise tiers. We also offer flexible licensing models to larger enterprises and the federal government. In addition to the Zoom integration, AI Detect™ has a flexible API that can be connected to any voice-enabled platform. We have some exciting consumer and enterprise use cases for our technology, which we will deploy early this year. Our ultimate goal is to embed deepfake audio detection into the communications fabric across consumer and enterprise platforms.
About David “Prz” Przygoda

David “Prz” Przygoda has been the CEO of OmniSpeech, a leader in AI voice technology, since 2022. With a track record of scaling innovation at the intersection of AI/ML, audio, and human speech, David brings deep expertise in building brands, forging strategic partnerships across the public and private sectors, and commercializing emerging technologies.
Before joining OmniSpeech, David founded Adventures Consulting, a boutique advisory firm supporting tech executives with go-to-market strategy, marketing, and strategic partnership development. He previously served as Chief Marketing Officer at Antares Audio Technologies, the company behind Auto-Tune, where he led a comprehensive brand transformation, introduced its first direct-to-consumer SaaS product, and launched numerous strategic technology partnerships.
Earlier in his career, David held key roles in marketing, business development, and strategic partnerships at leading audio and entertainment technology companies, including THX, Pandora, and SiriusXM. His work focused on accelerating the adoption of cutting-edge audio technologies in automotive, consumer electronics, and pro audio markets.
David holds a BS in Business from Penn State University and both an MBA and MS in e-Business Technology from the University of Maryland.
For more serious insights on AI, click here.
Did you enjoy the David “Prz” Przygoda interview? If so, like, share, or comment. Thank you!
