How to Build an AI Medical Scribe with Voice Agents

Written by assemblyai | Published 2026/04/13

TL;DR: This article breaks down how to build a production-grade AI medical scribe using a three-stage pipeline: speech-to-text, clinical NLP, and documentation generation. It argues that speech recognition accuracy—not the LLM—is the primary bottleneck, especially in complex clinical environments with specialized vocabulary and overlapping speakers. The piece also covers real-time streaming, SOAP note generation, and HIPAA compliance, highlighting that reliable healthcare AI depends on getting the transcription layer right before anything else.

Physicians spend an average of two hours on documentation for every one hour of patient care. That’s not a workflow problem—it’s a technology problem. And it’s exactly the kind of problem that voice agents are built to solve.


Ambient AI scribes—systems that passively listen to doctor-patient conversations and generate structured clinical notes—have become one of the fastest-growing categories in healthcare AI. The market hit $600 million in 2025 and is accelerating. Products like Nuance DAX, Abridge, and Heidi Health have proven the demand. But many of these solutions are closed platforms with six-figure enterprise contracts, opinionated workflows, and limited customization.


Here’s the thing: the core architecture of an ambient AI scribe isn’t magic. It’s a pipeline. And if you’re a developer building in healthcare, you can construct a competitive scribe with open tools, a streaming speech-to-text API, and an LLM—in a fraction of the time and cost you’d expect.


This article walks through the architecture, the hard technical problems, and the implementation details that separate a demo from a production medical scribe.

The three-stage pipeline behind every ambient scribe

Every ambient scribe—whether it’s a $200K enterprise deployment or a weekend prototype—runs the same fundamental pipeline (for a deeper look at how these components fit together, see the Voice AI stack for building agents):


Stage 1: Speech-to-text. A streaming speech recognition model captures the doctor-patient conversation in real time, converting spoken audio into text. This is the foundation. If the transcript is wrong, everything downstream breaks.


Stage 2: Clinical NLP. A large language model processes the raw transcript and extracts structured clinical meaning—identifying symptoms, medications, diagnoses, procedures, and their relationships. It distinguishes between what the patient reports (subjective) and what the clinician observes (objective).


Stage 3: Documentation generation. The LLM formats the extracted information into standard clinical note templates—most commonly SOAP notes (Subjective, Objective, Assessment, Plan)—ready for EHR integration.

The pipeline sounds straightforward. The difficulty is entirely in the details.
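Before digging into those details, it helps to see the shape of the whole pipeline. A minimal sketch, with placeholder stage functions standing in for the real speech-to-text, LLM extraction, and templating services:

```python
from typing import TypedDict

class SoapNote(TypedDict):
    subjective: str
    objective: str
    assessment: str
    plan: str

def transcribe(audio: bytes) -> str:
    """Stage 1: speech-to-text (placeholder for a streaming STT API)."""
    return "Patient: I've had chest pain for two weeks. Doctor: BP is 140 over 90."

def extract_clinical_facts(transcript: str) -> dict:
    """Stage 2: clinical NLP (placeholder for an LLM extraction call)."""
    return {
        "subjective": "Chest pain for two weeks.",
        "objective": "BP 140/90.",
    }

def generate_soap(facts: dict) -> SoapNote:
    """Stage 3: format extracted facts into a SOAP note template."""
    return SoapNote(
        subjective=facts.get("subjective", ""),
        objective=facts.get("objective", ""),
        assessment=facts.get("assessment", ""),
        plan=facts.get("plan", ""),
    )

note = generate_soap(extract_clinical_facts(transcribe(b"...")))
```

The value of framing it this way is that each stage can be swapped independently: a better STT model, a different LLM, or a specialty-specific note template, without touching the other two.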

Why speech-to-text accuracy is the bottleneck (not the LLM)

Most developers building their first medical scribe focus on the LLM—which prompt template to use, how to structure SOAP notes, which model generates the best clinical language. That’s the wrong place to start.


The single most important factor in medical scribe quality is the accuracy of the speech-to-text layer. If the transcription mishears “metoprolol” as “metroprolol” or drops “10mg” entirely, no amount of prompt engineering will recover that information. The LLM will confidently generate a note with the wrong medication or missing dosage—and in healthcare, that’s not just a bad user experience. It’s a patient safety issue.


Medical conversations are uniquely challenging for speech recognition:


  • Specialized vocabulary. Drug names like “hydroxychloroquine,” “lisinopril,” and “metformin” sound nothing like everyday English. A general-purpose model trained on podcasts and meetings will regularly butcher them.


  • Overlapping speakers. Patients and providers talk over each other, family members interject, and nurses pop in and out. The system needs speaker diarization that correctly attributes who said what.


  • Far-field audio. Ambient scribes record from a phone or tablet across the room—not a close-talking headset mic. Background noise from medical equipment, hallway conversations, and HVAC systems degrades audio quality.


  • Accents and speech patterns. Patients range from elderly individuals speaking softly to non-native speakers mixing languages mid-sentence.


The missed entity rate—the percentage of medical terms the model fails to transcribe correctly—is the metric that separates usable medical scribes from dangerous ones. Recent benchmarks show dramatic differences between providers: some models miss medical entities at rates of 8–24%, while purpose-built medical speech recognition reduces that to under 5%.
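The metric itself is simple to compute once you have a reference list of the medical entities that actually occurred in the conversation. An illustrative helper (the substring matching and the sample entity lists are simplifications for the sketch):

```python
def missed_entity_rate(transcript: str, expected_entities: list[str]) -> float:
    """Fraction of expected medical entities absent from the transcript."""
    text = transcript.lower()
    missed = [e for e in expected_entities if e.lower() not in text]
    return len(missed) / len(expected_entities)

# A transcript that mangled "metoprolol" and dropped the dosage notation:
transcript = "Patient takes metroprolol 10 milligrams and lisinopril daily."
expected = ["metoprolol", "10mg", "lisinopril", "metformin"]
rate = missed_entity_rate(transcript, expected)  # 3 of 4 missed -> 0.75
```

A production evaluation would normalize dosage formats and use fuzzy or ontology-based matching, but the principle is the same: score the model on the terms that carry clinical risk, not on overall word error rate.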

Building the speech-to-text layer for clinical conversations

Let’s get into implementation. Your speech-to-text layer for a medical scribe needs four capabilities that general transcription doesn’t require.

Medical terminology recognition

Standard speech-to-text models aren’t trained on medical corpora at the depth needed for clinical documentation. You need a model—or a model configuration—that’s specifically tuned for medical vocabulary.


One approach is Medical Mode, an add-on available on AssemblyAI’s Universal-3 Pro Streaming model that enhances accuracy for medication names, procedures, conditions, and dosages.


You enable it with a single connection parameter:

import json

CONNECTION_PARAMS = {
    "sample_rate": 16000,
    "speech_model": "u3-rt-pro",
    "domain": "medical-v1",  # Enables Medical Mode
    "speaker_labels": "true",
    "keyterms_prompt": json.dumps(["Lisinopril", "Metformin", "Humalog"])
}

The keyterms_prompt parameter is especially powerful for medical use cases. You can feed in up to 1,000 terms—patient-specific medications, specialty terminology, provider names—and the model biases toward recognizing them correctly. If you know the patient takes “ramipril” and “gliclazide” before the visit starts, passing those terms dramatically reduces transcription errors.
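In practice, the keyterms list can be assembled per visit from the patient's chart before the session opens. A sketch, assuming you have medication, provider, and specialty term lists available (the 1,000-term cap reflects the parameter's documented limit):

```python
import json

def build_keyterms(medications: list[str], providers: list[str],
                   specialty_terms: list[str], limit: int = 1000) -> str:
    """Deduplicate, cap at the keyterms limit, and serialize for the API."""
    seen: dict[str, None] = {}
    for term in medications + providers + specialty_terms:
        seen.setdefault(term, None)  # dict preserves first-seen order
    return json.dumps(list(seen)[:limit])

keyterms = build_keyterms(
    medications=["ramipril", "gliclazide"],
    providers=["Dr. Okafor"],
    specialty_terms=["HbA1c", "ramipril"],  # duplicate is dropped
)
```

The serialized string then goes straight into the `keyterms_prompt` connection parameter shown above.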


Medical Mode reduces missed medical entities by over 20% compared to the base model alone. In benchmark testing:

| Provider   | Model                          | Missed entity rate |
|------------|--------------------------------|--------------------|
| AssemblyAI | Universal-3 Pro + Medical Mode | 3.2%               |
| Deepgram   | Nova-3 Medical                 | 4.7%               |
| Amazon     | Transcribe Medical             | 8.7%               |
| Google     | Medical Conversation           | 24.4%              |

Lower is better. Source: AssemblyAI medical terminology benchmarks.

Speaker diarization with clinical roles

A medical scribe needs to know who said what. “I’ve been having chest pain for two weeks” means something very different if the patient said it versus the doctor quoting a textbook example.


Standard speaker diarization labels speakers as “Speaker A” and “Speaker B.” For clinical notes, you need role-based identification that maps to “Doctor” and “Patient”:

import assemblyai as aai

config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro", "universal-2"],
    speaker_labels=True,
    speakers_expected=2,
    speech_understanding={
        "request": {
            "speaker_identification": {
                "speaker_type": "role",
                "known_values": ["Doctor", "Patient"]
            }
        }
    }
)

This directly feeds into SOAP note generation. Patient utterances map to the Subjective section, while provider observations go into the Objective and Assessment sections.
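Once utterances carry clinical roles, routing them into SOAP source material is a straightforward partition. A minimal sketch; the `(role, text)` tuple shape is an assumption about how you would normalize the API's diarization output:

```python
def partition_by_role(utterances: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Split role-labeled utterances into SOAP source buckets.

    Patient speech seeds the Subjective section; provider speech seeds
    Objective/Assessment. Unrecognized roles are parked for human review.
    """
    buckets: dict[str, list[str]] = {"subjective": [], "objective": [], "review": []}
    for role, text in utterances:
        if role == "Patient":
            buckets["subjective"].append(text)
        elif role == "Doctor":
            buckets["objective"].append(text)
        else:
            buckets["review"].append(text)
    return buckets

buckets = partition_by_role([
    ("Patient", "I've been having chest pain for two weeks."),
    ("Doctor", "Blood pressure is 140 over 90."),
])
```

Keeping a review bucket for unattributed speech is a deliberate safety choice: silently guessing a speaker's role is exactly the kind of error that surfaces later as a wrong clinical note.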

Streaming with sub-second latency

For an ambient scribe, you’re processing audio in real time as the conversation happens. The transcription needs to keep pace with natural speech—which means sub-second latency on the speech-to-text layer.


AssemblyAI’s Universal-3 Pro Streaming model delivers approximately 300ms latency with an immutable transcript architecture: finalized words never change, so your downstream LLM pipeline can begin processing partial results immediately without worrying about corrections invalidating earlier work.


You can also update configuration mid-session—adding new keyterms as the conversation reveals the patient’s medication list, for example—without disconnecting.
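The immutable-transcript property simplifies downstream consumers: once a segment arrives marked final, it can be committed and forwarded without ever being revisited. A sketch of that consumer pattern over a simulated event stream (the event shape here is illustrative, not the API's wire format):

```python
def consume_stream(events) -> str:
    """Accumulate only finalized segments; partials are display-only."""
    committed: list[str] = []
    for event in events:
        if event["is_final"]:
            committed.append(event["text"])  # safe to hand to the LLM now
        # partial events would update a live UI, then be discarded
    return " ".join(committed)

events = [
    {"text": "Patient takes meto", "is_final": False},  # partial, superseded
    {"text": "Patient takes metoprolol", "is_final": True},
    {"text": "25mg twice daily", "is_final": True},
]
transcript = consume_stream(events)
```

Because finalized words never change, the LLM pipeline can start extracting entities from `committed` segments mid-visit instead of waiting for the session to end.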

Context-aware prompting for clinical environments

Beyond keyterms, you can pass natural language prompts (up to 1,500 words) to guide transcription behavior. For a medical scribe, this is invaluable:

{
  "prompt": "Produce a transcript for a clinical history evaluation. It's important to capture medication and dosage accurately. Every disfluency is meaningful data. Include: fillers (um, uh), repetitions, restarts, and stutters."
}

In clinical settings, disfluencies matter. A patient stuttering over a medication name—“glycosi—glycosi—glycoside”—tells the provider they’re uncertain. A scribe that smooths this into “glycoside” loses clinically relevant information.

From transcript to SOAP notes: the LLM layer

Once you have an accurate, speaker-labeled transcript, the LLM layer structures it into clinical documentation. This is where most developers feel more comfortable—but there are healthcare-specific pitfalls to watch for.

Structuring the prompt for SOAP generation

Your LLM needs to produce four distinct sections:

  • Subjective: What the patient reports—symptoms, complaints, history, concerns. Sourced primarily from patient-labeled utterances.
  • Objective: Clinician observations, vital signs, exam findings. Sourced from provider-labeled utterances describing what they observed or measured.
  • Assessment: Diagnoses, clinical impressions, differential diagnoses. This is the clinician’s interpretation.
  • Plan: Treatment recommendations, prescriptions, referrals, follow-up instructions.


A production prompt should explicitly instruct the LLM to only include information that appears in the transcript—never to infer or add details that weren’t discussed. LLM hallucination in clinical notes isn’t a minor inconvenience; it’s a liability.
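That no-inference rule can be encoded directly in the prompt template. A sketch of a prompt builder, assuming a role-labeled transcript string as input (the wording is illustrative, not a clinically validated prompt):

```python
SOAP_TEMPLATE = """You are a clinical documentation assistant.
Generate a SOAP note from the transcript below.

Rules:
- Only include information that appears explicitly in the transcript.
- Never infer diagnoses, dosages, or findings that were not discussed.
- If a section has no supporting content, write "Not discussed."

Transcript:
{transcript}

Output sections: Subjective, Objective, Assessment, Plan."""

def build_soap_prompt(transcript: str) -> str:
    """Wrap a speaker-labeled transcript in the anti-hallucination template."""
    return SOAP_TEMPLATE.format(transcript=transcript)

prompt = build_soap_prompt("Doctor: BP 140/90. Patient: chest pain for two weeks.")
```

The "Not discussed." fallback matters: an LLM instructed to always fill every section will invent content for empty ones, so you give it an explicit escape hatch instead.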

The hybrid approach: streaming + post-visit processing

The most robust medical scribe architectures use a two-pass approach:

  • During the visit: Stream audio through a real-time model (like Universal-3 Pro Streaming with Medical Mode) for live transcription with low latency.
  • After the visit: Run the complete audio through a pre-recorded model (Universal-3 Pro) for maximum accuracy with full context. This produces the final speaker-labeled transcript that feeds into SOAP note generation.


The streaming pass gives clinicians immediate feedback—they can glance at the transcript during the visit to confirm the system is capturing correctly. The post-visit pass produces the definitive transcript used for the actual clinical note.
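The two-pass flow separates the latency-sensitive path from the accuracy-sensitive one. An orchestration sketch with stubbed service calls (the function names are placeholders, not SDK methods):

```python
def stream_transcribe(chunk: bytes) -> str:
    """Placeholder for the streaming STT call (live UI path)."""
    return f"<live:{len(chunk)} bytes>"

def batch_transcribe(audio: bytes) -> str:
    """Placeholder for the post-visit STT call (definitive transcript)."""
    return f"<final:{len(audio)} bytes>"

def run_visit(audio_stream, full_audio: bytes) -> dict:
    """Hybrid scribe flow: live feedback during the visit, definitive note after."""
    # Pass 1: streaming model, ~300ms latency, drives the live transcript view.
    live_segments = [stream_transcribe(chunk) for chunk in audio_stream]

    # Pass 2: batch model over the complete recording, with full context.
    final_transcript = batch_transcribe(full_audio)

    return {"live": live_segments, "note_source": final_transcript}

result = run_visit([b"ab", b"cd"], b"abcd")
```

Only `note_source` feeds SOAP generation; the live segments exist purely so the clinician can confirm capture is working during the encounter.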


AssemblyAI’s LLM Gateway can route the final transcript directly to GPT, Claude, Gemini, or open-source LLMs for structured note generation—all through a single API, without managing separate LLM provider integrations.

Handling HIPAA and security in production

Every healthcare voice application processes Protected Health Information (PHI). You can’t deploy a medical scribe without addressing security and compliance from day one.


Key requirements for a production deployment:

  • Business Associate Agreement (BAA). Any vendor processing PHI needs a BAA in place. AssemblyAI is considered a business associate under HIPAA and offers a Business Associate Addendum to ensure PHI is appropriately safeguarded.
  • Automatic PHI redaction. Your pipeline should support redacting sensitive information—patient names, dates of birth, Social Security numbers—from both text transcripts and audio recordings before storage.
  • Encryption in transit and at rest. TLS for all API connections, encrypted storage for any audio or transcript data you retain.
  • Audit logging. Immutable records of who accessed what data and when. This isn’t optional for HIPAA.
  • Data retention policies. Configure automatic deletion of audio recordings and transcripts after your compliance-required retention period.
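As one concrete example of the redaction requirement, the AssemblyAI Python SDK exposes PII redaction as transcription config options. A hedged config fragment, assuming the SDK's current option and policy names (verify against the latest docs before relying on it for compliance):

```
import assemblyai as aai

# Redact common PHI categories from the transcript (and, optionally,
# from the audio itself) before anything is stored.
config = aai.TranscriptionConfig(
    redact_pii=True,
    redact_pii_audio=True,  # also beep out PHI in the returned audio
    redact_pii_policies=[
        aai.PIIRedactionPolicy.person_name,
        aai.PIIRedactionPolicy.date_of_birth,
        aai.PIIRedactionPolicy.us_social_security_number,
    ],
)
```

Vendor-side redaction is one layer, not the whole answer: you still need encryption, access controls, and retention policies around whatever your application stores.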


AssemblyAI holds SOC 2 Type 2, ISO 27001:2022, and PCI DSS v4.0 certifications—providing independent validation of security controls that healthcare deployments require.

What this means for startups building in healthcare AI

The ambient AI scribe market is large and growing, but the barrier to building a competitive product is lower than most people think. The hardest technical problem—accurate medical speech recognition in noisy, multi-speaker clinical environments—is now available through APIs rather than requiring years of in-house model training.


The real differentiation for medical scribe startups is happening at the application layer (for a deeper dive on build vs. buy tradeoffs, see Building a medical scribe startup in 2026): specialty-specific note templates, EHR integration depth, workflow design for different practice types, and the review experience that makes clinicians trust the output.


If you’re building in this space, start with the speech-to-text foundation. Get the transcript right, and everything else follows. Get it wrong, and no amount of LLM sophistication will save you.

Frequently asked questions

What is an AI medical scribe and how does it work?

An AI medical scribe is software that listens to doctor-patient conversations and automatically generates structured clinical notes like SOAP documentation. It works through a three-stage pipeline: streaming speech-to-text converts the conversation to text in real time, an LLM extracts clinical meaning and identifies medical entities, and a documentation layer formats the output into standard clinical templates for EHR integration.

What is the best speech-to-text API for building AI medical scribes?

The best speech-to-text API for medical scribes needs purpose-built medical terminology recognition, speaker diarization, streaming capability, and security infrastructure with Business Associate Agreement support. AssemblyAI’s Universal-3 Pro Streaming model with Medical Mode achieves a 3.2% missed entity rate on medical terminology—the lowest among major providers, compared to 4.7% for Deepgram, 8.7% for Amazon Transcribe Medical, and 24.4% for Google Medical Conversation.

How do voice agents differ from ambient AI scribes in healthcare?

Voice agents are interactive AI systems that handle two-way conversations—like scheduling appointments or verifying insurance over the phone. Ambient AI scribes are passive listeners that capture existing doctor-patient conversations and generate documentation. Both rely on accurate speech-to-text as their foundation, and increasingly, healthcare platforms combine both capabilities for end-to-end clinical workflow automation.

Can you build an AI medical scribe like Nuance DAX or Abridge?

Yes. Products like Nuance DAX and Abridge are built on the same fundamental pipeline: speech-to-text, clinical NLP, and documentation generation. Using a streaming speech-to-text API with medical terminology support, an LLM for structured note generation, and proper HIPAA infrastructure, a small engineering team can build a functional ambient scribe. The key differentiator is speech recognition accuracy on medical vocabulary—the foundation that determines overall scribe quality.

How do you handle HIPAA compliance when building a medical scribe?

HIPAA compliance for medical scribes requires a Business Associate Agreement with every vendor processing Protected Health Information, encryption of audio and transcripts in transit and at rest, automatic PHI redaction capabilities, role-based access controls, and immutable audit logging. AssemblyAI offers a Business Associate Addendum and holds SOC 2 Type 2, ISO 27001:2022, and PCI DSS v4.0 certifications for healthcare deployments.

How accurate does speech-to-text need to be for clinical documentation?

Medical speech recognition accuracy directly impacts patient safety. A missed or misheard medication name, dosage, or diagnosis can lead to incorrect treatment decisions. Purpose-built medical speech recognition—using features like Medical Mode and keyterms prompting—reduces missed medical entity rates to under 5%, compared to 8–24% for general-purpose models. For clinical documentation, anything above a 5% missed entity rate on medical terminology introduces unacceptable risk.



Written by assemblyai | AssemblyAI builds advanced speech language models that power next-generation voice AI applications.