
To transcribe audio to text, upload your audio file to a speech recognition tool and let the model convert it. Browser-based transcription using the Audio Transcriber runs the full recognition model on your device — the audio file is never sent to a server. This matters for confidential meeting recordings, legal interviews, medical dictation, and any audio containing private information.

The technology behind browser-based ASR (automatic speech recognition) has reached the point where a state-of-the-art model runs locally on consumer hardware, producing accuracy that matches or exceeds many cloud services on clean audio.

Why Private Audio Transcription Needs Local Processing

Most online transcription workflows process audio on remote servers. That architecture is useful for team collaboration, automatic meeting bots, and enterprise integrations, but it is not the right default for every recording.

For a meeting between a company and a prospective acquirer, an interview with a confidential source, a legal consultation, or medical dictation, the audio file itself is sensitive. Uploading it creates a data handling question: where the file is stored, how long it is retained, who can access it, and whether the generated transcript is stored separately.

Browser-based transcription with the Audio Transcriber avoids that upload step. The model runs locally using WebAssembly. The audio file is loaded into browser memory, processed by the model running on your CPU, and the text output is displayed in the browser. Close the tab and the audio is gone. No network request containing your audio is made.

How Automatic Speech Recognition Works

ASR models convert audio waveforms to text using a process that can be thought of in two parts, even when those parts are implemented as a single end-to-end neural network.

The acoustic model component processes the raw audio signal, a sequence of amplitude values at some sample rate. The audio is typically divided into 25 ms frames that advance in 10 ms steps, so consecutive frames overlap by 15 ms, and each frame becomes one column of a mel spectrogram (a frequency representation that approximates how the human ear perceives sound). The acoustic model maps these spectrogram frames to phoneme or subword unit probabilities.
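As a rough illustration of the framing step, here is a minimal Python sketch. It assumes a 16 kHz mono signal and treats the 10 ms figure as the step between frame starts, which is the usual convention. The function names are made up for this example, and the mel formula is the common HTK-style approximation, not necessarily what any particular model uses.

```python
import math

def frame_signal(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a mono signal into overlapping frames (the short tail is dropped)."""
    frame_len = int(sample_rate * frame_ms / 1000)  # 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)      # 160 samples at 16 kHz
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames

def hz_to_mel(freq_hz):
    """Common HTK-style mel-scale approximation."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

one_second = [0.0] * 16000           # one second of silence at 16 kHz
frames = frame_signal(one_second, 16000)
print(len(frames), len(frames[0]))   # number of frames, samples per frame
print(round(hz_to_mel(1000.0)))      # 1000 Hz sits near 1000 mel by design
```

One second of audio yields roughly a hundred frames, which is why even short recordings produce long input sequences for the model.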

The language model component scores candidate word sequences by how plausible they are as language and combines that score with the acoustic probabilities. The word "two" and the word "too" have identical phoneme sequences, so the language model chooses between them based on context. Without the language model component, ASR output is full of homophone errors, number/word confusions, and grammatically implausible sequences.
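To make the idea concrete, here is a toy Python sketch of context-based homophone selection. The bigram counts are invented for illustration, and real systems use neural language models or much larger statistics, but the principle of scoring each candidate in context is the same.

```python
# Toy bigram counts, invented for this example only.
BIGRAM_COUNTS = {
    ("buy", "two"): 30,
    ("two", "tickets"): 40,
    ("too", "tickets"): 1,
    ("me", "too"): 50,
    ("me", "two"): 2,
}

def sequence_score(words):
    """Multiply (count + 1) over adjacent word pairs; unseen pairs count as 1."""
    score = 1
    for pair in zip(words, words[1:]):
        score *= BIGRAM_COUNTS.get(pair, 0) + 1
    return score

def pick_homophone(before, candidates, after):
    """Return the candidate that makes the surrounding words most plausible."""
    return max(candidates, key=lambda w: sequence_score(before + [w] + after))

print(pick_homophone(["buy"], ["two", "too"], ["tickets"]))  # two
print(pick_homophone(["me"], ["two", "too"], []))            # too
```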

Modern end-to-end models like NVIDIA's Parakeet combine these stages into a single neural network trained on thousands of hours of labeled speech data. Parakeet RNNT-1.1B (the model behind the Audio Transcriber) is trained on approximately 64,000 hours of English speech across diverse domains. Its Word Error Rate (WER) on standard benchmarks is approximately 5-8%, which is competitive with commercial cloud services.
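WER is the number of word substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of words in the reference. A minimal Python sketch of that calculation follows. Benchmark scoring tools also normalize punctuation and numbers, which this version does not do, and the function name is just for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = cur[j - 1] + 1   # extra word in output
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)

print(word_error_rate("we need two more chairs", "we need too more chairs"))  # 0.2
```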

Step-by-Step Transcription Process

Step 1: Prepare your audio file. Trim silence from the beginning and end if possible. This is not strictly necessary, but it prevents long silent openings from occasionally confusing some ASR models.

For recordings over 60 minutes, consider splitting into segments at natural pause points — chapter breaks, topic transitions, breaks in the recording. A single 3-hour recording will process correctly but takes longer to generate and is harder to review as a single block of text.
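If you already know where the pauses are (noted by hand or found with a silence detector), picking split points is simple bookkeeping. The Python sketch below is one way to do it. The function name and the 60-minute default are assumptions for this example, and it only chooses timestamps. Cutting the audio itself is a separate step in your audio editor.

```python
def choose_split_points(duration_s, pause_times_s, max_segment_s=3600.0):
    """Return split times (in seconds) so that each segment stays within
    max_segment_s, cutting at the latest known pause that fits."""
    splits = []
    segment_start = 0.0
    pauses = sorted(pause_times_s)
    while duration_s - segment_start > max_segment_s:
        limit = segment_start + max_segment_s
        candidates = [p for p in pauses if segment_start < p <= limit]
        # Fall back to a hard cut at the limit if no pause is available.
        cut = candidates[-1] if candidates else limit
        splits.append(cut)
        segment_start = cut
    return splits

# A 3-hour recording with pauses noted at these timestamps (seconds).
print(choose_split_points(10800, [3300, 3700, 6800, 7100, 10000]))
# [3300, 6800, 10000]
```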

Step 2: Check your file format. Supported formats include MP3, WAV, M4A, FLAC, and OGG. If you have a video file (MP4, MOV, MKV), the audio track is embedded in it — many transcription tools accept video files directly and extract the audio automatically.

If your file is in an unsupported format (AAC as a standalone file, WMA, AMR), convert it first. VLC (free, cross-platform) can convert between audio formats: Media → Convert / Save on Windows and Linux (the menu is named differently on macOS).
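When you have a folder of recordings, a quick extension check tells you which files need converting first. This Python sketch encodes only the formats listed above. The category labels are invented for the example, and a file extension does not guarantee the actual codec inside the file.

```python
from pathlib import Path

SUPPORTED_AUDIO = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}
VIDEO_CONTAINERS = {".mp4", ".mov", ".mkv"}
CONVERT_FIRST = {".aac", ".wma", ".amr"}

def classify_file(filename):
    """Classify a file by extension: 'supported', 'video', 'convert', or 'unknown'."""
    ext = Path(filename).suffix.lower()
    if ext in SUPPORTED_AUDIO:
        return "supported"
    if ext in VIDEO_CONTAINERS:
        return "video"      # the audio track may be extracted automatically
    if ext in CONVERT_FIRST:
        return "convert"    # convert to a supported format first
    return "unknown"

for name in ["Interview.MP3", "board-call.mov", "voicemail.amr"]:
    print(name, "->", classify_file(name))
```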

Step 3: Upload and transcribe. Open the Audio Transcriber and select your file. Despite the word "upload," the file is loaded into browser memory and is not sent to a server. The model also loads into browser memory, which takes a few seconds. Processing time is roughly 1 minute per 10 minutes of audio on a modern laptop CPU. A 30-minute interview will be ready in about 3 minutes.

Step 4: Review the output. The transcript appears as plain text. Scan through it before copying — pay particular attention to proper nouns, technical terms, and any numbers (phone numbers, dollar figures, dates).

Step 5: Copy and export. Copy the text into your working document, note-taking app, or text editor. At this point you have the raw transcript; see the post-editing section below for how to efficiently clean it up.

Accuracy: Realistic Expectations

The Audio Transcriber achieves WER of approximately 5-8% on standard English benchmarks. What this means in practice:

Clean audio, single speaker, quiet room: WER of 3-5%. A 10-minute interview will have roughly 45-75 word errors out of about 1,500 words (3-5% of 1,500). Most errors will be proper nouns and specialized vocabulary. Post-editing takes 5-10 minutes.

Meeting recordings, multiple speakers, moderate background noise: WER of 10-20%. A 30-minute call will have errors on specialized terms and anywhere two voices overlap, plus occasional garbled sections during periods of high noise. Post-editing takes 20-40 minutes.

Phone recordings, compressed audio codecs: Quality degrades significantly when audio is sampled at 8 kHz or below. Traditional telephone audio (G.711 codec, 8 kHz sample rate) cannot represent frequencies above 4 kHz, a range that carries important consonant information. Transcription accuracy on telephone-quality audio is noticeably lower than on wideband recordings.
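You can check whether a recording is telephone-band before transcribing it. The highest frequency a digital recording can represent is half its sample rate (the Nyquist limit), so an 8 kHz file tops out at 4 kHz. This Python sketch reads the sample rate from a WAV header using the standard library. The function names and the 16 kHz "wideband" threshold are assumptions for this example.

```python
import wave

def audio_bandwidth_hz(wav_file):
    """Return (sample_rate, highest representable frequency) for a WAV file.

    wav_file can be a path or an open binary file object.
    """
    with wave.open(wav_file, "rb") as wav:
        rate = wav.getframerate()
    return rate, rate / 2  # Nyquist limit: half the sample rate

def describe_band(sample_rate):
    """Rough label for the recording; the 16 kHz threshold is an assumption."""
    if sample_rate <= 8000:
        return "telephone-band: expect lower accuracy"
    if sample_rate < 16000:
        return "narrow: accuracy may suffer"
    return "wideband"
```

For example, `describe_band(audio_bandwidth_hz("call.wav")[0])` would flag a typical phone recording, where `call.wav` stands in for your own file.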

Overlapping speech: When two people talk simultaneously, ASR models of all types struggle. The acoustic signal is a superposition of two speakers, and separating them is a hard computational problem independent of transcription quality. The model will typically render one speaker's words with gaps or substitutions during overlap.

Post-Editing the Transcript

Budget your time based on audio quality:

  • Clean recordings: 5-10 minutes of editing per hour of audio
  • Standard meeting recordings: 20-30 minutes per hour
  • Poor quality or heavily overlapping speech: 40-60 minutes per hour
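The budget above scales linearly with recording length, which makes it easy to estimate for a batch of files. A small Python sketch using exactly those per-hour ranges (the quality labels are shorthand invented for this example):

```python
# Editing minutes per hour of audio, taken from the list above.
EDIT_MINUTES_PER_AUDIO_HOUR = {
    "clean": (5, 10),
    "meeting": (20, 30),
    "poor": (40, 60),
}

def editing_budget(audio_minutes, quality):
    """Return (low, high) estimated editing minutes for one recording."""
    low, high = EDIT_MINUTES_PER_AUDIO_HOUR[quality]
    hours = audio_minutes / 60
    return low * hours, high * hours

print(editing_budget(90, "meeting"))  # (30.0, 45.0)
```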

The error patterns from ASR are predictable. If you know what to look for, you can move through a transcript quickly.

Homophones. "Their," "there," and "they're" are the canonical example. "Two," "to," and "too." "Right" and "write." ASR gets these wrong when context is ambiguous, and context is ambiguous more often than you'd expect in transcript-length text. Search for these explicitly in long transcripts rather than waiting to encounter them.
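If you would rather script the search than use your editor's find box, a regular expression can list every homophone with a little surrounding context. This is a minimal Python sketch. The word list only covers the examples above, and common words such as "to" will produce many hits, so trim the list to suit your transcript.

```python
import re

HOMOPHONES = ["their", "there", "they're", "two", "to", "too", "right", "write"]
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in HOMOPHONES) + r")\b",
    re.IGNORECASE,
)

def find_homophones(text, context=20):
    """Return (word, surrounding snippet) for each homophone in the text."""
    hits = []
    for match in PATTERN.finditer(text):
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        hits.append((match.group(0).lower(), text[start:end]))
    return hits

for word, snippet in find_homophones("We need too more chairs over their."):
    print(f"{word!r}: ...{snippet}...")
```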

Proper nouns. Names of people, companies, products, and places are the largest source of errors. The model has never been trained specifically on "Nguyen" or "Palantir" or "Benioff." It will make its best phonemic guess, which may or may not be correct. Before transcribing, write down every proper noun you expect to appear in the recording, then check each one in the output.
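Checking a list of names against the transcript can be partly automated. The Python sketch below marks each expected name as found or suggests similarly spelled words that might be the model's guess at it. It uses the standard library's `difflib`. The function name and the 0.7 similarity cutoff are assumptions for this example, it handles single-word names only, and it cannot catch guesses that sound alike but are spelled very differently.

```python
import difflib
import re

def check_proper_nouns(transcript, expected_names, cutoff=0.7):
    """Map each expected name to 'found' or a list of similar transcript words."""
    words = sorted(set(re.findall(r"[A-Za-z']+", transcript)))
    lowered = {word.lower() for word in words}
    report = {}
    for name in expected_names:
        if name.lower() in lowered:
            report[name] = "found"
        else:
            # Similar spellings are likely mis-transcriptions worth reviewing.
            report[name] = difflib.get_close_matches(name, words, n=3, cutoff=cutoff)
    return report

text = "We spoke with Mark Benyoff about the Palantir contract."
print(check_proper_nouns(text, ["Benioff", "Palantir", "Nguyen"]))
```

An empty list means the name was neither found nor closely matched, so search for it by ear in that part of the recording.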

Numbers and identifiers. Phone numbers spoken as individual digits (five, five, five, eight, six, seven, five, three, zero, nine) will sometimes be rendered as "five five five eight six seven five three zero nine" and sometimes as a number (5558675309). Neither is wrong — both require editing. Dollar amounts and percentages are usually handled correctly. Model numbers, serial numbers, and codes (like zip codes or reference numbers) are inconsistent.
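For spelled-out digit runs, a small script can do the first pass. This Python sketch collapses three or more consecutive spoken digits into a digit string and leaves isolated number words alone. The function name and the `min_run` threshold are assumptions for this example, it normalizes whitespace as a side effect, and it does not handle forms like "oh" for zero or "double five."

```python
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

def collapse_spoken_digits(text, min_run=3):
    """Replace runs of at least min_run spoken digits with a digit string."""
    out = []
    run = []  # consecutive tokens that are spoken digits

    def flush():
        if len(run) >= min_run:
            digits = "".join(DIGIT_WORDS[t.strip(",.").lower()] for t in run)
            tail = run[-1][-1] if run[-1][-1] in ",." else ""
            out.append(digits + tail)  # keep trailing punctuation
        else:
            out.extend(run)            # short runs stay as words
        run.clear()

    for token in text.split():
        if token.strip(",.").lower() in DIGIT_WORDS:
            run.append(token)
            if token.endswith("."):
                flush()                # do not merge across sentences
        else:
            flush()
            out.append(token)
    flush()
    return " ".join(out)

print(collapse_spoken_digits(
    "call five five five, eight six seven, five three zero nine tomorrow"
))  # call 5558675309 tomorrow
```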

Sentence boundaries. ASR output typically lacks paragraph breaks and may insert or omit sentence-final periods. For a final transcript used as a formal document, you will need to add paragraph breaks manually. For a transcript being fed into an AI summarization tool, raw run-on text usually works fine — the AI handles the structure.
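If the transcript already has sentence-final punctuation, a rough first pass at paragraphing can be scripted and then adjusted by hand. This Python sketch simply starts a new paragraph every few sentences. The function name and the default of four sentences are arbitrary choices for this example, and abbreviations such as "Dr." will be treated as sentence ends.

```python
import re

def add_paragraph_breaks(text, sentences_per_paragraph=4):
    """Group sentences into fixed-size paragraphs separated by blank lines."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    paragraphs = [
        " ".join(sentences[i:i + sentences_per_paragraph])
        for i in range(0, len(sentences), sentences_per_paragraph)
    ]
    return "\n\n".join(paragraphs)

print(add_paragraph_breaks("One. Two. Three. Four. Five.", sentences_per_paragraph=2))
```

Breaks placed this way ignore topic changes, so treat the result as a starting point for manual paragraphing.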

Language Support

The Audio Transcriber is optimized for English, trained primarily on American and British English speech data. It handles a wide range of English accents well.

For other languages, Whisper is the better option. OpenAI's Whisper model supports 57+ languages with automatic language detection. Whisper runs in-browser via whisper.cpp (WebAssembly port), maintaining the same local-processing privacy model. For multilingual recordings (a meeting that switches between English and Spanish, or a French interview), Whisper is the right tool.

The accuracy hierarchy for non-English languages in Whisper: European languages with large internet corpora (French, German, Spanish, Italian, Portuguese) achieve WER near 5-10%. Less-resourced languages achieve higher WER. Quality correlates with language prevalence in Whisper's training data.

Use Cases Where Local Processing Matters Most

Meeting transcriptions are the most common use case where the privacy argument is concrete. Business strategy, personnel matters, board discussions, client negotiations — standard meeting recordings contain content that belongs under confidentiality. Browser-based transcription means none of it leaves the device.

Legal, medical, and HR audio carries stricter handling requirements. Attorney-client conversations, covered medical dictation, and HR investigation recordings are categories where transmitting the audio to a third party needs deliberate authorization. Local processing keeps the transcription step on the device.

Journalists and researchers who conduct interviews with confidential sources have similar needs. The audio should not be on a third-party server in any form.

Podcast transcript generation is a case where privacy matters less but local processing still has practical advantages: no upload wait time for large files, no per-minute charges, no account required.