Whisper Model Sizes: Which One Should You Use?
WisperCode Team · January 27, 2026 · 11 min read
TL;DR: Whisper comes in five sizes. Base is the best starting point for most users. Small offers noticeably better accuracy if you have 2 GB or more of RAM to spare. Large-v3 is the gold standard for accuracy but needs around 10 GB of RAM and benefits significantly from a GPU. Pick the largest model your hardware can run comfortably.
Whisper Model Overview
OpenAI released Whisper in multiple sizes so the same speech recognition technology can run on everything from a Raspberry Pi to a workstation with a high-end GPU. Each step up in size brings better accuracy at the cost of more RAM, more disk space, and slower processing.
If you are not familiar with Whisper itself, start with What is OpenAI Whisper for background on how the model works, its strengths, and its limitations. This article focuses specifically on choosing the right model size for your hardware and use case.
Complete Model Comparison
| Model | Parameters | RAM Required | Disk Size | Relative Speed | English WER (Approx) | Best For |
|---|---|---|---|---|---|---|
| tiny | 39M | ~1 GB | ~75 MB | 10x (fastest) | ~7.7% | Low-end hardware, quick testing |
| base | 74M | ~1 GB | ~150 MB | 7x | ~5.5% | Daily dictation, general use |
| small | 244M | ~2 GB | ~500 MB | 4x | ~4.4% | Better accuracy, accented speech |
| medium | 769M | ~5 GB | ~1.5 GB | 2x | ~3.9% | Professional transcription |
| large-v3 | 1.55B | ~10 GB | ~3 GB | 1x (slowest) | ~3.0% | Maximum accuracy, multilingual |
The speed column shows approximate speed relative to large-v3: "10x" means the tiny model processes audio roughly ten times faster than large-v3. WER stands for word error rate, the percentage of words the model gets wrong on clean English audio. Lower is better.
These numbers are approximate and vary depending on your hardware, audio quality, and the content being transcribed. They serve as a useful baseline for comparison.
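If you experiment with the open-source openai-whisper Python package directly (WisperCode handles this for you), the size is just a string passed to `load_model`. A minimal sketch; the helper name and lazy import are our own convention, not part of the library:

```python
# Model names accepted by openai-whisper's load_model(), smallest to largest.
WHISPER_SIZES = ["tiny", "base", "small", "medium", "large-v3"]

def transcribe(audio_path: str, size: str = "base") -> str:
    """Transcribe one audio file with the requested model size."""
    if size not in WHISPER_SIZES:
        raise ValueError(f"unknown Whisper size: {size!r}")
    import whisper  # imported lazily: the package and model weights are heavy
    model = whisper.load_model(size)  # downloads and caches weights on first use
    return model.transcribe(audio_path)["text"]
```

Swapping `"base"` for `"small"` or `"large-v3"` is the entire change; everything else in the pipeline stays the same.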
Tiny
The tiny model is Whisper's smallest offering at 39 million parameters and roughly 75 MB on disk. It runs on virtually any hardware, including single-board computers like the Raspberry Pi 4.
When to use it: You are testing Whisper for the first time and want instant results. You are running on very constrained hardware with less than 2 GB of available RAM. You need the absolute fastest transcription possible and can tolerate more errors.
The trade-off: Accuracy drops noticeably compared to larger models, especially with accented speech, technical terminology, and background noise. The tiny model makes roughly 40% more errors than the base model on clean English audio, and the gap widens in challenging conditions. You will see more misheard words, missed punctuation, and garbled proper nouns.
Realistic expectation: Fine for quick drafts and informal notes. Not reliable enough for anything where accuracy matters.
Base
The base model nearly doubles the parameter count to 74 million while staying at roughly 1 GB of RAM usage. It is the default model in WisperCode and the one we recommend as a starting point for most users.
When to use it: You want a good balance between speed and accuracy on any modern machine. You are doing everyday voice dictation: emails, documents, messages, and notes. Your hardware has at least 4 GB of total RAM.
Why it works well for dictation: For spoken English in a reasonably quiet environment, the base model delivers clean, usable text. Transcription typically takes one to two seconds for a sentence on Apple Silicon or a recent Intel or AMD processor. The error rate is low enough that you spend more time dictating and less time correcting mistakes.
Realistic expectation: Handles standard English dictation well. Occasional mistakes on unusual names, technical jargon, and heavily accented speech. A solid daily driver.
Small
The small model more than triples the base model's parameters to 244 million. It needs roughly 2 GB of RAM and 500 MB of disk space. This is where accuracy starts to feel genuinely reliable.
When to use it: You have 8 GB or more of total RAM and want better accuracy than the base model. You regularly dictate technical terms, medical vocabulary, or work with accented English. You work in noisier environments where the base model makes too many mistakes.
What improves: The small model handles accents noticeably better. Words that the base model consistently misheard start coming through correctly. Background noise causes fewer errors. Technical vocabulary like software terms, scientific names, and legal phrases is transcribed more reliably.
Realistic expectation: A meaningful step up from base. If you found the base model "close but not quite" for your workflow, the small model will likely fix most of those issues. Processing is about half the speed of base, but still fast enough for comfortable dictation on modern hardware.
Medium
The medium model is a serious step up at 769 million parameters. It needs roughly 5 GB of RAM and 1.5 GB of disk space. Processing takes about twice as long as the small model.
When to use it: You prioritize accuracy over speed. You are doing professional transcription where errors are costly. You dictate medical notes, legal documents, or financial reports. You regularly work with non-English languages. See our guide on voice dictation for sensitive documents for workflows where this level of accuracy matters.
What improves: Error rates drop below 4% on clean English. The medium model handles complex sentences, domain-specific terminology, and code-switching between languages more gracefully. Punctuation placement is more reliable. Proper nouns and brand names that stumped smaller models are recognized more consistently.
Realistic expectation: Excellent accuracy for professional use. The processing delay is noticeable (two to four seconds for a typical dictation) but not disruptive for most workflows. Needs a machine with at least 8 GB of total RAM, ideally 16 GB.
Large-v3
The large-v3 model is Whisper's flagship at 1.55 billion parameters. It needs roughly 10 GB of RAM and 3 GB of disk space. It is the slowest model but delivers the best accuracy available.
When to use it: You need the highest possible accuracy and have the hardware to support it. You transcribe audio in multiple languages or switch between languages within a single session. You work with difficult audio: heavy accents, background noise, overlapping speakers, or low-quality recordings. You are doing professional transcription where every word matters.
What improves: Word error rate drops to approximately 3% on clean English, approaching the accuracy of professional human transcribers (2-4% WER). Multilingual performance improves dramatically. Large-v3 supports 99 languages, and the accuracy gap between English and other languages is smaller than with any other model size. Rare words, proper nouns, and technical terms are handled more reliably than with any smaller model.
Realistic expectation: The gold standard for Whisper accuracy. Requires a machine with 16 GB or more of RAM. Benefits significantly from a dedicated GPU (NVIDIA with CUDA) or an M-series Mac with unified memory. Processing takes five to ten seconds for a typical dictation on CPU, or one to two seconds on a good GPU. If your hardware can handle it, this is the model to use.
How to Choose: Decision Flowchart
Start with how much RAM your machine has, then consider your use case.
Under 2 GB available RAM: Use tiny. It is your only realistic option. Upgrade your hardware when you can.
2-4 GB available RAM: Use base. It gives you the best accuracy-to-resource ratio in this range.
4-8 GB available RAM: Start with base. Try small if you want better accuracy and the slight speed reduction is acceptable.
8-16 GB available RAM: Use small as your default. Try medium if accuracy is a priority and you do not mind the slower processing.
16 GB or more with GPU or Apple Silicon: You can run any model. Start with small or medium for daily dictation. Use large-v3 when you need maximum accuracy, multilingual support, or are dealing with difficult audio.
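The RAM bands above can be condensed into a small helper. The thresholds follow the flowchart directly, but the function name and exact cutoffs are our own sketch; treat the output as a starting point, not a rule:

```python
def pick_model(available_ram_gb: float, has_gpu_or_apple_silicon: bool = False) -> str:
    """Recommend a Whisper size from available RAM, per the flowchart above."""
    if available_ram_gb < 2:
        return "tiny"
    if available_ram_gb < 8:
        return "base"    # in the 4-8 GB band, try "small" if accuracy matters more
    if available_ram_gb < 16:
        return "small"   # try "medium" when accuracy is the priority
    # 16 GB or more: any model runs; large-v3 pays off most with a GPU or Apple Silicon
    return "large-v3" if has_gpu_or_apple_silicon else "medium"
```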
For more guidance on getting Whisper and other AI models running efficiently on your hardware, see our guide to running AI models locally.
Changing Models in WisperCode
Switching between models in WisperCode takes about thirty seconds.
- Open Settings.
- Navigate to the Model section.
- Select the model size you want.
- If you have not downloaded that model before, the download starts automatically. Models range from 75 MB (tiny) to 3 GB (large-v3), so download time depends on your internet speed.
- Once downloaded, the model is cached locally. Switching back to a previously downloaded model is instant.
You can have multiple models downloaded at the same time. There is no need to delete one before downloading another. WisperCode only loads the active model into memory, so unused models just sit on disk without consuming RAM.
Accuracy Tips Beyond Model Size
Picking the right model is the single biggest lever for accuracy, but it is not the only one. Here are three things that improve transcription quality regardless of which model you use.
Vocabulary hints. Every Whisper model supports an initial prompt parameter that steers the model toward expected terms. If you regularly dictate words like "Kubernetes," "HIPAA," "amoxicillin," or your company's brand names, adding them as vocabulary hints makes a measurable difference. WisperCode lets you define a custom dictionary that is automatically passed to Whisper. See our vocabulary hints guide for a step-by-step walkthrough.
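With the open-source openai-whisper package, the hint is passed as the `initial_prompt` argument to `transcribe`. A sketch, assuming a hypothetical glossary list of your own terms; WisperCode's custom dictionary does the equivalent for you:

```python
def build_prompt(terms: list[str]) -> str:
    """Fold vocabulary hints into the free-text prompt Whisper conditions on."""
    return "Glossary: " + ", ".join(terms) + "."

def transcribe_with_hints(audio_path: str, terms: list[str], size: str = "base") -> str:
    import whisper  # lazy import: heavy dependency
    model = whisper.load_model(size)
    # initial_prompt biases decoding toward the listed spellings
    return model.transcribe(audio_path, initial_prompt=build_prompt(terms))["text"]

# e.g. build_prompt(["Kubernetes", "HIPAA", "amoxicillin"])
# -> "Glossary: Kubernetes, HIPAA, amoxicillin."
```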
A good microphone. Audio quality has a direct impact on transcription accuracy across all model sizes. A dedicated USB microphone or a quality headset produces cleaner audio than a built-in laptop microphone, which means fewer errors. The difference is especially noticeable with smaller models that are more sensitive to background noise. Our best microphones for voice dictation guide covers specific recommendations.
A quiet environment. Background noise is one of the biggest sources of transcription errors. Air conditioning, keyboard clicks, conversations in the next room, and television audio all degrade accuracy. When possible, dictate in a quiet space. If that is not an option, use a directional microphone or headset to reduce ambient noise pickup.
Frequently Asked Questions
Can I have multiple models downloaded at the same time?
Yes. Each model is stored as a separate file on disk. You can download all five sizes and switch between them freely. Only the model you are actively using is loaded into RAM. The others sit on disk and take up storage space but nothing else. If disk space is limited, you can delete models you do not use from WisperCode's settings.
How long does it take to download a model?
It depends on the model size and your internet speed. On a fast broadband connection (around 150 Mbps of real throughput), the tiny model downloads in under five seconds, the base model in about ten seconds, and the small model in about thirty seconds. The medium model takes one to two minutes, and large-v3 takes three to five minutes. Slower connections scale accordingly. These are one-time downloads. Once cached, you never need to download the same model again.
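The arithmetic behind these estimates is simple: file size converted to megabits, divided by link speed. A quick sketch for checking your own connection:

```python
def download_seconds(size_mb: float, link_mbps: float) -> float:
    """Approximate download time: megabytes -> megabits, then divide by link speed."""
    return size_mb * 8 / link_mbps

# The ~3 GB large-v3 model on a 150 Mbps link:
# download_seconds(3000, 150) -> 160.0 seconds, a little under three minutes
```

Real-world throughput usually falls short of the advertised link speed, so treat the result as a lower bound.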
Will a bigger model make my computer slow?
Only during active transcription, which typically takes one to ten seconds depending on the model and your hardware. While the model is processing audio, it uses significant CPU or GPU resources. Once transcription is complete, those resources are freed. For dictation use, where you speak for a few seconds at a time and then continue working, even the large-v3 model does not cause noticeable slowdown in your other applications. WisperCode runs transcription in a background thread to keep your system responsive.
Is large-v3 worth it?
If you have the hardware (16 GB RAM, ideally with a GPU or Apple Silicon), and you need the best possible accuracy, yes. The jump from small to large-v3 reduces word errors by roughly 30%. That means fewer corrections, less editing, and more trust in your transcriptions. For multilingual use, the improvement is even more significant. If your hardware can handle it comfortably and accuracy is important to your workflow, large-v3 is worth the extra resources.
Try WisperCode free during beta → Download
Related Articles
What Is OpenAI Whisper? A Plain-English Guide
OpenAI Whisper is an open-source speech recognition model that runs locally on your device. Learn how it works, which model to pick, and why it matters for privacy.
February 7, 2026 · 15 min read
Best Voice Dictation Software in 2026
A detailed comparison of the best voice dictation tools in 2026, including WisperCode, Dragon, macOS Dictation, Windows Speech, and more. Privacy, accuracy, and price compared.
February 6, 2026 · 18 min read
Why Local Speech Recognition Changes Everything
Cloud-based dictation is convenient. Local dictation is better. Here is why we bet everything on on-device processing.
February 5, 2026 · 13 min read