Voxtral Transcribe 2

simonw · 2026-02-04T16:21:17.000Z 1770222077

This demo is really impressive: https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim...

Don't be confused if it says "no microphone", the moment you click the record button it will request browser permission and then start working.

I spoke fast and dropped in some jargon and it got it all right - I said this and it transcribed it exactly right, WebAssembly spelling included:

> Can you tell me about RSS and Atom and the role of CSP headers in browser security, especially if you're using WebAssembly?

Oras · 2026-02-04T16:25:22.000Z 1770222322

Thank you for the link! Their playground in Mistral does not have a microphone. It just uploads files, which does not demonstrate the speed and accuracy, but the link you shared does.

I tried speaking in 2 languages at once, and it picked it up correctly. Truly impressive for real-time.

druskacik · 2026-02-04T18:11:16.000Z 1770228676

According to the announcement blog Le Chat is powered by the new model as well: https://chat.mistral.ai/chat

tekacs · 2026-02-04T16:41:58.000Z 1770223318

Having built with and tried every voice model over the last three years, real time and non-real time... this is off the charts compared to anything I've seen before.

And open weight too! So grateful for this.

daemonologist · 2026-02-04T16:48:11.000Z 1770223691

404 on https://mistralai-voxtral-mini-realtime.hf.space/gradio_api/... for me (which shows up in the UI as a little red error in the top right).

jaggederest · 2026-02-04T17:03:00.000Z 1770224580

It can transcribe Eminem's Rap God fast sequence, really, really impressive.

rafram · 2026-02-04T17:32:04.000Z 1770226324

That's almost certainly in the training data, to be fair.

keeganpoppen · 2026-02-04T18:49:07.000Z 1770230947

what a great test hahah

pyprism · 2026-02-04T17:18:52.000Z 1770225532

Wow, that’s weird. I tried Bengali, but the text was transcribed into Hindi! I know there are some similar words in these languages, but I used pure Bengali that is not similar to Hindi.

derefr · 2026-02-04T17:51:28.000Z 1770227488

Well, on the linked page, it mentions "strong transcription performance in 13 languages, including [...] Hindi" but with no mention of Bengali. It probably doesn't know a lick of Bengali, and is just trying to snap your words into the closest language it does know.

keeganpoppen · 2026-02-04T18:49:59.000Z 1770230999

it must have some exposure to bengali— just not enough for them to advertise it. otherwise it would have a damn hard time.

carbocation · 2026-02-04T18:10:06.000Z 1770228606

This model was able to transcribe Bad Bunny lyrics over the sound of the background music, played casually from my speakers. Impressive, to me.

sheepscreek · 2026-02-04T18:04:33.000Z 1770228273

I’ve been using AquaVoice for real-time transcription for a while now, and it has become a core part of my workflow. It gets everything, jargon, capitalization, everything. Now I’m looking forward to doing that with 100% local inference!

rafram · 2026-02-04T17:35:38.000Z 1770226538

Not terrible. It missed or mixed up a lot of words when I was speaking quickly (and not enunciating very well), but it does well with normal-paced speech.

yko · 2026-02-04T19:10:46.000Z 1770232246

Played with the demo a bit. It's really good at English, and detects language change on the fly. Impressive.

But whatever I tried, it could not recognise my Ukrainian and would default to Russian with an absolutely ridiculous transcription. Other STT models recognise Ukrainian consistently, so I assume there is a lot of Russian in the training material, and zero Ukrainian. Made me really sad.

breisa · 2026-02-04T19:13:43.000Z 1770232423

That's just the result of the model only supporting Russian (and 12 other languages) and not Ukrainian. It maps to the closest words from training data.

iagooar · 2026-02-04T18:32:56.000Z 1770229976

In English it is pretty good. But talk to it in Polish, and suddenly it thinks you speak Russian? Ukrainian? Belarusian? I would understand if an American company launched this, but for a company so proud of its European roots, I think it should have better support for major European languages.

I tried English + Polish:

> All right, I'm not really sure if transcribing this makes a lot of sense. Maybe not. A цьому nie mówisz po polsku. A цьому nie mówisz po polsku, nie po ukrańsku.

yko · 2026-02-04T19:13:07.000Z 1770232387

That's a mix of Polish and Ukrainian in the transcript. Now, if I try speaking Ukrainian, I'm getting a transcript in Russian every time. That's upsetting.

tdb7893 · 2026-02-04T18:37:21.000Z 1770230241

Yeah, it's too bad. Apparently it only performs well in certain languages: "The model is natively multilingual, achieving strong transcription performance in 13 languages, including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch"

mystifyingpoi · 2026-02-04T18:36:10.000Z 1770230170

TBH ChatGPT does the same when I mix Polish and English. I generally get some Cyrillic characters and it gets super confused.

dmix · 2026-02-04T16:07:16.000Z 1770221236

> At approximately 4% word error rate on FLEURS and $0.003/min

Amazon's transcription service is $0.024 per minute, pretty big difference https://aws.amazon.com/transcribe/pricing/

mdrzn · 2026-02-04T16:09:38.000Z 1770221378

Is it 0.003 per minute of audio uploaded, or "compute minute"?

For example fal.ai has a Whisper API endpoint priced at "$0.00125 per compute second" which (at 10-25x realtime) is MUCH cheaper than all the competitors.
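To put the two pricing units side by side, here is a rough conversion from a per-compute-second price to a per-audio-minute price. It takes the figures quoted in this thread at face value and assumes billing scales purely with compute time; the 10x and 25x realtime factors are the estimate given above, not measured values, and `cost_per_audio_minute` is just an illustrative helper.

```python
# Prices as quoted in the thread (USD).
PRICE_PER_COMPUTE_SECOND = 0.00125  # fal.ai Whisper endpoint
VOXTRAL_PER_AUDIO_MINUTE = 0.003    # from the announcement quote
AMAZON_PER_AUDIO_MINUTE = 0.024     # Amazon Transcribe


def cost_per_audio_minute(price_per_compute_second: float, realtime_factor: float) -> float:
    """Convert a compute-second price into a price per minute of audio.

    A realtime factor of N means 60 seconds of audio take 60 / N seconds of compute.
    """
    compute_seconds = 60.0 / realtime_factor
    return price_per_compute_second * compute_seconds


# Assumed realtime factors: the 10-25x range mentioned in the comment.
for factor in (10, 25):
    cost = cost_per_audio_minute(PRICE_PER_COMPUTE_SECOND, factor)
    print(f"{factor}x realtime: ${cost:.4f} per audio minute")
```

On these numbers the endpoint comes out between roughly $0.003 and $0.0075 per audio minute, which is well under the $0.024 Amazon price quoted above and, at the fast end, about level with the $0.003/min figure.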

Oras · 2026-02-04T16:30:51.000Z 1770222651

I think the point is having it for real-time; this is for conversations rather than transcribing audio files.

jamilton · 2026-02-04T17:52:51.000Z 1770227571

That quote was for the non-realtime model.

janalsncm · 2026-02-04T17:41:39.000Z 1770226899

I noticed that this model is multilingual and understands 13 languages. For many use cases, we probably only need a single language, and the extra 12 are simply adding extra latency. I believe there will be a trend in the coming years of trimming the fat off of these jack-of-all-trades models.

https://aclanthology.org/2025.findings-acl.87/

decide1000 · 2026-02-04T17:51:35.000Z 1770227495

I think this model proves it's very efficient and accurate.

popalchemist · 2026-02-04T18:27:54.000Z 1770229674

It doesn't make sense to have a language-restricted transcription model because of code switching. People aren't machines, we don't stick to our native languages without fail. Even monolingual people move in and out of their native language when using "borrowed" words/phrases. A single-language model will often fail to deal with that.

javier123454321 · 2026-02-04T18:55:53.000Z 1770231353

yeah, one example I run into is getting my Perplexity phone assistant to play a song in Spanish. I cannot for the life of me get a model to translate: "Play señorita a mi me gusta su style on spotify" correctly

keeganpoppen · 2026-02-04T18:51:55.000Z 1770231115

uhhh i cast doubt on multi-language support as affecting latency. model size, maybe, but what is the mechanism for making latency worse? i think of model latency as O(log(model size))… but i am open to being wrong / that being a not-good mental model / educated guess.

make3 · 2026-02-04T19:06:24.000Z 1770231984

model size directly affects latency

numbers · 2026-02-04T19:20:41.000Z 1770232841

does anyone know if there are any desktop tools I can use this transcription model with? e.g. something like Wispr Flow/WillowVoice but with custom model selection

pietz · 2026-02-04T16:47:16.000Z 1770223636

Do we know if this is better than Nvidia Parakeet V3? That has been my go-to model locally and it's hard to imagine there's something even better.

m1el · 2026-02-04T18:29:05.000Z 1770229745

I've been using Nemotron ASR with my own ported inference, and I'm happy with it:

https://huggingface.co/nvidia/nemotron-speech-streaming-en-0...

https://github.com/m1el/nemotron-asr.cpp https://huggingface.co/m1el/nemotron-speech-streaming-0.6B-g...

Multicomp · 2026-02-04T19:19:59.000Z 1770232799

I'm so amazed to find out just how close we are to the Star Trek voice computer.

I used to use Dragon Dictation to draft my first novel, had to learn a 'language' to tell the rudimentary engine how to recognize my speech.

And then I discovered [1] and have been using it for some basic speech recognition, amazed at what a local model can do.

But it can't transcribe any text until I finish recording a file, and then it starts work, so very slow batches in terms of feedback latency cycles.

And now you've posted this cool solution which streams audio chunks to a model in infinitesimally small pieces, amazing, just amazing.

Now if only I can figure out how to contribute to Handy or similar to do that Speech To Text in a streaming mode, STT locally will be a solved problem for me.

[1] https://github.com/cjpais/Handy
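The difference between the two modes is mostly in how audio reaches the model: a batch tool waits for the whole recording, while a streaming setup hands over short slices as they arrive. Below is a minimal sketch of the slicing side only, assuming 16 kHz mono samples and an 80 ms chunk length. Both values are illustrative, and `iter_chunks` is a hypothetical helper, not part of Handy or of any Voxtral API.

```python
from typing import Iterator, List


def iter_chunks(samples: List[int], sample_rate: int, chunk_ms: int) -> Iterator[List[int]]:
    """Yield consecutive fixed-duration slices of a mono PCM sample buffer."""
    chunk_size = sample_rate * chunk_ms // 1000  # samples per chunk
    if chunk_size <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(samples), chunk_size):
        # The final slice may be shorter than chunk_size.
        yield samples[start:start + chunk_size]


# One second of silence at 16 kHz, cut into 80 ms pieces (illustrative values).
audio = [0] * 16000
chunks = list(iter_chunks(audio, sample_rate=16000, chunk_ms=80))
print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

In a real streaming client each slice would be sent to the model as soon as it is captured, so partial text can come back before the recording ends.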

czottmann · 2026-02-04T17:54:16.000Z 1770227656

I liked Parakeet v3 a lot until it started to drop whole sentences, willy-nilly.

tylergetsay · 2026-02-04T17:01:15.000Z 1770224475

I've been using Parakeet V3 locally and, totally anecdotally, this feels more accurate but slightly slower

whinvik · 2026-02-04T18:01:27.000Z 1770228087

Came here to ask the same question!

fph · 2026-02-04T18:50:39.000Z 1770231039

Is there an open source Android keyboard that would support it? Everything I find is based on Whisper, which is from 2022. Ages ago given how fast AI is evolving.

observationist · 2026-02-04T15:53:39.000Z 1770220419

Native diarization, this looks exciting. edit: or not, no diarization in real-time.

https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...

~9GB model.

coder543 · 2026-02-04T16:16:09.000Z 1770221769

The diarization is on Voxtral Mini Transcribe V2, not Voxtral Mini 4B.

observationist · 2026-02-04T16:30:57.000Z 1770222657

Ahh, yeah, and it's explicitly not working for realtime streams. Good catch!

sbrother · 2026-02-04T16:54:40.000Z 1770224080

Do you have experience with that model for diarization? Does it feel accurate, and what's its realtime factor on a typical GPU? Diarization has been the biggest thorn in my side for a long time..

coder543 · 2026-02-04T17:22:27.000Z 1770225747

> Do you have experience with that model

No, I just heard about it this morning.

ashenke · 2026-02-04T19:03:41.000Z 1770231821

You can test it yourself for free on https://console.mistral.ai/build/audio/speech-to-text. I tried it on an English-language podcast episode, and apart from identifying one host as two different speakers (but only once, for a few sentences at the start), the rest was flawless from what I could see.

mdrzn · 2026-02-04T16:03:32.000Z 1770221012

There's no comparison to Whisper Large v3 or other Whisper models..

Is it better? Worse? Why do they only compare to gpt4o mini transcribe?

tekacs · 2026-02-04T16:11:19.000Z 1770221479

WER is slightly misleading, but Whisper Large v3 WER is classically around 10%, I think, and 12% with Turbo.

The thing that makes it particularly misleading is that models that do transcription to lowercase and then use inverse text normalization to restore structure and grammar end up making a very different class of mistakes than Whisper, which goes directly to final form text including punctuation and quotes and tone.

But nonetheless, they're claiming such a lower error rate than Whisper that it's almost not in the same bucket.
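For reference, WER is the word-level edit distance between the reference and the transcript, divided by the number of reference words. The sketch below shows how much formatting alone can move that number. The normalization step (lowercasing and stripping punctuation) is a crude illustrative assumption, not the normalizer used by any particular benchmark.

```python
import re


def wer(reference: str, hypothesis: str, normalize: bool = False) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    if normalize:
        # Crude normalization (an assumption): lowercase and drop punctuation.
        reference = re.sub(r"[^\w\s]", "", reference.lower())
        hypothesis = re.sub(r"[^\w\s]", "", hypothesis.lower())
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


reference = "Hello, world. How are you?"
hypothesis = "hello world how are you"
print(wer(reference, hypothesis))                  # scored on raw text
print(wer(reference, hypothesis, normalize=True))  # scored after normalization
```

Here every word is recognized correctly, yet the raw score counts four of the five words as errors because of casing and punctuation, while the normalized score is zero. That gap is why WER figures from models with different output conventions are hard to compare directly.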

tekacs · 2026-02-04T16:12:00.000Z 1770221520

On the topic of things being misleading, GPT-4o transcriber is a very _different_ transcriber to Whisper. I would say not better or worse, despite characterizations as such. So it is a little difficult to compare on just the numbers.

There's a reason that quite a lot of good transcribers still use V2, not V3.

satvikpendem · 2026-02-04T16:41:08.000Z 1770223268

Different how?

GaggiX · 2026-02-04T16:07:35.000Z 1770221255

Gpt4o mini transcribe is better and actually realtime. Whisper is trained to encode the entire audio (or at least 30s chunks) and then decode it.

mdrzn · 2026-02-04T16:10:28.000Z 1770221428

So "gpt4o mini transcribe" is not just whisper v3 under the hood? Btw it's $0.006 / minute

For Whisper API online (with v3 large) I've found "$0.00125 per compute second" which is the cheapest absolute I've ever found.

breisa · 2026-02-04T19:05:21.000Z 1770231921

Deepinfra offers Whisper V3 at 0.00045$ / minute of transcribed audio.

GaggiX · 2026-02-04T16:13:00.000Z 1770221580

>So it's not just whisper v3 under the hood?

Why should it be Whisper v3? They even released an open model: https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-26...

emmettm · 2026-02-04T16:09:51.000Z 1770221391

The linked article claims the average word error rate for Voxtral mini v2 is lower than GPT-4o mini transcribe

GaggiX · 2026-02-04T16:11:11.000Z 1770221471

Gpt4o mini transcribe is better than whisper, the context is the parent comment.

gwerbret · 2026-02-04T18:19:15.000Z 1770229155

I really wish those offering speech-to-text models provided transcription benchmarks specific to particular fields of endeavor. I imagine performance would vary wildly when using jargon peculiar to software development, medicine, physics, and law, as compared to everyday speech. Considering that "enterprise" use is often specialized or sub-specialized, it seems like they're leaving money on Dragon's table by not catering to any of those needs.

satvikpendem · 2026-02-04T16:39:08.000Z 1770223148

Looks like this model doesn't do realtime diarization, what model should I use if I want that? So far I've only seen paid models do diarization well. I heard about Nvidia NeMo but haven't tried it and don't even know where to try it out.

breisa · 2026-02-04T19:10:54.000Z 1770232254

Not sure if it's "realtime" but the recently released VibeVoice-ASR from Microsoft does do diarization. https://huggingface.co/microsoft/VibeVoice-ASR

jiehong · 2026-02-04T18:46:43.000Z 1770230803

It’s nice, but the previous version wasn’t actually that great compared to Parakeet for example.

We need better independent comparison to see how it performs against the latest Qwen3-ASR, and so on.

I can no longer take at face value the cherry picked comparisons of the companies showing off their new models.

For now, NVIDIA Parakeet v3 is the best for my use case, and runs very fast on my laptop or my phone.

nodja · 2026-02-04T18:50:47.000Z 1770231047

There is https://huggingface.co/spaces/hf-audio/open_asr_leaderboard but it hasn't been updated for half a year.

archb · 2026-02-04T18:53:50.000Z 1770231230

I like Parakeet as well and use it via Handy on Mac. What app are you using on your phone?

jiehong · 2026-02-04T18:58:10.000Z 1770231490

Spokenly has it on Mac and iOS, in both cases for free when using Parakeet

aavci · 2026-02-04T16:40:40.000Z 1770223240

What's the cheapest device specs that this could realistically run on?

kamranjon · 2026-02-04T17:10:33.000Z 1770225033

I haven't quite figured out if the open weights they released on huggingface amount to being able to run the (realtime) model locally - i hope so though! For the larger model with diarization I don't think they open sourced anything.

XCSme · 2026-02-04T17:43:21.000Z 1770227001

Is it just me, or is an error rate of 3% really high?

If you transcribe a minute of conversation, you'll have like 5 words transcribed wrongly. In an hour podcast, that is 300 wrongly transcribed words.
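As a back-of-the-envelope check, the expected number of wrong words is just error rate times words spoken. The speaking rate of 150 words per minute used here is an assumption for conversational English, not a figure from the announcement.

```python
def expected_errors(word_error_rate: float, words_per_minute: float, minutes: float) -> float:
    """Expected number of wrongly transcribed words for a given duration."""
    return word_error_rate * words_per_minute * minutes


WER = 0.03              # the 3% figure discussed above
WORDS_PER_MINUTE = 150  # assumed speaking rate

print(f"per minute: {expected_errors(WER, WORDS_PER_MINUTE, 1):.1f} words")
print(f"per hour:   {expected_errors(WER, WORDS_PER_MINUTE, 60):.0f} words")
```

At 150 words per minute this gives about 4.5 errors per minute and about 270 per hour; the rounder figures of 5 and 300 correspond to a slightly faster rate of roughly 167 words per minute.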

cootsnuck · 2026-02-04T17:46:47.000Z 1770227207

The error rate for human transcription can be as high as 5%.

XCSme · 2026-02-04T17:53:03.000Z 1770227583

Oh wow, I thought humans are like 0.1% error rate, if they are native speakers and aware of the subject being discussed.

zipy124 · 2026-02-04T19:01:47.000Z 1770231707

I was skeptical upon hearing the figure but various sources do indeed back it up, and [0] is a pretty interesting paper (old but still relevant: human transcribers haven't changed in accuracy).

[0] https://www.microsoft.com/en-us/research/wp-content/uploads/...

antirez · 2026-02-04T16:15:03.000Z 1770221703

Italian represents, I believe, the most phonetically advanced human language. It has the right compromise among information density, understandability, and the ability to speak much faster to compensate for the redundancy. It's as if it had error correction built in. Note that it's not just that it has the lower error rate, but it is also underrepresented in most datasets.

nindalf · 2026-02-04T18:02:40.000Z 1770228160

I love seeing people from other countries share their own folk tales about what makes their countries special and unique. I've seen it up close in my country and I always cringed when I heard my fellow countrymen come up with these stories. In my adulthood I'm reassured that it happens everywhere and I find it endearing.

On the information density of languages: it is true that some languages have a more information dense textual representation. But all spoken languages convey about the same information in the same time. Which is not all that surprising, it just means that human brains have an optimal range at which they process information.

Further reading: Coupé, Christophe, et al. "Different Languages, Similar Encoding Efficiency: Comparable Information Rates across the Human Communicative Niche." Science Advances. https://doi.org/10.1126/sciadv.aaw2594

antirez · 2026-02-04T18:16:46.000Z 1770229006

Different representations at the same bitrate may have features that make one a lot more resilient to errors. This thing about Italian, you will find in any benchmark of vastly different AI transcription models. You can find similar results also in the way LLMs mostly trained on English usually generalize very well to Italian. All this despite Italian accounting for a marginal percentage of the training set. How do you explain that? I always cringe when people refute evidence.

testdelacc1 · 2026-02-04T18:42:07.000Z 1770230527

Where is this evidence you’ve cited for your claims?

Archelaos · 2026-02-04T16:36:27.000Z 1770222987

This is largely due to the fact that modern Italian is a systematised language that emerged from a literary movement (whose most prominent representative is Alessandro Manzoni) to establish a uniform language for the Italian people. At the time of Italian unification in 1861, only about 2.5% of the population could speak this language.

gbalduzzi · 2026-02-04T16:51:47.000Z 1770223907

The language itself was not invented for the purpose: it was the language spoken in Florence, then adopted by the literary movement and then selected as the national language.

It seems like the best tradeoff between information density and understandability actually comes from the deep Latin roots of the language

gbalduzzi · 2026-02-04T16:45:27.000Z 1770223527

I was honestly surprised to find it in first place, because I assumed English would be there given the simpler grammar and the huge dataset available.

I agree with your belief, other languages have either lower density (e.g. German) or lower understandability (e.g. English)

riffraff · 2026-02-04T16:58:53.000Z 1770224333

English has a ton of homophones, way more sounds that differ slightly (long/short vowels), and major pronunciation differences across major "official" languages (think Australia/US/Canada/UK).

Italian has one official Italian (two, if you count IT_ch, but the difference is minor), doesn't pay much attention to stress and vowel length, and only has a few "confusable" sounds (gl/l, gn/n, double consonants, stuff you get wrong in primary school). Italian dialects would be a disaster tho :)

hackyhacky · 2026-02-04T17:53:40.000Z 1770227620

> the most phonetically advanced human language

That's interesting. As a linguist, I have to say that Haskell is the most computationally advanced programming language, having the best balance of clear syntax and expressiveness. I am qualified to say this because I once used Haskell to make a web site, and I also tried C++ but I kept on getting errors.

/s obviously.

Tldr: computer scientists feel unjustifiably entitled to make scientific-sounding but meaningless pronouncements on topics outside their field of expertise.

NewsaHackO · 2026-02-04T16:49:50.000Z 1770223790

The only knowledge I have about how difficult Italian is comes from Inglourious Basterds.

mmooss · 2026-02-04T17:26:22.000Z 1770225982

At least some relatively well-known research finds that all languages have similar information density in terms of bits/second (~39 bits/second based on a quick search). Languages do it with different amounts of phonetic sound / syllables / words per bit and per second, but the bps comes out the same.

I don't know how widely accepted that conclusion is, what exceptions there may be, etc.

blobinabottle · 2026-02-04T19:07:25.000Z 1770232045

Impressive results, tested on crappy audio files (in French and English)...

serf · 2026-02-04T15:53:44.000Z 1770220424

things I hate:

"Click me to try now!" banners that lead to a warning screen that says "Oh, only paying members, whoops!"

So, you don't mean 'try this out', you mean 'buy this product'.

Let's not act like it's a free sampler.

I can't comment on the model: I'm not giving them money.

ReadEvalPost · 2026-02-04T15:57:57.000Z 1770220677

You can try it on HF: https://huggingface.co/spaces/mistralai/Voxtral-Mini-Realtim...

boobsbr · 2026-02-04T16:21:29.000Z 1770222089

I'm impressed.

siddbudd · 2026-02-04T17:48:38.000Z 1770227318

Wired advertises this as "Ultra-Fast Translation"[^1]. A bit weird coming from a tech magazine. I hope it's just a "typo".

[^1]: https://www.wired.com/story/mistral-voxtral-real-time-ai-tra...

bigyabai · 2026-02-04T17:50:44.000Z 1770227444

It might be capable of translation; OpenAI Whisper was a transcription model that could do it.

Archelaos · 2026-02-04T16:12:28.000Z 1770221548

As a rule of thumb for software that I use regularly, it is very useful to consider the costs over a 10-year period in order to compare it with software that I purchase with a lifetime license to install at home. So that means $1,798.80 for the Pro version.

What estimates do others use?
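The 10-year figure can be unpacked with a few lines of arithmetic. The monthly price below is inferred by dividing the total by 120 months and is not stated anywhere in the thread; the one-time price of 300 in the usage line is purely hypothetical, and `breakeven_months` is an illustrative helper.

```python
from decimal import Decimal

TEN_YEAR_TOTAL = Decimal("1798.80")  # figure from the comment above
MONTHS = 10 * 12

# Inferred, not stated in the thread: the monthly price implied by the total.
implied_monthly = TEN_YEAR_TOTAL / MONTHS


def breakeven_months(one_time_price: Decimal, monthly_price: Decimal) -> int:
    """Whole months after which a subscription has cost more than a one-time purchase."""
    months = 0
    while monthly_price * months <= one_time_price:
        months += 1
    return months


print(implied_monthly)
# Hypothetical lifetime license price of 300, for illustration only.
print(breakeven_months(Decimal("300"), implied_monthly))
```

Using `Decimal` keeps the currency arithmetic exact, which avoids the rounding surprises of binary floats when comparing totals.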

yewenjie · 2026-02-04T17:55:04.000Z 1770227704

One week ago I was on the hunt for an open source model that can do diarization and I had to literally give up because I could not find any easy-to-use setup.

ashenke · 2026-02-04T19:06:19.000Z 1770231979

I don't know if that will change, but right now only the Voxtral Mini Transcribe V2 supports diarization and it's not open-weight. The Voxtral Realtime model doesn't support diarization, but is open-weight.

vojto11 · 2026-02-04T18:46:53.000Z 1770230813

WhisperX ?

jszymborski · 2026-02-04T18:14:19.000Z 1770228859

I'm guessing I won't be able to finetune this until they come out with a HF transformers model, right?

derac · 2026-02-04T18:02:31.000Z 1770228151

Any chance Voxtral Mini Transcribe 2 will ever be an open model?

ewuhic · 2026-02-04T17:52:18.000Z 1770227538

Can it translate in real time?

boringg · 2026-02-04T16:52:58.000Z 1770223978

Pseudo related -- am I the only one uncomfortable using my voice with AI, out of concern that once it is in the training data it is forever reproducible? As a non-public person it seems like a risk vector (albeit small).

ffsm8 · 2026-02-04T18:03:33.000Z 1770228213

It's a real issue, but why do you only see it in AI? It's true for any case where you're speaking into a microphone.

Depending on the permissions granted to apps on your mobile device, it can even be passively exfiltrated without you ever noticing - and that's ignoring the video clips people take and put online. Like your grandma uploading to Facebook a short moment from a Christmas meet or similar

There have already been successful scams - eg calls from "relatives" (AI) calling family members needing money urgently and convincing them to send the money...

dumpstate · 2026-02-04T17:31:35.000Z 1770226295

I'm on voxtral-mini-latest and that's why I started seeing 500s today lol

BrunoJo · 2026-02-04T18:27:33.000Z 1770229653

If you are looking for an easy transcription API you may want to check out https://lemonfox.ai/. It's powered by Whisper but we are planning to support more models.