Built this because tones are killing my spoken Mandarin and I can't reliably hear my own mistakes.
It's a 9M-parameter Conformer-CTC model trained on ~300h of audio (AISHELL + Primewords), quantized to INT8 (11 MB), and it runs 100% in-browser via ONNX Runtime Web.
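For anyone curious, the in-browser path is roughly this shape with onnxruntime-web (the file name, tensor names, and feature layout below are illustrative, not the real ones):

```typescript
import * as ort from "onnxruntime-web";

// Illustrative names only -- the actual model file / tensor names differ.
const session = await ort.InferenceSession.create("conformer_ctc_int8.onnx", {
  executionProviders: ["wasm"], // CPU via WebAssembly; runs fully client-side
});

// Log-mel features from the mic pipeline, shape [batch, frames, mels] (assumed).
const frames = 200;
const feats = new Float32Array(frames * 80);
const input = new ort.Tensor("float32", feats, [1, frames, 80]);

// One forward pass yields per-frame CTC log-posteriors to align against.
const outputs = await session.run({ features: input });
const logProbs = outputs["log_probs"];
```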
Grades per-syllable pronunciation + tones with Viterbi forced alignment.
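The alignment is the standard CTC trick: interleave blanks between the target syllable tokens, then run Viterbi over the per-frame posteriors and group the resulting frame labels into syllable spans. A simplified sketch (the real code needs typed arrays, batching, etc.):

```typescript
// Minimal CTC forced alignment via Viterbi (a sketch, not the shipped code).
// logProbs[t][k] = log P(token k at frame t); blankId is the CTC blank.
function ctcAlign(logProbs: number[][], tokens: number[], blankId: number): number[] {
  // Interleave blanks: [blank, t0, blank, t1, ..., blank]
  const states: number[] = [blankId];
  for (const tok of tokens) states.push(tok, blankId);

  const T = logProbs.length;
  const S = states.length;
  const NEG = -Infinity;
  const dp: number[][] = Array.from({ length: T }, () => new Array(S).fill(NEG));
  const back: number[][] = Array.from({ length: T }, () => new Array(S).fill(0));

  dp[0][0] = logProbs[0][states[0]];
  if (S > 1) dp[0][1] = logProbs[0][states[1]];

  for (let t = 1; t < T; t++) {
    for (let s = 0; s < S; s++) {
      // Allowed predecessors: stay, step from s-1, or skip a blank from s-2
      // (skipping is only legal between two different non-blank tokens).
      let best = dp[t - 1][s], from = s;
      if (s >= 1 && dp[t - 1][s - 1] > best) { best = dp[t - 1][s - 1]; from = s - 1; }
      const canSkip = s >= 2 && states[s] !== blankId && states[s] !== states[s - 2];
      if (canSkip && dp[t - 1][s - 2] > best) { best = dp[t - 1][s - 2]; from = s - 2; }
      if (best === NEG) continue;
      dp[t][s] = best + logProbs[t][states[s]];
      back[t][s] = from;
    }
  }

  // End in the final blank or final token, whichever scores higher; backtrace.
  let s = S - 1;
  if (S > 1 && dp[T - 1][S - 2] > dp[T - 1][S - 1]) s = S - 2;
  const path = new Array<number>(T);
  for (let t = T - 1; t >= 0; t--) { path[t] = states[s]; s = back[t][s]; }
  return path; // per-frame token ids; group non-blank runs into syllable spans
}
```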
Try it here: https://simedw.com/projects/ear/
This is a great initiative and I hope to see more come out of it; I'm not criticizing, just sharing my user experience so you have some data points.
In short, my experience lines up with your native speakers'.
I found that it loses track of phonemes when I speak quickly, and the tones don't seem to line up at normal conversational speed.
For example, if I say 他是我的朋友 at normal conversational speed, it assigns `de` to 我, and sometimes it decides I dropped the retroflex in `shi` and renders it as `si`. I listened back to make sure I said everything; the phonemes are there in the recording, but the UI displays the wrong phonemes and tones.
By contrast, if I speak slowly and really push each tone, the phonemes and tones all register correctly.
Also, is this taking tone sandhi into account? For example, a third tone (the dipping tone) tends to turn into a second tone (rising) when multiple third tones are spoken in a row, and sometimes the first tone slightly influences the next one, etc.
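To illustrate the third-tone case: even a naive rewrite of the expected tones before grading would catch it (just a sketch of the idea, not your app's logic):

```typescript
// Rewrite expected tones for third-tone sandhi before grading.
// "3 3" surfaces as "2 3": 你好 ni3 hao3 is pronounced ni2 hao3.
function applyThirdToneSandhi(tones: number[]): number[] {
  const out = [...tones];
  for (let i = 0; i < tones.length - 1; i++) {
    // Compare against the original tones so a run like 3 3 3 becomes 2 2 3
    // (a common simplification; long runs really depend on prosodic grouping).
    if (tones[i] === 3 && tones[i + 1] === 3) out[i] = 2;
  }
  return out;
}

// applyThirdToneSandhi([3, 3])    -> [2, 3]
// applyThirdToneSandhi([3, 3, 3]) -> [2, 2, 3]
```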
Again, great initiative, but I think it needs a way to handle conversational-speed speech, which by its nature is often a bit slurred.