Show HN: Three new Kitten TTS models – smallest less than 25MB (github.com/KittenML)
534 points by rohan_joshi 1 day ago | 178 comments
Kitten TTS (https://github.com/KittenML/KittenTTS) is an open-source series of tiny and expressive text-to-speech models for on-device applications. We had a thread last year here: https://news.ycombinator.com/item?id=44807868.

Today we're releasing three new models with 80M, 40M and 14M parameters.

The largest model (80M) has the highest quality. The 14M variant reaches a new SOTA in expressivity among similar-sized models, despite being <25MB in size. This release is a major upgrade from the previous one and supports English text-to-speech applications in eight voices: four male and four female.

Here's a short demo: https://www.youtube.com/watch?v=ge3u5qblqZA.

Most models are quantized to int8 + fp16, and they use ONNX for the runtime. Our models are designed to run anywhere, e.g. Raspberry Pi, low-end smartphones, wearables, browsers, etc. No GPU required! This release aims to bridge the gap between on-device and cloud models for TTS applications. A multilingual model release is coming soon.
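For a sense of what running one of these looks like, here is a minimal sketch using onnxruntime on CPU (the file name is a placeholder; inspect the real input signature before wiring anything up):

  import onnxruntime as ort

  # CPU-only session: no CUDA and no torch needed at inference time.
  sess = ort.InferenceSession("kitten_tts_nano.onnx",
                              providers=["CPUExecutionProvider"])

  # Input/output names vary per model, so inspect them first.
  for inp in sess.get_inputs():
      print(inp.name, inp.shape, inp.type)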

On-device AI is bottlenecked by one thing: a lack of tiny models that actually perform. Our goal is to open-source more models to run production-ready voice agents and apps entirely on-device.

We would love your feedback!



I created a CLI wrapper for Kitten TTS: https://github.com/newptcai/purr

BTW, it seems that kitten (the Python package) has the following chain of dependencies: kittentts → misaki[en] → spacy-curated-transformers

So if you install it directly via uv, it will pull torch and NVIDIA CUDA packages (several GB), which are not needed to run kitten.


Thanks, your install script worked for me.

In case it helps anyone else, the first time I tried to run purr I got "OSError: PortAudio library not found". Installing libportaudio (apt install libportaudio2) got it running.


I also made a CLI. I had to fork the project and remove one unused import, which allowed me to drop a lot of unused ML libraries: https://github.com/Mic92/puss-say

Unfortunately upstream never looks at any pull requests.

Thank you so much, that fixes an enormous pain point I was hitting. It's not just the size, that dependency chain was actually breaking on my machine and failing to install. Are we losing something by dropping the extra dependencies?

I don't think so. It is perhaps a bug to have this unnecessary dependency. I expect the author of kitten to fix this soon.

thanks a lot for helping w this. yes i'll fix this asap.

Please let me know when this has been fixed. I will update purr to make the installation steps simpler.

You might also like CopySpeak, a lightweight tool I've recently built for quick AI text-to-speech using the clipboard, featuring Kitten TTS and other engines.

https://github.com/ilyaizen/CopySpeak


What I love about OpenClaw is that I was able to send it a message on Discord with just this github URL and it started sending me voice messages using it within a few minutes. It also gave me a bunch of different benchmarks and sample audio.

I'm impressed with the quality given the size. I don't love the voices, but it's not bad. Running on an Intel 9700 CPU, it's about 1.5x realtime using the 80M model. It wasn't any faster running on a 3080 GPU though.


yeah we'll add some more professional-sounding voices and also support for diy custom voices. we tried to add more anime/cartoon-ish voices to showcase the expressivity.

Regarding running on the 3080 gpu, can you share more details on github issues, discord or email? it should be blazing fast on that. i'll add an example to run the model on gpu too.
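until that lands, the standard onnxruntime provider-fallback pattern should work (needs the onnxruntime-gpu package; the file name below is a placeholder):

  import onnxruntime as ort

  # Prefer CUDA, fall back to CPU if the GPU provider isn't available.
  sess = ort.InferenceSession(
      "kitten_tts_80m.onnx",
      providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
  )
  print(sess.get_providers())  # shows which provider actually loaded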


I wonder if it's possible to guide the intonation in any way.

Oh that is a good use case. Don't connect to email and all that insecure stuff. But as a sandbox for "try this out and deploy a demo". Got me thinking!

I'm jealous. It took me far longer and much more frustration to get it to run.

Had to get the right Python version and make sure it didn't break anything with the previous Python version. A friend suggested using Docker, so I started down that path until I realized I'd probably have to set the whole thing up there myself. Eventually got it to run and I think I didn't break anything else.

I hate Python so much.


Nowadays these frustrations shouldn't be a thing any more. If the author used uv, the script would be able to install its own dependencies and just work.
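For example, with PEP 723 inline metadata, `uv run script.py` creates an isolated environment and installs everything on the fly. A sketch, using the wheel URL from the release page (the Python pin matches the wheel's own requirement; soundfile is illustrative):

  # /// script
  # requires-python = ">=3.10,<3.13"
  # dependencies = [
  #     "kittentts @ https://github.com/KittenML/KittenTTS/releases/download/0.8.1/kittentts-0.8.1-py3-none-any.whl",
  #     "soundfile",
  # ]
  # ///
  # `uv run this_script.py` resolves and installs the dependencies above
  # into a throwaway environment automatically; no manual venv dance.
  import kittentts  # importable because uv installed it from the metadata
  print("environment ready")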

yeah let me add uv and conda support to make it easier.

Thanks! I asked my bot to make me a plugin for it and it one-shotted it, the resulting script was ~20 lines, very nice!

One of the most responsive developers I’ve ever seen, kudos

Why don't you use some kind of environment, Conda or something like that?

I used uv, which should have generated a stable environment. No dice. There's a bug in spaCy.

I suspect success is highly variable on macOS vs. Linux; the spaCy bug only appears on newer Pythons (3.14 or later), which Linux will have.


thanks for pointing these errors out. we're looking into this and will help fix this.

Even the built-in venv would've solved most of his issues too. But I agree with him that Python documentation could be better, or that there should be a more unified system in place. I feel like every other how-to doc I read on setting up something in Python uses a different environment-containment product.

Conda was fantastic up to some point last year, and since then I've had quite a few unresolvable version issues with it. It is really annoying, especially when you're tying multiple things together and each requires its own set of mutually exclusive specific versions of libraries. The latest like that was GNU Radio and some out-of-tree stuff at the same time as a Bluetooth library. High drama. I eventually gave up, rewrote the whole thing in a different language, and it took less time than I had spent trying to get the Python solution duct-taped together.

I should learn to give up quicker.


Because I need a new version of python very rarely (years go by). I don't remember all the arcane incantations to set everything up.

I did eventually do that though, and I'm pretty sure I had to mess about with installing and uninstalling torch.

I dread using anything made in Python because of this. It's always annoying and never just works (when the version of Python is incompatible; otherwise it's fine).


I don't know, I'm pretty happy with Conda. I just create a new environment and install on it. It normally works.

Even if you have to install using pip, it just affects the active environment.

Maybe I'm only trying simple things.


damnn, really sorry for the inconv, looks like some folks are having bad env issues. we're working on fixing this.

It's absolutely not your fault. It's a skill issue and compatibility issue on my end and/or python. You guys are doing amazing.

Two words: Nix Flakes.

I created a demo running in the browser, on your device: https://next-voice.vercel.app

Was playing around a bit and for its size it's very impressive. Just has issues pronounciating numbers. I tried to let it generate "Startup finished in 135 ms."

I didn't expect it to pronounciate 'ms' correctly, but the number sounded just like noise. Eventually I got an acceptable result for the string "Startup finished in one hundred and thirty five seconds."


yeah we're fixing this at the model level too. but in the meantime, there is a way to add text preprocessing for you, and if you have a special use-case, claude code should be able to one-shot custom preprocessing. it's the way that most existing tts models (including sota cloud ones) deal w numbers and units, they just convert them into words.

thanks a lot for trying it and giving feedback. custom preprocessing will fix this for 95% of use-cases. and as i mentioned, this will be fixed at the model level in the next release.
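for illustration, preprocessing along those lines can be as simple as this sketch (num2words is already in the dependency chain; the unit map is just an example, not part of the package):

  import re
  from num2words import num2words

  # Tiny illustrative unit map; extend it for your domain.
  UNITS = {"ms": "milliseconds", "s": "seconds", "GB": "gigabytes"}

  def expand_numbers(text: str) -> str:
      # "135 ms" -> "one hundred and thirty-five milliseconds"
      def repl(m):
          words = num2words(int(m.group(1)))
          unit = UNITS.get(m.group(2), m.group(2) or "")
          return f"{words} {unit}".strip()
      return re.sub(r"(\d+)\s*(ms|s|GB)?\b", repl, text)

  print(expand_numbers("Startup finished in 135 ms."))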

I tried it with some "hard mode" text:

The above SECDED check-bit encoding can be implemented in a similar way, but since it uses only three-bit patterns, mapping syndromes to correction masks can be done with three-input AND gates.

It sounded quite good indeed for the normal English stuff, but I guess predictably was quite bad at the domain-specific words. It misspoke "SECDED", had wrong emphasis on "syndromes", and pronounced "AND gates" like "and gates".

Could you give some example of what kind of preprocessing would help in this case? I tried some local LLMs, but they didn't do a good job (maybe my prompts sucked).


> pronounciating

I'm not sure if you're misspelling it deliberately or not, but the word you're looking for is "pronounce" and its verb form "pronouncing", as in "It just has issues pronouncing numbers" and "I didn't expect it to pronounce 'ms' correctly."


He mixed pronounce with enunciate. It's an understandable mistake IMO. (English also has annunciate. Truly a cursed language in many respects.)

https://en.wiktionary.org/wiki/enunciate#English


A very clear improvement from the first set of models you released some time ago. I'm really impressed. Thanks for sharing it all.

thanks a lot. yeah these models are way better than our previous launch. our 15M model now is better than our previous 80M model and we expect to continue seeing this rate of improvement.

Very cool :) Look forward to trying it out

Maybe a dumb and slightly tangential question (I don't mean this as a criticism!), but why not release a command-line executable?

Even the API looks like what you'd see in a manpage.

I get it wouldn't be too much work for a user to actually make something like that, I'm just curious what the thought process is


great idea, we'll do that too. we just decided to launch the onnx models first and get some feedback. we'll be simplifying the process of running it everywhere, including a command line executable.

You should put up examples comparing the models you released - same text spoken by each.

great idea, let me add this. meanwhile, you can try the models on our huggingface spaces demo here: https://huggingface.co/spaces/KittenML/KittenTTS-Demo

I'd love to see a monolingual Japanese model sometime in the future. Qwen3-tts works for Japanese in general, but from time to time it will mix in some Mandarin, making it unusable.

our next model (eta 3ish weeks) will support Japanese. would love to get your feedback then on how the quality is. can you share what use-case you want? would love to support it.

I have a pipeline of jp epub>m4b, just need to swap tts models in between :)

You could try a preprocessing step where you convert to hiragana, but I guess that would lose pitch accent information (e.g. 飴 vs 雨)
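A sketch with pykakasi, purely as an illustration (it's not a KittenTTS dependency), showing exactly that loss:

  # Hiragana-conversion preprocessing; flattens pitch accent, so
  # 飴 and 雨 both come out as あめ.
  import pykakasi

  kks = pykakasi.kakasi()
  for item in kks.convert("明日は雨"):
      print(item["orig"], "->", item["hira"])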

Exactly. Qwen only has one pitch accent for pure-hiragana words. Even though it actually works (removing the mixed-in Mandarin), it takes real effort to normalize the text to disambiguate heteronyms, and the result is (if you use voice cloning) your favorite CV speaking in some weird, unknown accent :)

That got me wondering if "you convert to hiragana" is a solved task, or a research team and five years[0], and Google showed me an article[1] that gave me a facepalm. Quoting from Google Translate (square brackets are mine):

  > - As a result,
  >   - When the string "明日["tomorrow"]" is entered into TTS, the TTS model [・皿・] outputs an ambiguous pronunciation that sounds like a mix of "asu" and "ashita" (something like "[asyeta]").

  > From this, we found that by using the proposed method, it is possible to obtain data from private data in which the consistency between speech, graphemes, and phonemes is almost certainly maintained for more than 80% of the total.

  > Another possible cause is a mismatch between the domain of the training data's audio (all [in read-aloud tones]) and the inference domain.
My resultant rambling follows:

  1. Sounds like the general state of Japanese speech datasets is a mess
    1.1. they don't maintain a great, useful correspondence between symbols and audio
    1.2. they tend to contain too many "transatlantic" voices and not enough casual speech
  2. Japanese speakers generally don't denote pronunciations for text
    2.1. therefore web crawls might not contain enough information as to how words are actually pronounced
    2.2. (potentially) there could be some texts that don't map to pronunciations
    2.3. (potentially) maybe spoken and written Japanese are still a bit divergent from each other
  3. The situation for Chinese/Sinitic languages is likely __nowhere__ near as absurd, so Chinese STT/TTS might not be well equipped to deal with this mess
  4. This feels like a much deeper mess than the commonly observed "a cloud in a sky" Japanese TTS problems, such as obvious basic alignment errors (e.g. pronouncing "potatoes" as "tato chi")
---

  0: https://xkcd.com/1425/
  1: https://zenn.dev/parakeet_tech/articles/2591e71094ea58
  2: https://qiita.com/maishikawa/items/dcadfeebf693080f0415

Good on-device TTS is an amazing accessibility tool. Thank you for building this. Way too many devices that use it rely on online services; this is much preferred.

thanks a lot for the feedback. glad you liked it. we're gonna be launching more tiny models across use-cases.

They sound like cartoon voices... but I really like them. I could listen to a book with those.

yeah we tried to include those voices in this release to showcase the expressivity. but we've already started adding more professional sounding voices for prod use-cases.

Yeah, I was wondering if it's all helium voices. I should maybe try it and see, or find more demos.

I ran the install instructions and it took 7.1GB of deps, tf you mean "tiny"?

damnn, lemme fix it, sorry for that. we may have forgotten to remove the redundant dependencies. i'll comment here once i push the change. thanks a lot for trying it and giving feedback.

It's mostly torch, I think. It pulls in NVIDIA libs (which … makes sense, I guess), and NVIDIA is just not at all judicious when it comes to disk space. I literally ran out of disk trying to install this on Linux.

On macOS, it's a markedly different experience: it's only ~700 MiB there; I'm assuming b/c no NVIDIA libs get pulled in, b/c why would they.

For anyone who might want to play around with this: I can get down to ~3 GiB (& about 1.3 GiB if you wipe your uv cache afterwards) on Linux if I add the following to the end of `pyproject.toml`:

  [tool.uv.sources]
  # This tells uv to use the specific index for torch, torchvision, and torchaudio
  torch = [
      {index = "pytorch-cpu"}
  ]
  torchvision = [
      {index = "pytorch-cpu"}
  ]
  torchaudio = [
      {index = "pytorch-cpu"}
  ]
  
  [[tool.uv.index]]
  name = "pytorch-cpu"
  url = "https://download.pytorch.org/whl/cpu"
& add "torch" to the direct dependencies, b/c otherwise it seems like uv is ignoring the source? (… which of course downloads a CPU-only torch.)

This is an example of what one sees under Linux:

  nvidia-nvjitlink-cu12      ------------------------------ 23.83 MiB/37.44 MiB
  nvidia-curand-cu12         ------------------------------ 23.79 MiB/60.67 MiB
  nvidia-cuda-nvrtc-cu12     ------------------------------ 23.87 MiB/83.96 MiB
  nvidia-nvshmem-cu12        ------------------------------ 23.62 MiB/132.66 MiB
  triton                     ------------------------------ 23.82 MiB/179.55 MiB
  nvidia-cufft-cu12          ------------------------------ 23.76 MiB/184.17 MiB
  nvidia-cusolver-cu12       ------------------------------ 23.84 MiB/255.11 MiB
  nvidia-cusparselt-cu12     ------------------------------ 23.99 MiB/273.89 MiB
  nvidia-cusparse-cu12       ------------------------------ 23.96 MiB/274.86 MiB
  nvidia-nccl-cu12           ------------------------------ 23.79 MiB/307.42 MiB
  nvidia-cublas-cu12         ------------------------------ 23.73 MiB/566.81 MiB
  nvidia-cudnn-cu12          ------------------------------ 23.56 MiB/674.02 MiB
  torch                      ------------------------------ 23.75 MiB/873.22 MiB
That's not all the libraries, either, but you can see NVIDIA here is easily over 1 GiB.

It also then crashes for me, with:

  File "KittenTTS/.venv/lib/python3.14/site-packages/pydantic/v1/fields.py", line 576, in _set_default_and_type
    raise errors_.ConfigError(f'unable to infer type for attribute "{self.name}"')
  pydantic.v1.errors.ConfigError: unable to infer type for attribute "REGEX"
Which seems to be this bug in spaCy (https://github.com/explosion/spaCy/issues/13895), so I'm going to have to try adding `<3.14` to `requires-python` in `pyproject.toml` too I think. That is, for anyone wanting to try this out:

  -requires-python = ">=3.8"
  +requires-python = ">=3.8,<3.14"
(This isn't really something KittenTTS should have to do, since this is a bug in spaCy … and ideally, at some point, spaCy will fix it.)

Also:

  + curated-tokenizers==0.0.9
This version is so utterly ancient that there aren't wheels for it anymore, so that means a loooong wait while this builds. It's pulled in via misaki, and my editor says your one import of misaki is unused.

Hilariously, removing it breaks things, but only on my macOS machine. I think you're using it solely for the side-effect that it tweaks phonemizer to use espeakng, but you can just do that tweak yourself, & then I think that dependency can be dropped. That drops a good number of dependencies & really speeds up the installation since we're not compiling a bunch of stuff.

You need to add `phonemizer-fork` to your dependencies. (If you remove misaki, you'll find this missing.)
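Concretely, the tweak looks something like this (espeakng_loader is already a dependency; the exact calls are my reading of its README, so treat them as an assumption):

  import espeakng_loader
  from phonemizer.backend.espeak.wrapper import EspeakWrapper

  # The side-effect misaki performs: point phonemizer at the espeak-ng
  # library and data that espeakng_loader bundles.
  EspeakWrapper.set_library(espeakng_loader.get_library_path())
  EspeakWrapper.set_data_path(espeakng_loader.get_data_path())

  from phonemizer import phonemize
  print(phonemize("Hello world", language="en-us", backend="espeak"))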


thanks a lot for sharing this, its v helpful for fixing the env issues. we'll fix all of them by the weekend.

a classic "how to draw an owl" lol :)

The size/quality tradeoff here is interesting. 25MB for a TTS model that's usable is a real achievement, but the practical bottleneck for most edge deployments isn't model size -- it's the inference latency on low-power hardware and the audio streaming architecture around it. Curious how this performs on something like a Raspberry Pi 4 for real-time synthesis. The voice quality tradeoff at that size usually shows up most in prosody and sentence-final intonation rather than phoneme accuracy.
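One way to answer that empirically is to measure the real-time factor on the target board. A sketch (the KittenTTS class and method names are assumptions based on the repo's example, and 24 kHz is a guessed sample rate):

  import time
  from kittentts import KittenTTS  # API names assumed from the repo example

  m = KittenTTS("KittenML/kitten-tts-nano")  # placeholder model id
  t0 = time.perf_counter()
  audio = m.generate("The quick brown fox jumps over the lazy dog.")
  elapsed = time.perf_counter() - t0

  rtf = elapsed / (len(audio) / 24000)  # sample rate is an assumption
  print(f"RTF: {rtf:.2f} (below 1.0 means faster than real time)")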

One of the core features I look for is expressive control.

Either in the form of API pitch/speed/volume parameters, for more deterministic control.

Or in expressive tags such as [coughs], [urgently], or [laughs in melodic ascending and descending arpeggiated gibberish babbles].

the 25MB model is amazingly good for being 25MB. How does it handle expressive tags?


thank you so much. Right now, it cannot handle expressive tags. what kind of tags would be most helpful according to you?

Narrowing it down: emotion-based tag control would be the most helpful. Tags like [sarcastically] [happily] [joyfully] [fearfully], so a subsection of adverbs.

A stretch goal is 'arbitrary tags' from [singing] [sung to the tune of {x}] [pausing for emphasis] [slowly decreasing speed for emphasis] [emphasizing the object of this sentence] [clapping] [car crash in the distance] [laser's pew pew].

But yeah: instruction/control via [tags] is the deciding feature for me, provided prompt adherence is strong enough.

Also: a thought...

Everyone is using [] for different kinds of tags in this space, which is very simple. Maybe it makes sense to differentiate kinds of tags? I.e. [tags for modifying how text is spoken] vs {tags for creating sounds that aren't specifically speech: not modifying anything, but instead their own 'sound/word'}
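As a toy illustration of that split, a front-end could treat the two bracket kinds as different event types (pure sketch; no engine supports this syntax today):

  import re

  # [..] = modifier applied to the following speech; {..} = standalone sound.
  TOKEN = re.compile(r"\[([^\]]+)\]|\{([^}]+)\}|([^\[{]+)")

  def parse(text):
      for mod, sound, speech in TOKEN.findall(text):
          kind = "modifier" if mod else ("sound" if sound else "speech")
          yield kind, (mod or sound or speech).strip()

  for kind, value in parse("[happily] hello there {door slams} bye"):
      print(kind, ":", value)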


yeah i think to start with, narrowing it down to a few tags would be most helpful and we'll probably start w that first. Thanks a lot!

Intonation (frequency rise/fall) would offer a lot of versatility.

not OP, but something like [<intention>] where intention might be something like anger, curiosity, etc. [long pause], [gasp], [laughter], stuff like that.

To the folks and the Kitten team: I'm working on TTS as a problem statement (for an application), and wondering which model is best on latency/cost for inference. I'm currently settling for Gemini TTS, which allows for a lot of expressiveness, but at 150ms a word it starts to hurt when the content is a few sentences.

my current best approach is wrapping around gemini-flash native and having the model speak the text i send it, which gets me end-to-end latency under a second.

are there other models at this or better pricing i can be looking at?


There are a number of recent, good-quality small TTS models.

If the author doesn't describe some details about the data, training, a novel architecture, etc., I can only assume they took another one, did a little finetuning, and repackaged it as a new product.


Any recommendations?

Depends how small or complex you want a TTS; flite + flitevox voice packages worked on a Pi or Zynq ARM CPU just fine. =3

Also:

https://github.com/sparkaudio/spark-tts


The Github readme doesn't list this: what data was this trained on? Was it the creators' own voices, or data scraped from the internet or other archives?

Great stuff. Is your team interested in the STT problem?

Yes, we've started working on it and will have a range of stt models v soon. lmk if you have a prod use-case in mind?

Many of my use cases are similar to those of Robert J. P. Oberg (ognistik on GitHub).

Perhaps his YouTube channel is worth a watch. This video from four months ago compares various STT tools: https://youtu.be/pKU9CABtnOw

Speaking of apps that would, if I had to guess, love to integrate you:

FluidVoice is incredible and developing quickly. Handy is really hot right now. There's also VoiceInk out there, a solid iOS option.

[ps-not parent commenter]


Thank you for this link.

got it, this helps a lot. thanks!

I'm interested in pushing the envelope a bit on the Raspberry Pi to do personal assistant projects with it. The Pi Zero 2 is a surprisingly powerful little device; it is comparable to a Pi 3B, except it has less RAM.

From my point of view, Parakeet is not very good at formatting the output, so it would be nice if a small model focused on having nicely formatted (and correct) text, not just the lowest WER score. Rewarding the model for inserting logical line breaks, quotation marks, etc.

Home Assistant integration a la Alexa would be awesome

Fingers crossed for a normal-sounding voice this time around. The cute Kitten voices are nice, but I want something I can take seriously when I'm listening to an audiobook.

How is the Bruno voice for this one? there will also be another release in ~15-20 days where we have more professional voices. if you'd like to get early access and give feedback lmk, or dm me.

Hey Rohan, love to contribute feedback.

Huge fan of Ava multilingual, and hopefully there are many others with similar taste, so my feedback might shape things in a halfway decent direction, at least for some.

btw, use case is most often to listen to news/articles.


got it, this is very useful. thanks a lot.

Not too bad, actually! Looking forward to hearing some more voices.

No need to DM me, just post on HN or /r/LocalLLama and I'll catch wind of it.

Thanks for your work!


This is awesome, well done. Been doing a lot of work with voice assistants; if you can replicate Qwen3-TTS voice cloning in this small form factor, you will be absolute legends!

thanks a lot, our voice cloning model will be out by May. we're experimenting w some very cool ways of doing voice cloning at 15M but will have a range of models going up to 500M

That's sick, looking forward to it! You have my email in the profile, please let me know when you do!

The example.py file says "it will run blazing fast on any GPU. But this example will run on CPU."

I couldn't locate how to run it on a GPU anywhere in the repo.


thanks for the feedback. i'll add an example of running it on gpu.

How did you make a very small AI model (14M) sound more natural and expressive than even bigger models?

glad you liked it, thank you so much for the kind words. our team is really good at squeezing performance out of small models. we are working on a new launch and hope to release a technical report along with that which includes details. fyi, our current 14M model is better than our previous 80M model. and we expect this trend to continue.

A lot of good small TTS models in recent times. Most seem to struggle hard on prosody though.

Kokoro TTS for example has a very good Norwegian voice, but the rhythm and emphasis are often so out of whack that the generated speech is almost incomprehensible.

Haven't had time to check this model out yet, how does it fare here? What's needed to improve the models in this area now that the voice part is more or less solved?


small models struggle with prosody due to limited capacity. this version does much better than the previous one and is the best among other <25MB models. Kokoro is a really good model for its size, it's competitive on Artificial Analysis too. i think by the next release we should have something kokoro-quality but a fifth of the size. Adding control for rhythm seems to be quite important too, and we should start looking at that for other languages.

Listened to the video examples; they sounded very good, though it wasn't terribly challenging text.

If only I could have that in Norwegian, my SO would be pleased.

Also, I totally misremembered regarding Kokoro TTS. It's good, but not what was butchering Norwegian. I forgot which one I was thinking of; maybe it was the old VITS stuff Rhasspy uses. Points stand: the voice was good, but I could barely understand what was said.


That, and using English words in the middle of a phrase in another language confuses them a lot.

yes. the current release of our model is english-only. so other languages are not expected to perform well. we'll try to look out for this in our multilingual release.

Did they train this on @lauriewired's voice? The demo video sounds exactly like her at 0:18

i can confirm that we did not.

What's the source of that voice in the training data, then? It sounds insanely close to her voice. Strange parallel to OpenAI denying they trained on Scarlett Johansson's voice.

A lot of these models struggle with small text strings, like "next button" that screen readers are going to speak a lot.

I think I've tried everything I could on my Android, and: 1. outside webpage reading, not many options; 2. as browser extensions, also not many (I don't like copying URLs into your app); 3. they all insist on reading every little shit, not only buttons but also "wave arrow pointing directly right", which some people use in their texts. So basically reading text aloud is a bunch of shitty options. Anyone jumping on this market opening?

we'd love to serve this use-case. i'll make a demo for this next week and comment here with it.

How much work would it be to use the C++ ONNX run-time with this instead of Python? Is it a Claudeable amount of work?

The iOS version is Swift-based.


shouldn't be hard. what backend/hardware are you interested in running this with? i'll add an example of using the model with the C++ onnx runtime. btw check out the roadmap, our inference engine will be out in 1-2 weeks and is expected to be faster than onnx.

I want to run it in a website with Wasm and have the browser do the audio playback

desktop CPUs running inference on a single background thread would be the ideal case for what I'm considering.

Would an Android app of this be able to replace the built in tts?

yes, our mobile sdk is coming soon (eta 2 weeks) so we should be able to replace the built-in version of it. can you share what tts use-case you're thinking of?

I use an epub reader like Moon+ with the built-in TTS to turn epubs into audiobooks. I tried Kokoro TTS, but the issue was too much lag between sentences, plus it doesn't preprocess the next sentence while it reads out the current one.

okay this seems pretty doable, i think i know someone who is working on an epub reader using kittentts. if they don't post about it, i'll do it once its done.

Working on a reader and server that use pockettts to turn epubs into audiobooks: https://github.com/gabrielcsapo/compendus shows a virtual scroller for the text and audio.

Nice, but it's weird that neither "language" nor "English" is mentioned on the github page; only from the "Release multilingual TTS" roadmap item could I guess it's probably English-only for now.

I thought they were going to make kitten sounds instead of speech

for that, a 100KB model could be enough ;)

I guess they are Discord kittens?

Thanks for open sourcing this.

Is there any way to do a custom voice as a DIY? Or do we need to go through you? If so, would you consider making a pricing page for purchasing a license/alternative voice? All but one of the voices are unusable in a business context.


thanks a lot for the feedback. yes, we're working on a diy way to add custom voices and will also be releasing a model with more professional voices in the next 2-3 weeks. as of now, we're providing commercial support for custom voices, languages and deployment through the support form on our github. can you share more about your business use-case? if possible, i'd like to ensure the next release can serve that.

Right now it's outgoing calls for a small business client that checks information. Although if they call back they don't mind an automated system, on outgoing calls the person answering will often hang up if they detect AI right away, so we use a realistic custom voice with an accent.

This is a mind-numbing task that requires workers to make hundreds of calls each day with only minor variations, sometimes navigating phone trees, and half the time leaving almost the exact same message.

Anyway, I believe almost all such businesses will be automated within months. Human labour just cannot compete on cost.


I don't like the sound of that. Why do humans always need to spoil new advancements by finding the worst use cases?

Why do you assume it's the worst use case? It's checking important info that has been entered into forms. People lie. Someone has to verify info. It's very tedious and something that obviously should be automated. And it's about 70% automated already.

The legitimate objection people have to AI in this use case is that it can be slow or stupid in a way that wastes time. By acting more humanlike, we signal that we are going to be closer to human level performance.


the dependency chain issue is a real barrier for edge deployment. i've been running tts models on a raspberry pi for a home automation project and anything that pulls torch + cuda makes the whole thing a non-starter. 25MB is genuinely exciting for that use case.

curious about the latency characteristics though. 1.5x realtime on a 9700 is fine for batch processing but for interactive use you need first-chunk latency under 200ms or the conversation feels broken. does anyone know if it supports streaming output or is it full-utterance only?

the phoneme-based approach should help with pronunciation consistency too. the models i've tried that work on raw text tend to mispronounce technical terms unpredictably — same word pronounced differently across runs.


Could you share what you're currently using?

Only American voices? For some reason I'm only interested in Irish, British or Welsh accents. American is a no

minor nit to pick: Welsh accents are British accents as Wales is in Britain. In fact by some definitions it's the most British part.

People from outside the UK often use British as synonymous with English, and in the context of accents, often a South East English accent or some sort of Received Pronunciation (RP) accent. Technically a "British" accent could be from anywhere in England, Scotland, or Wales, and therefore by extension might not even be the English language.

While I'm here, since it's generally confusing, the UK is Great Britain and Northern Ireland. Great Britain is England, Scotland, and Wales.


I am actually English but I'm so used to speaking with international people I instinctively say British instead of English - because that's what people expect.

So being factually correct doesn't really matter. Nobody cares and nobody wants to learn so I adapt for them.

In the same way I almost exclusively write with American spelling now. Life is just easier when you stop fighting.


How long until I can buy this as a chip for my Arduino projects?

not v long. until then you can start running tts on phones, wearables and r pis. at the model level, we'll have a model for these kinds of mcus later this year.

You can (just about?) already run it on a Pi Zero, right? That's not literally a chip, but in practical utility it can't be very different.

A CPU will probably consume much more power.

Found they struggle with numbers. Like, give them a random four-digit number in a sentence and they fumble.

Is this open-source or open-weights ML?

yes, indeed. we are working on adding mit licensed phonemizers too by this weekend, so you'll be able to use these models as you like :)

I think you misunderstood the question. I guess it's only open-weights, not open-source, then.

For some insight into the original question, take a look at the Debian ML policy:

https://salsa.debian.org/deeplearning-team/ml-policy


This would be great as a js package - 25MB is small enough that I think it'd be worth it (in-browser TTS is still pretty bad and varies by browser)

great idea, we're on it. we're also working on a mobile sdk. a browser sdk would be really cool too.

Thanks for working on this!

Is there any way to get these running on an iPhone? I would love the ability for it to read articles to me like a podcast.


yes, we're releasing an official mobile sdk and inference engine very soon. if you want to use something until then, some folks from the oss community have built ways to run kitten on ios. if you search kittentts ios on github you should find a few. if you cant find it, feel free to ping me and i can help you set it up. thanks a lot for your support and feedback!

It is based on onnx, so can I use it with transformers.js in the browser?

Yes, someone already made a web demo for it: https://github.com/clowerweb/kitten-tts-web-demo (7 months ago). WebGPU support was marked experimental there, but Transformers.js v4 (released last month) seems more stable now, with some runtime/perf improvements: https://huggingface.co/blog/transformersjs-v4#performance--r...

Yeah I have a workflow platform that uses v4.

I'm still looking for the "perfect" setup to clone my voice and use it locally to send voice replies in Telegram via openclaw. Does anyone have such a setup?

I want to be my own personal assistant...

EDIT: I can provide it a RTX 3080ti.


You need to provide info on your hardware. Pocket-TTS does cloning on CPU, but for me it randomly outputs something pretty weird-sounding mixed in with ~90% good outputs. So it hasn't been quite stable enough to run without checking the output. But maybe it depends on your voice sample.

Qwen 3 TTS is good for voice cloning but requires GPU of some sort.


Try training a model on Piper; you will need to record a lot of utterances, but the results are pretty great and the output is a fast TTS model.

Isn't it just a matter of training a model on your voice recordings and using that to generate audio clips from text?

Why not just send text replies? You can already do that

Really cool to see innovation in terms of quality of tiny models. Great work!

thanks a lot. small model quality is improving exponentially. This 15M is way better than the 80M model from our previous launch (V0.1).

are there plans to output text alignment?

yes, we just started working on this yesterday haha, great that you mentioned it. once we have it working it'll be out soon.

that would be awesome. I was using pockettts, then I had to run it through Whisper to get accurate alignment. Not super productive for realtime work.

The <25MB figure is what stands out. Been wanting to add TTS to a few Next.js projects for offline/edge scenarios but model sizes have always made it impractical to ship.

At 25MB you can actually bundle it with the app. Going to test whether this works in a Vercel Edge Function context -- if latency is acceptable there it opens up a lot of use cases that currently require a round-trip to a hosted API.


How noticeable is the difference in quality between the 4M model and the 80M model?

What's the actual install size for a working example? Like similar "tiny" projects, do these models actually require installing 1GB+ of dependencies?

Running the example is 3 MiB for the repo, +667 MiB of Python dependencies, +86 MiB of models that will get downloaded from HuggingFace. =756 MiB.

(That's using the example as-is. If you switch it to the smaller model, modify the above with +57 MiB of models from HuggingFace, or =727 MiB.)

So I toyed with this a bit + the Rust library "ort", and ort is only 224M in release (non-debug) mode, and it was pretty simple to run this model with it. (I did not know ort before just now.) I didn't replicate the preprocessing the Python does before running the model, though. (You have to turn the text into an array of floats, essentially; the library is doing text -> phonemes -> tokens; the latter step is straight-forward.)
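(For the curious: a toy Python version of that last step, with an invented id table; the real vocab ships with the model config.)

  # Phonemes -> tokens is just a per-character vocab lookup.
  vocab = {p: i for i, p in enumerate([" ", "ð", "ɪ", "s", "z", "k", "ˈ", "ʔ", "n"])}
  phonemes = "ðɪs ɪz kˈɪʔn"
  print([vocab[p] for p in phonemes])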


So, that was on macOS. It's actually huge on Linux, and I've run out of disk space trying to pull dependencies. It's NVIDIA, who always shows great judgement in their use of disk.

My quick test showed 670M of Python libraries required on top of the model.

I'm thinking of giving "voice" to my virtual pets (think Pokemon, but less than a dozen). The pets are made-up animals based on real animals, like Mouseier from Mouse (something like that). Is this possible?

Tldr: generate a human-like voice based on animal sounds. Anyway, maybe it doesn't make sense.


it'd be an interesting experiment to see what kind of information gets extracted from samples of the pet sounds. it'd be so cool if it could just get the features of the audio and then still be able to reproduce the audio in english lol. we would need a really good "speaker" encoder i think.

Is it English only?

as of now its english only. the training for multilingual model is underway and should be out in April! what languages are you most interested in? Right now, we are providing deployments for custom languages + voices through support form on the github.

Spanish would be great, there's a serious lack of Spanish TTS on Android compared to iOS and the quality is not the best.

spanish model will be out in a matter of weeks.

Great, thanks

Here to suggest the Bengali language! It has the 7th-largest speaker base worldwide but is often ignored by tech companies, sadly.

French, Spanish, German would go a long way.

french, spanish and german models will also be out v soon. these are languages we are working on already. some lower resource languages will take longer.

This is great. Demo looks awesome.

thanks, glad you liked it

So, one thing I noticed, and this could easily be user error, is that if I set the text & voice in the example to:

  text ="""
  Hello world. This is Kitten TTS.
  Look, it's working!
  """

  voice = 'Luna'
On macOS, I get "Kitten TTS", but on Linux, I get "Kit… TTS". Both OSes generate the same phonemes of,

  Phonemes: ðɪs ɪz kˈɪʔn ̩ tˌiːtˌiːˈɛs ,
which makes me really confused as to where it's going off the rails on Linux, since from there it should just be invoking the model.

edit: it really helps to use the same model, facepalm. It's the 80M model, and it happens on both OSes. Wildly, the nano gets it better? I'm going to join the Discord lol.


hey sorry for this issue, i think its a bug in our preprocessing. let me look into it and help fix it. i think you posted this in our discord so lets carry the conversation there.

What's the training data for this?

Sounds like the voice actors from Critical Role but I just came off of watching 48 hours of Campaign 3 so I'm probably imagining things.

sounds amazing! does it stream? or is it so fast you don't need to?

it can support chunk streaming, i'm working on adding it to the repo. should be up by tomorrow.
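in the meantime, sentence-level chunking gets most of the latency win. a sketch (class/method names are assumptions based on the repo's example, and the sample rate is a placeholder):

  import re
  import sounddevice as sd
  from kittentts import KittenTTS  # API names assumed from the repo example

  m = KittenTTS("KittenML/kitten-tts-nano")  # placeholder model id
  text = "First sentence. Second one follows! A third, for good measure?"
  for sentence in re.split(r"(?<=[.!?])\s+", text):
      audio = m.generate(sentence)      # synthesize one sentence at a time
      sd.play(audio, samplerate=24000)  # start playback as soon as it's ready
      sd.wait()                         # a real pipeline would overlap these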

Wow, what an amazing feat. Congratulations!

thank you so much. glad you liked the model.

This is something I've been looking for (the <50MB models in particular). Unfortunately my feedback is as follows:

      Downloading https://github.com/KittenML/KittenTTS/releases/download/0.8.1/kittentts-0.8.1-py3-none-any.whl (22 kB)
    Collecting num2words (from kittentts==0.8.1)
      Using cached num2words-0.5.14-py3-none-any.whl.metadata (13 kB)
    Collecting spacy (from kittentts==0.8.1)
      Using cached spacy-3.8.11-cp314-cp314-win_amd64.whl.metadata (28 kB)
    Collecting espeakng_loader (from kittentts==0.8.1)
      Using cached espeakng_loader-0.2.4-py3-none-win_amd64.whl.metadata (1.3 kB)
    INFO: pip is looking at multiple versions of kittentts to determine which version is compatible with other requirements. This could take a while.
    ERROR: Ignored the following versions that require a different python version: 0.7.10 Requires-Python >=3.8,<3.13; 0.7.11 Requires-Python >=3.8,<3.13; 0.7.12 Requires-Python >=3.8,<3.13; 0.7.13 Requires-Python >=3.8,<3.13; 0.7.14 Requires-Python >=3.8,<3.13; 0.7.15 Requires-Python >=3.8,<3.13; 0.7.16 Requires-Python >=3.8,<3.13; 0.7.17 Requires-Python >=3.8,<3.13; 0.7.5 Requires-Python >=3.8,<3.13; 0.7.6 Requires-Python >=3.8,<3.13; 0.7.7 Requires-Python >=3.8,<3.13; 0.7.8 Requires-Python >=3.8,<3.13; 0.7.9 Requires-Python >=3.8,<3.13; 0.8.0 Requires-Python >=3.8,<3.13; 0.8.1 Requires-Python >=3.8,<3.13; 0.8.2 Requires-Python >=3.8,<3.13; 0.8.3 Requires-Python >=3.8,<3.13; 0.8.4 Requires-Python >=3.8,<3.13; 0.9.0 Requires-Python >=3.8,<3.13; 0.9.2 Requires-Python >=3.8,<3.13; 0.9.3 Requires-Python >=3.8,<3.13; 0.9.4 Requires-Python >=3.8,<3.13; 3.8.3 Requires-Python >=3.9,<3.13; 3.8.5 Requires-Python >=3.9,<3.13; 3.8.6 Requires-Python >=3.9,<3.13; 3.8.7 Requires-Python >=3.9,<3.14; 3.8.8 Requires-Python >=3.9,<3.14; 3.8.9 Requires-Python >=3.9,<3.14
    ERROR: Could not find a version that satisfies the requirement misaki>=0.9.4 (from kittentts) (from versions: 0.1.0, 0.3.0, 0.3.5, 0.3.9, 0.4.0, 0.4.4, 0.4.5, 0.4.6, 0.4.7, 0.4.8, 0.4.9, 0.5.0, 0.5.1, 0.5.2, 0.5.3, 0.5.4, 0.5.5, 0.5.6, 0.5.7, 0.5.8, 0.5.9, 0.6.0, 0.6.1, 0.6.2, 0.6.3, 0.6.4, 0.6.5, 0.6.6, 0.6.7, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.7.4)
    ERROR: No matching distribution found for misaki>=0.9.4

I realize that I can run multiple versions of Python on my system and use venv to manage them (or whatever equivalent is now trendy), but as I near retirement age, all those deep dependency nets required by modern software really depress me. Have you ever tried to build a Node app that hasn't been updated in 18 months? It can't be done. Old man yelling at cloud, I guess shrugs.

this is some env issue sorry for the inconvenience, lemme fix it. can you dm me w your env? discord / github / mail / anywhere works.

25MB is impressive. What's the tradeoff vs the 80M model — is it mainly voice quality or does it also affect pronunciation accuracy on less common words?

The 80M model is the highest quality while also being quite efficient. it is superior in terms of pronunciation accuracy for less common words, and is also more stable in terms of speed. it's my fav model. i think the 40M is quite similar to the 80M for most use-cases. the 15M is for low-resource cpus, loading in a browser, etc.

The new 15M is way better than the previous 80M model (v0.1). So we're able to predictably improve the quality, which is very encouraging.