Qwen3-TTS Logo
Qwen3-TTS, the best TTS model I've used so far!

I saw a note in my daily AI newsletter about a new TTS model that had just come out. I spent the better part of a day testing its capabilities and attempting some fine-tuning on it.

I’ve had a goal for a long time to add audio versions of my blog posts that people can listen to. But rather than narrate them myself, I wanted the audio to be computer generated. Not only that, but I wanted it to sound like I was actually doing the narrating. So far, none of the models I’ve tried has lived up to the high standards of quality that you’re used to seeing on my blog. So did Qwen3-TTS finally make the grade?

When I looked at the Qwen3-TTS GitHub, I noticed that it has a few ways to generate audio from text. You can use a pre-configured voice and even give it instructions on how to speak. You can also design a voice, clone a voice from a reference audio clip, or fine-tune the model and generate from that.
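I haven’t memorized the repo’s exact API, so take this as a sketch of the shape of those modes rather than real code; the class and method names below are all placeholders for whatever the actual inference code exposes.

```python
# Rough sketch of the two modes I cared about. NOTE: Qwen3TTSModel,
# generate(), clone(), and save() are placeholder names -- check the
# Qwen3-TTS repo for the real loading/inference API.

model = Qwen3TTSModel.from_pretrained("Qwen/Qwen3-TTS")  # hypothetical loader

# 1) Pre-configured voice, optionally with instructions on how to speak.
audio = model.generate(
    text="Welcome to my blog.",
    voice="preset-voice-name",                 # one of the built-in voices
    instructions="Read this calmly, like a podcast host.",
)

# 2) Voice cloning from a reference recording of me.
audio = model.clone(
    text="Welcome to my blog.",
    reference_audio="me_reading_30s.wav",      # my own recording
)

audio.save("out.wav")  # placeholder; the real API may return raw samples
```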

I initially set up a Kaggle notebook to test out the audio generation capabilities. I had an old dataset from a previous experiment of me reading the beginning of one of my blog posts. The audio quality was pretty bad, but I didn’t want to record anything new just yet. The voice cloning seemed to work pretty well! I decided to see if I could actually fine-tune the model, since that should get me even better results!

When Fine-Tuning Goes Wrong

I ended up using a better mic that I bought for making YouTube videos (which I’ve only used once so far) and recorded some more samples of myself reading from some of my blog posts. This time I used Logic Pro and messed with some plugins to make sure the audio sounded somewhat decent. I ended up recording about 30 files and decided that was probably enough for fine-tuning.
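If you’d rather batch-convert the exported clips to mono WAV at a consistent sample rate instead of fiddling with each one, a quick pydub sketch like this does the trick (the 24 kHz target is just my assumption; match whatever the model actually expects):

```python
from pathlib import Path
from pydub import AudioSegment  # pip install pydub (needs ffmpeg installed)

TARGET_RATE = 24_000  # assumption -- use the sample rate the model expects

Path("prepped").mkdir(exist_ok=True)

for path in Path("recordings").glob("*.wav"):
    clip = (
        AudioSegment.from_file(path)
        .set_channels(1)              # force mono
        .set_frame_rate(TARGET_RATE)  # resample
    )
    clip.export(Path("prepped") / path.name, format="wav")
```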

I used Google Antigravity to make another Kaggle notebook for me and iterated on the code until it actually worked. I ran into out-of-memory issues (the Kaggle GPUs max out at 16GB) and had to use the smaller model. I think somewhere in all of the hacks that Gemini used to get the training to complete, something got messed up, because the trained model would only give me gibberish at inference time.
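I don’t have the exact training script anymore (Gemini rewrote it a dozen times), but the memory-saving knobs it leaned on were the usual suspects. Assuming a Hugging Face Trainer-style setup, which is an assumption on my part since the repo’s own fine-tuning script may wire things differently, it looked roughly like this:

```python
from transformers import TrainingArguments

# The knobs that matter for squeezing training into a 16GB Kaggle GPU.
# Whether Qwen3-TTS fine-tuning actually goes through the HF Trainer is an
# assumption here; treat this as a generic memory-budget sketch.
args = TrainingArguments(
    output_dir="qwen3-tts-finetune",
    per_device_train_batch_size=1,      # smallest possible batch
    gradient_accumulation_steps=8,      # simulate a larger effective batch
    gradient_checkpointing=True,        # trade compute for activation memory
    fp16=True,                          # half precision on the Kaggle GPU
    learning_rate=1e-5,
    num_train_epochs=3,
)
```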

This is My Voice on Fine-Tuning. Any Questions?


I might need to train for longer, or use more voice samples, or maybe both. I just happen to be too lazy to do it right now, since I already wasted about 4 hours on it. I also read some comments on GitHub saying the fine-tuned model doesn’t work that well, so I decided to give up for now.

Voice Cloning Instead

Since I had found some early success with voice cloning, I decided to see how far I could get with that. I used a longer recording as my “reference audio,” about 30 seconds long. After that I tried a 1.5-minute clip, which had more variety to it. Now I’m wondering if I could throw a 30-minute clip at it just to capture all of my nuances, but something tells me that at a certain point you hit diminishing returns (and blow up the GPU RAM).
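Cutting those reference clips out of one longer recording is basically a one-liner with pydub (slicing is in milliseconds; the file names here are made up):

```python
from pydub import AudioSegment

full = AudioSegment.from_file("me_reading_full.wav")

full[:30_000].export("reference_30s.wav", format="wav")  # first 30 seconds
full[:90_000].export("reference_90s.wav", format="wav")  # first 1.5 minutes
```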

Ground Truth Audio


Generated Audio


I set up a new Kaggle notebook that would scrape the text from my blog, break it up into sentences, and generate a voice-cloned TTS clip of me speaking each sentence. Then it would stitch all of the audio files together into a giant spoken-article podcast thing! I used a similar method when I wrote that Siri feature that reads articles to you.
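Here’s roughly what that notebook does, minus the actual TTS call. The `clone_sentence()` function is a stand-in for however you invoke the Qwen3-TTS voice cloning, and the URL and CSS selector are placeholders for my blog’s layout; everything else is just scraping, sentence splitting, and stitching with pydub, while keeping track of where each sentence starts so the transcript can be synced later.

```python
import re
import requests
from bs4 import BeautifulSoup
from pydub import AudioSegment

POST_URL = "https://example.com/blog/some-post/"  # placeholder URL

# 1) Scrape the post text (the selector depends on your blog's layout).
html = requests.get(POST_URL).text
article = BeautifulSoup(html, "html.parser").select_one("article")
text = article.get_text(" ", strip=True)

# 2) Naive sentence split -- good enough for blog prose.
sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

# 3) Generate one clip per sentence and stitch them together, remembering
#    each sentence's start time for the transcript.
def clone_sentence(sentence: str) -> AudioSegment:
    """Placeholder for the Qwen3-TTS voice-cloning call."""
    raise NotImplementedError

pause = AudioSegment.silent(duration=300)   # 300 ms of breathing room
episode = AudioSegment.empty()
timings = []                                # (start_ms, end_ms, sentence)

for sentence in sentences:
    clip = clone_sentence(sentence)
    start = len(episode)                    # pydub lengths are milliseconds
    episode += clip + pause
    timings.append((start, start + len(clip), sentence))

episode.export("episode.mp3", format="mp3")
```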

The results ended up sounding really good! In the past, I’ve tried this same technique and got a voice that somewhat resembled mine, but it would talk in an almost Southern drawl. It was super hilarious, but not accurate. The Qwen3-TTS model gets the cadence and prosody of my speech down pretty well, all from a one-shot example! Sure, the TTS doesn’t read everything exactly the way I would, but it’s good enough for me!

My Previous Attempt Where I Sounded Texan

Adding a Podcast To My Blog

I started looking for projects I could use to incorporate the audio article into my actual blog, sort of like the transcript feature on YouTube where you can skip to sections by clicking on the text. I found a project called Hyperaudio that fit the bill. Since the audio is generated from text lifted directly from my blog anyway, I could just generate a subtitle file while generating the blog post audio. That way I wouldn’t have to mess with Whisper or another ASR system, which would probably be pretty inaccurate.
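Since the stitching step above already knows each sentence’s start and end time, writing the subtitle file is just string formatting. Here’s a sketch that emits WebVTT; whether my Hyperaudio setup consumes VTT directly or its own transcript markup, the timings are the same, so treat the output format as an assumption.

```python
def to_timestamp(ms: int) -> str:
    """Format milliseconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    seconds, ms = divmod(ms, 1_000)
    return f"{hours:02}:{minutes:02}:{seconds:02}.{ms:03}"

# `timings` is the (start_ms, end_ms, sentence) list built while stitching.
with open("episode.vtt", "w") as f:
    f.write("WEBVTT\n\n")
    for start, end, sentence in timings:
        f.write(f"{to_timestamp(start)} --> {to_timestamp(end)}\n{sentence}\n\n")
```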

Since I generate my blog with Jekyll, it was pretty simple to add another value to my frontmatter and then conditionally include a script that adds the tags for Hyperaudio to hook into. Plus it was basically no work because I just had AI do it for me! I think it’s a fun feature that makes my blog more accessible, and I’d much rather have someone listen to my blog in my own voice, or at least a pretty good approximation of it.

So now you can listen to me read my blog to you, skip to sections by clicking on the text in the blog, and see the current sentence being highlighted.

Future Improvements

I’m pretty happy so far with how this feature works. One thing that’s annoying, though, is that it requires the blog post to already exist before I can generate audio for it. That’s kind of a chicken-and-egg problem, and hard to solve. I don’t want to generate an audio file for a blog post and then immediately have to make a new one because I changed a word somewhere. But it would be nice to publish the audio and the blog post at the same time.

I already set up a GitHub Action that the Kaggle notebook triggers after it uploads the files to Cloudflare storage. It picks up the new files, adds the frontmatter to the blog post, and then redeploys my blog. Sometimes I just love being a software nerd.
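The frontmatter step is only a few lines if you do it in Python. A sketch with the python-frontmatter package, where the post path, URL, and key names are all made up for illustration:

```python
import frontmatter  # pip install python-frontmatter

POST_PATH = "_posts/2026-01-01-qwen3-tts.md"               # placeholder path
AUDIO_URL = "https://cdn.example.com/audio/qwen3-tts.mp3"  # placeholder URL

# Load the post, add the audio/transcript keys, and write it back out.
post = frontmatter.load(POST_PATH)
post.metadata["audio_url"] = AUDIO_URL                     # illustrative key name
post.metadata["transcript_url"] = AUDIO_URL.replace(".mp3", ".vtt")

with open(POST_PATH, "w") as f:
    f.write(frontmatter.dumps(post))
```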

I’m currently filling in the backlog of my posts with audio, so you may not see audio on older posts until I get to them.

I guess at some point, if a better voice-cloning TTS model comes out, I could regenerate the files, but for now these should do the job nicely!