So I'm going to continue the tradition of starting my blog posts with the word "so".
Since last time, I've set up `torch` and run some experiments with both `suno/bark` and `jbetker/tortoise-tts-v2`. They can both comfortably run on my meagre 8GB of GPU memory, along with an image captioning model. So the fact that things were exploding when run through `cog` tells me that there's some memory locality I'm accidentally taking advantage of, or possibly some inadvertent model duplication I'm avoiding. In any case, I can now generate AI-voiced audio from text using a box under my desk rather than one dependent on the internet. I've got one more problem to crack, and then you can expect this blog to instantly-ish become a monologuing podcast too.
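For the curious, the way I eyeballed that memory claim is nothing fancier than `torch`'s own counters; this is a generic sketch of my own, not part of any of these libraries:

```python
import torch

def gpu_memory_report(tag: str) -> None:
    """Print current and peak CUDA memory usage in GiB."""
    allocated = torch.cuda.memory_allocated() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f"{tag}: {allocated:.2f} GiB allocated, {peak:.2f} GiB peak")

gpu_memory_report("before loading")
# ... load bark / tortoise / the captioning model here ...
gpu_memory_report("after loading")
```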
The basic elements of what I'm setting up are probably going to be in the catwalk repo. As of this specific writing, it's set up to take TTS requests and run them against `suno/bark`. And... not much else.
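For reference, driving `bark` really is about as simple as its README suggests; the prompt text below is mine, but the calls are `bark`'s documented API:

```python
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

# Download and load all of bark's models (slow the first time).
preload_models()

# Generate speech from a text prompt as a numpy array.
audio_array = generate_audio("Hello from the box under my desk.")

# Save the result to disk as a WAV file.
write_wav("bark_generation.wav", SAMPLE_RATE, audio_array)
```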
I'm still polishing up the use of TTS models for my purposes, and it's unlikely that I'll keep both `tortoise` and `bark` around. The basic comparison is:
- `bark` is much, much easier to install and interact with, and is better documented.
- `tortoise` provides better and more consistent output for my use case, and allows you to define custom voices out of the box (sketched just after this list). There's a `bark` fork that can do the latter, but the linked notebook makes it look both more annoying and more restrictive than the utilities provided by `tortoise`.
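Here's roughly what that custom-voice path looks like in `tortoise`, adapted from its README; "myvoice" stands in for a subdirectory of short WAV clips dropped under tortoise/voices/:

```python
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()

# A custom voice is just a directory of short reference clips;
# "myvoice" here is a placeholder for whatever you name yours.
voice_samples, conditioning_latents = load_voice("myvoice")

gen = tts.tts_with_preset(
    "Text to be read aloud in the custom voice.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="fast",
)
torchaudio.save("generated.wav", gen.squeeze(0).cpu(), 24000)
```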
That installation gripe deserves some elaboration, because getting `tortoise` running took two fixes. Firstly, I had to install the Nvidia CUDA toolkit to get past the initial error thrown up by the `tortoise` installation. Apparently this is because the default CUDA libraries contain a bunch of drivers, but not `nvcc`? Which is the CUDA compiler needed for some part of the build process of some `tortoise` requirement or other. Whatever; at this point I'm resigned to just installing random chunks of Nvidia software haphazardly on my model-running machine.
Secondly, and bizarrely, I had to downgrade `pydantic` to 1.9.1 from 2-something, because `deepspeed` is incompatible with `pydantic` 2+ despite having it listed as a dependency in its conda file. A package named `pydantic` annoyingly slowing down development through bureaucracy and a strong-type-adjacent approach to validation is appropriate enough that I can't even really be surprised here.
Once I had that settled, I could run the installation successfully and interact with a `tortoise-tts` pipeline. Or rather, I would be able to if `tortoise-tts` had a pipeline interface. This is one of the few models I've interacted with so far that has to be run manually, which is kind of fine by me at this point. My current setup is a big iron machine sitting under my desk that I `ssh` into from my development laptop. It's zero marginal trouble for me to interact with the models I'll be chaining together through a web interface such as `catwalk` rather than through `docker` containers or `cog` predictions. I'm hoping it will be relatively simple to put together an interface that captions images, reads sentences, and expands/explains tables and code blocks, so that I can make another serious attempt at running this thing again.
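To give a flavour of what I mean by a web interface, here's a toy sketch wiring `bark` into FastAPI; the route and filenames are made up for illustration, and this is emphatically not `catwalk`'s actual code:

```python
from bark import SAMPLE_RATE, generate_audio, preload_models
from fastapi import FastAPI
from fastapi.responses import FileResponse
from scipy.io.wavfile import write as write_wav

app = FastAPI()
preload_models()  # load bark's models once, at startup

@app.get("/tts")
def tts(text: str):
    # Synthesize speech for the query text and hand back a WAV file.
    audio = generate_audio(text)
    write_wav("/tmp/tts_out.wav", SAMPLE_RATE, audio)
    return FileResponse("/tmp/tts_out.wav", media_type="audio/wav")
```

Run it with uvicorn and you can hit /tts?text=... from anywhere on the local network, which is exactly the sort of zero-marginal-trouble interaction I'm after.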
As always, I'll let you know how it goes.