Model Experiments

Fri Oct 13, 2023

So I'm going to continue the tradition of starting my blog posts with the word "so".

Since last time, I've set up torch and run some experiments both with suno/bark and jbetker/tortoise-tts-v2. They can both comfortably run on my meagre 8GB of GPU memory, along with an image captioning model. So the fact that things were exploding when run through cog tells me that there's some memory locality that I'm accidentally taking advantage of, or possibly some inadvertent model duplication I'm avoiding. In any case, I can now generate AI-voiced audio from text using a box under my desk rather than one dependent on the internet. I've got one more problem to crack, and then you can expect this blog to instantly-ish become a monologuing podcast too.

The basic elements of what I'm setting up are probably going to be in the catwalk repo. As of this specific writing, it's set up to take TTS requests and run them against suno/bark. And... not much else.
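For a sense of scale, a single TTS request against bark boils down to something like the sketch below. The three bark imports follow the usage shown in bark's README; the `speak` wrapper and the default output path are hypothetical names for illustration, not anything taken from catwalk.

```python
from pathlib import Path


def speak(text: str, out_path: str = "bark_generation.wav") -> Path:
    """Generate speech for `text` with suno/bark and write it as a WAV file.

    `speak` and `out_path` are illustrative names, not part of catwalk.
    """
    # Imported lazily so this module still loads on machines without bark.
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()  # fetches and caches the model weights on first use
    audio_array = generate_audio(text)  # numpy array of audio samples
    write_wav(out_path, SAMPLE_RATE, audio_array)
    return Path(out_path)
```

Calling `speak("So I'm going to continue the tradition.")` on a machine with bark installed should leave a WAV file at the given path. A real server would presumably load the models once at startup rather than on every request.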

I'm still polishing up the use of TTS models for my purposes, and it's unlikely that I'll keep both tortoise and bark around. The basic comparison is:

- bark is much, much easier to install and interact with, and is better documented.
- tortoise provides better and more consistent output for my use case and allows you to define custom voices out of the box. There's a bark fork that can do the latter, but the linked notebook makes it look both more annoying and more restrictive than the utilities provided by tortoise.

Installation speedbumps

Firstly, I had to install the Nvidia CUDA toolkit to get past the initial error thrown up by the tortoise installation. Apparently this is because the default CUDA libraries contain a bunch of drivers but not nvcc, the CUDA compiler, which is needed for some part of the build process of some tortoise requirement or other. Whatever; at this point I'm resigned to just installing random chunks of Nvidia software haphazardly on my model-running machine.
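A quick way to tell whether nvcc is actually visible before kicking off a long install is to look for it on the PATH. This is a minimal sketch using only the standard library. The helper names are made up for illustration, and it assumes nvcc would be on the PATH if the toolkit were installed, which is not guaranteed (some installs put it under a CUDA directory that has to be added by hand).

```python
import shutil
import subprocess
from typing import Optional


def find_nvcc() -> Optional[str]:
    """Return the full path to nvcc if it is on PATH, else None."""
    return shutil.which("nvcc")


def nvcc_version() -> Optional[str]:
    """Return the output of `nvcc --version`, or None if nvcc is missing."""
    nvcc = find_nvcc()
    if nvcc is None:
        return None
    result = subprocess.run(
        [nvcc, "--version"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip()


if __name__ == "__main__":
    version = nvcc_version()
    if version is None:
        print("nvcc not found on PATH; the CUDA toolkit may not be installed")
    else:
        print(version)
```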

Secondly, and bizarrely, I had to downgrade pydantic to 1.9.1 from 2-something because deepspeed is incompatible with 2+ despite having it listed as a dependency in its conda file. A package named pydantic annoyingly slowing down development through bureaucracy and a strong-type-adjacent approach to validation is appropriate enough that I can't even really be surprised here.
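To check which side of the 2.0 line an environment is on, something like the following works. It is a sketch built on the standard library's `importlib.metadata`; the function names are hypothetical, and the major-version parse is deliberately naive (it only reads the digits before the first dot).

```python
from importlib import metadata
from typing import Optional


def major_version(version: str) -> int:
    """Extract the leading major number from a version string like '1.9.1'."""
    return int(version.split(".")[0])


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def pydantic_is_v1() -> Optional[bool]:
    """True if pydantic 1.x is installed, False for 2+, None if not installed."""
    found = installed_version("pydantic")
    if found is None:
        return None
    return major_version(found) < 2


if __name__ == "__main__":
    print("pydantic version:", installed_version("pydantic"))
    print("safe for deepspeed (per the incompatibility above):", pydantic_is_v1())
```

If this reports a 2.x version, pinning with `pip install "pydantic==1.9.1"` (or the conda equivalent) matches the downgrade described above.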

Once I had that settled, I could run the installation successfully and interact with a tortoise-tts pipeline. Or rather, I would be able to if tortoise-tts had a pipeline interface. This is one of the few models I've interacted with so far that has to be run manually, which is kind of fine by me at this point. My current setup is a big iron machine sitting under my desk that I ssh into from my development laptop. It is zero marginal trouble for me to interact with the models I'll be chaining together through a web interface such as catwalk rather than docker containers or cog predictions. I'm hoping it'll be relatively simple to put together an interface that captions images, reads sentences and expands/explains tables and code blocks so that I can make another serious attempt at running this thing.
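"Run manually" here means roughly the following. This sketch is based on the usage shown in the tortoise-tts README and its `do_tts.py` script, so treat the exact calls as an assumption about the installed version; the `tortoise_say` wrapper and its defaults are hypothetical names chosen for illustration.

```python
def tortoise_say(
    text: str,
    voice: str = "random",
    preset: str = "fast",
    out_path: str = "generated.wav",
) -> str:
    """Synthesize `text` with tortoise-tts and save it as a WAV file.

    `tortoise_say` is an illustrative wrapper, not part of tortoise or catwalk.
    """
    # Imported lazily so this module still loads without tortoise installed.
    import torchaudio
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_voice

    tts = TextToSpeech()  # loads the model weights
    # For a named voice this returns reference clips or cached latents;
    # for "random" both values are None and tortoise picks a voice itself.
    voice_samples, conditioning_latents = load_voice(voice)
    audio = tts.tts_with_preset(
        text,
        voice_samples=voice_samples,
        conditioning_latents=conditioning_latents,
        preset=preset,
    )
    # Assumes tortoise's 24 kHz output, as in its example script.
    torchaudio.save(out_path, audio.squeeze(0).cpu(), 24000)
    return out_path
```

Wrapping the manual steps in one function like this is about as close to a pipeline interface as it gets, and it is the kind of thing a web interface could call per request (ideally with the `TextToSpeech` instance created once and reused).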

As always, I'll let you know how it goes.