Getting FastPitch From NVIDIA Working

A little extra work.
machine learning
text to speech

January 19, 2023


I spend a reasonable amount of time watching Twitch streamers online while working on projects. For those who might not know, Twitch is a platform for people to stream games or other creative content. Mostly, I watch people play video games that I like. While watching a channel, you earn what are called Channel Points. These can be spent on all kinds of things, from disabling people’s guns to spamming the screen and even Text-to-Speech. These kinds of models are not quite in the domain I’d consider a part of what I focus on, but they exist and would be a good first step into this area.

We’re going to start by using the NVIDIA TTS FastPitch model from Hugging Face. If you’re not aware of Hugging Face then definitely check it out; many state-of-the-art models end up on this platform for general use.


Setting this up is not as simple as pip install nemo_toolkit['all'], as stated in the documentation. So, we’re going to go over all the problems and how I solved them along the way to getting this working.

Install cython

The first problem is that while cython was included in the dependencies, it was not installed like it should have been. I’m not sure why this is the case because running python3 -m pip install cython worked fine. So, that was the first problem, which was solved without issue.

Install And Configure llvm

This was an annoying one to solve. I tried to install llvmlite from PyPI and that sadly didn’t work. I ended up having to install LLVM via the package manager using sudo apt-get install llvm on my desktop. Also, you’ll want to set the environment variable for llvm once it’s installed using export LLVM_CONFIG=$(which llvm-config), as you’ll need it to install the package.

Install pynini

For my desktop, I needed to install the Python dev packages to get this installed. The error I got was a missing Python.h while trying to compile the package. If you don’t have this already installed, then run sudo apt-get install python3-dev (it is a system package, not something pip can install), and then simply run python3 -m pip install pynini and it should work.

Strangely, I didn’t have this problem on my laptop, which is running Arch, but I needed to download and compile OpenFST instead. I also ran sudo pacman -S base-devel before this since the python-dev package does not exist in Arch. And then it worked fine.
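Before moving on, it can save some time to confirm that the pieces above actually landed in the environment. The following is a minimal sketch using only the standard library; the helper names are made up for this post, and the module names in REQUIRED_MODULES are my assumption about how each package is imported.

```python
import importlib.util
import shutil

# Assumption: these are the import names of the packages discussed above
# (the cython package, for example, imports as "Cython").
REQUIRED_MODULES = ["Cython", "llvmlite", "pynini", "nemo"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def find_llvm_config():
    """Return the path to llvm-config if it is on PATH, otherwise None."""
    return shutil.which("llvm-config")


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", missing or "none")
    print("llvm-config:", find_llvm_config() or "not found on PATH")
```

If anything shows up as missing, the matching section above is the place to go back to. Note that this only checks that a module can be located, not that it imports cleanly.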

Test and Use!

Let’s go ahead and do a trial run of this now.

# Load FastPitch
from nemo.collections.tts.models import FastPitchModel
spec_generator = FastPitchModel.from_pretrained("nvidia/tts_en_fastpitch")

# Load vocoder
from nemo.collections.tts.models import HifiGanModel
model = HifiGanModel.from_pretrained(model_name="nvidia/tts_hifigan")

parsed = spec_generator.parse("Welcome To the First Step Into Text To Speech!", normalize=False)

spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
audio = model.convert_spectrogram_to_audio(spec=spectrogram)

# Export the file (the audio/ directory must already exist)
import soundfile as sf
sf.write("audio/test_speech.wav", audio.to('cpu').detach().numpy()[0], 22050)
# I don't know if this will work but we're going to try it.
from IPython.display import Audio, display
def beep():
    display(Audio('audio/test_speech.wav', autoplay=False))
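If you are not in a notebook, a quick way to confirm the export worked is to read the file header back. This is a small sketch using the standard library wave module; wav_summary is a hypothetical helper, and it assumes soundfile wrote the file as plain PCM, which I believe is its default for .wav files.

```python
import wave


def wav_summary(source):
    """Return (sample_rate, duration_in_seconds) for a PCM WAV file.

    source can be a file path or an open binary file-like object.
    """
    with wave.open(source, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        frame_count = wav_file.getnframes()
    return sample_rate, frame_count / sample_rate
```

Calling wav_summary("audio/test_speech.wav") should report a sample rate of 22050 if the export above ran as written, along with a duration of a few seconds for a single sentence.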

Apply and Use!

Ok, now that we’ve got this all installed, we’ll need some data. I thought it would be useful to find a public domain book on Project Gutenberg. I settled on Two Years and Four Months in a Lunatic Asylum by Hiram Chase, which sounded like it would be interesting to listen to. Collecting the data is easy and we’ve been over this multiple times in the past.

import requests
url = ""
title = "Two Years and Four Months in a Lunatic Asylum".replace(' ', '-')
r = requests.get(url)
text = r.text

If we try to use this as is, we get an error since there is simply too much data.

try:
    parsed = spec_generator.parse(text)
    spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
    audio = model.convert_spectrogram_to_audio(spec=spectrogram)
except Exception as e:
    print( e )
WARNING! Your input is too long and could take a long time to normalize.Use split_text_into_sentences() to make the input shorter and then call normalize_list().
maximum recursion depth exceeded while calling a Python object

The methods referred to here are not mentioned anywhere in the documentation, so I had to go find them. They live in nemo_text_processing.text_normalization.normalize, so we’ll go ahead and use them now:

from nemo_text_processing.text_normalization.normalize import Normalizer

# assert input_case in ["lower_cased", "cased"]
# Another thing from the code which is not documented
norm = Normalizer(input_case='cased')
sText = norm.split_text_into_sentences(text=text)
text = norm.normalize_list(sText)

The problem we have now is that the FastPitch model’s .parse() call requires a string and does not understand the list we get back from norm.normalize_list(). However, if we pass normalize=False then it does accept it, so we’ll do that. However, if we try to run this via a for loop then we’ll get the dreaded CUDA error:


try:
    parsed = spec_generator.parse(text, normalize=False)
    spectrogram = spec_generator.generate_spectrogram(tokens=parsed)
    audio = model.convert_spectrogram_to_audio(spec=spectrogram)
except Exception as e:
    print( e )
CUDA out of memory. Tried to allocate 132.88 GiB (GPU 0; 10.91 GiB total capacity; 6.64 GiB already allocated; 3.08 GiB free; 7.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
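Given the size of that allocation, one direction that might be worth exploring is feeding the model a bounded amount of text per call instead of the whole book. The sketch below is untested against FastPitch and is only meant to show the shape of the idea: chunk_sentences and synthesize_chunks are hypothetical helpers, the max_chars default of 400 is an arbitrary guess, and synthesize_one stands in for whatever function wraps the parse, generate_spectrogram and convert_spectrogram_to_audio calls.

```python
def chunk_sentences(sentences, max_chars=400):
    """Group sentences into strings of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk.
    max_chars=400 is an arbitrary starting point, not a tuned value.
    """
    chunks, current = [], ""
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip empty entries left over from splitting
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def synthesize_chunks(chunks, synthesize_one):
    """Call synthesize_one on each chunk and collect the results in order.

    synthesize_one is a caller-supplied (hypothetical) function that turns
    one string into audio, for example by wrapping the model calls above.
    """
    return [synthesize_one(chunk) for chunk in chunks]
```

In practice, synthesize_one would probably want to run the model calls inside torch.no_grad() and move each result to the CPU before returning, so that nothing from one chunk stays on the GPU while the next one is processed. Whether that is enough to fit in 10.91 GiB is something I have not verified.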

Conclusions and Observations

That is definitely not going on my GPU. I’ve attempted a few times to loop this but thus far have had no real luck, so I’ll need to either get a bigger GPU or push this to the cloud. Although, I’m definitely not putting all this on a single GPU without some serious money.