Text-to-Speech System Using Python and the Xtts Model – Text Limitations and Solution – Part 3

Text-to-Speech (TTS) systems have become increasingly essential in the world of AI, powering everything from virtual assistants to accessibility tools. In this article, we will walk through the process of setting up and using a TTS system based on Python, leveraging the Xtts model for high-quality speech synthesis. This guide is designed for developers who are looking to build their own TTS system and integrate it into projects.

Prerequisites

Before we begin, make sure you have Python 3.9.6 installed on your machine. You can check your Python version using the command:

python --version

Additionally, we will use a virtual environment to manage dependencies. The commands in this guide assume Windows. To create a virtual environment and activate it, run:

python -m venv venv
venv\Scripts\activate

Once the environment is activated, it’s a good idea to upgrade your pip package installer to the latest version:

python.exe -m pip install --upgrade pip

Installing Required Packages

We will install the TTS package and the other dependencies, then pin PyTorch to a fixed version. The --cache-dir option tells pip where to keep downloaded packages so that repeated installs can reuse them:

pip install TTS --cache-dir "D:/internship/tts_project/.cache"
pip uninstall torch torchvision torchaudio
pip install transformers datasets torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --cache-dir "D:/internship/tts_project/.cache"
pip install soundfile --cache-dir "D:/internship/tts_project/.cache"

GPU acceleration is optional. If you have an NVIDIA GPU and want to use it, install the NVIDIA CUDA toolkit, which you can download from https://developer.nvidia.com/cuda-downloads.
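
Once PyTorch is installed, it can be worth confirming that it actually sees a CUDA device before moving the model to the GPU. The sketch below is a minimal check, not part of the original project: pick_device is a hypothetical helper name, and it falls back to "cpu" when PyTorch is missing or no GPU is visible.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # installed in the previous step
    except ImportError:
        # PyTorch is not installed in this environment
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Speech synthesis would run on: {pick_device()}")
```

If this prints "cpu" on a machine with an NVIDIA GPU, the installed PyTorch build or the CUDA setup is the first thing to check.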

Initializing the TTS Model

With all the dependencies in place, we can now start setting up the TTS model. The following code will help load the model configuration and initialize it:

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
import soundfile as sf  # To save the output as a wav file

# Step 1: Load the model configuration
config = XttsConfig()
config.load_json("D:/internship/tts/assets/tts_configs/config.json")

# Step 2: Initialize the model
model = Xtts.init_from_config(config)

# Step 3: Load the pre-trained weights
model.load_checkpoint(config, checkpoint_dir="D:/internship/tts/assets/tts_configs", eval=True)

# Optional: If you have CUDA installed and want to use GPU, uncomment the line below
# model.cuda()
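
Loading fails with a fairly opaque error when the checkpoint folder is incomplete, so a small pre-flight check can save time. The following is a sketch under assumptions: missing_checkpoint_files is a hypothetical helper, and config.json is the only file name taken from this guide. Any other names you pass in depend on how you downloaded the model.

```python
from pathlib import Path


def missing_checkpoint_files(checkpoint_dir, required=("config.json",)):
    """Return the names in `required` that are not present in checkpoint_dir."""
    folder = Path(checkpoint_dir)
    return [name for name in required if not (folder / name).is_file()]


# An empty list means every required file was found
print(missing_checkpoint_files("D:/internship/tts/assets/tts_configs"))
```

Calling this before load_checkpoint lets the script stop early with a clear message instead of failing partway through model setup.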

Synthesizing Speech

The createTTS function below converts text into speech. It takes the text, a reference audio file that supplies the speaker's voice characteristics, and the path of the .wav file to write:

def createTTS(text, input_audio, output_audio):
    # Step 4: Synthesize the output
    outputs = model.synthesize(
        text,
        config,
        speaker_wav=input_audio,  # Path to the reference recording of the speaker
        gpt_cond_len=3,
        language="en",
    )

    # Step 5: Save the synthesized speech to a wav file
    output_wav = outputs['wav']
    # XTTS generates audio at output_sample_rate (24 kHz by default),
    # which differs from the input sample_rate (22.05 kHz)
    sf.write(output_audio, output_wav, config.audio.output_sample_rate)

    print(f"Speech synthesis complete and saved to {output_audio}")

This function uses an input .wav file to capture the speaker’s voice characteristics and outputs the generated speech to a new .wav file.
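
Because the reference recording drives the voice of the result, it can help to inspect it before synthesis. The sketch below uses only the standard library wave module and assumes the clip is an uncompressed PCM .wav file. describe_wav is a hypothetical helper, not part of the TTS package.

```python
import wave


def describe_wav(path):
    """Return channel count, sample rate and duration of a PCM .wav file."""
    with wave.open(str(path), "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_rate": rate,
            "seconds": frames / rate,
        }
```

For example, describe_wav("input.wav") returns a small dictionary you can print to confirm that the clip is the recording you intended and is not empty.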

Managing Long Text

When working with lengthy text, it is often necessary to break it into smaller chunks before feeding it into the TTS model. The following function splits the text at punctuation marks and groups the pieces into chunks that stay within a specified length wherever the punctuation allows:

import re

def break_text_by_punctuation(text, max_chunk_size=250):
    # Split on punctuation marks; the capturing group keeps each mark in the result
    sentences = re.split(r'([.,!?])', text)

    chunks = []
    current_chunk = ""

    # Walk the list in (text, punctuation) pairs. re.split returns an odd number
    # of items, so the text after the last mark has no punctuation partner.
    for i in range(0, len(sentences), 2):
        punctuation = sentences[i+1] if i + 1 < len(sentences) else ""
        sentence = sentences[i] + punctuation
        # If adding the sentence to the current chunk exceeds the limit, store the chunk and start a new one
        if len(current_chunk) + len(sentence) > max_chunk_size:
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            current_chunk = sentence
        else:
            current_chunk += sentence

    # Add the last chunk, skipping leftovers that are only whitespace
    if current_chunk.strip():
        chunks.append(current_chunk.strip())

    return chunks

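This function splits the text into smaller chunks at punctuation marks, allowing the TTS model to process each chunk independently. Note that a stretch of text longer than max_chunk_size with no punctuation in it is kept whole, so an individual chunk can still exceed the limit.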
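
Splitting on punctuation alone cannot bound the chunk length when the text contains a long run without any punctuation. One possible variant, sketched here as an assumption rather than as the author's method, splits on sentence endings first and then wraps any oversized sentence on word boundaries with the standard library textwrap module. The name split_with_hard_limit is hypothetical.

```python
import re
import textwrap


def split_with_hard_limit(text, max_chunk_size=250):
    """Split text into chunks that never exceed max_chunk_size characters."""
    # Break after sentence-ending punctuation that is followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())

    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # Wrap an oversized sentence on word boundaries; keep short ones whole
        if len(sentence) > max_chunk_size:
            pieces = textwrap.wrap(sentence, width=max_chunk_size)
        else:
            pieces = [sentence]
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if len(candidate) <= max_chunk_size:
                current = candidate
            else:
                chunks.append(current)
                current = piece

    if current:
        chunks.append(current)
    return chunks
```

Unlike the comma-based split, this variant keeps whole sentences together when they fit, which tends to give the model more natural units to speak, at the cost of occasionally cutting a very long sentence mid-phrase.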

Reading Data and Generating Speech

Now, let’s bring everything together. The following code reads a text file, splits the content into chunks, and converts each chunk into speech:

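def getData():
    # Read the whole input file; the with block closes it even if reading fails
    with open('data.txt', 'r') as f:
        return f.read()

data = break_text_by_punctuation(getData())

# Write one numbered file per chunk: 1.wav, 2.wav, ...
for count, d in enumerate(data, start=1):
    print(d)
    createTTS(d, "input.wav", str(count) + ".wav")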
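
The loop leaves one numbered .wav file per chunk. If a single recording is preferred, the pieces can be joined afterwards. The sketch below uses the standard library wave module and assumes all parts are uncompressed PCM files with the same channel count, sample width and sample rate. merge_wavs is a hypothetical helper, not part of the TTS package.

```python
import wave


def merge_wavs(input_paths, output_path):
    """Concatenate PCM .wav files that share one audio format into a single file."""
    params = None
    frames = []
    for path in input_paths:
        with wave.open(str(path), "rb") as part:
            fmt = (part.getnchannels(), part.getsampwidth(), part.getframerate())
            if params is None:
                params = fmt
            elif fmt != params:
                raise ValueError(f"{path} has a different audio format")
            frames.append(part.readframes(part.getnframes()))

    if params is None:
        raise ValueError("no input files given")

    with wave.open(str(output_path), "wb") as merged:
        merged.setnchannels(params[0])
        merged.setsampwidth(params[1])
        merged.setframerate(params[2])
        for chunk in frames:
            merged.writeframes(chunk)
```

With the numbering used above, a call such as merge_wavs([f"{n}.wav" for n in range(1, len(data) + 1)], "combined.wav") would join the chunks in reading order.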

Complete code

# python version 3.9.6
# python --version
# python -m venv venv
# venv\Scripts\activate
# python.exe -m pip install --upgrade pip
# pip install TTS --cache-dir "D:/internship/tts_project/.cache"
# pip uninstall torch torchvision torchaudio
# pip install transformers datasets torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --cache-dir "D:/internship/tts_project/.cache"
# https://developer.nvidia.com/cuda-downloads
# pip install soundfile --cache-dir "D:/internship/tts_project/.cache"
# pip install deepspeed==0.10.3 --cache-dir "D:/internship/tts_project/.cache" optional

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
import soundfile as sf  # To save the output as a wav file

# Step 1: Load the model configuration
config = XttsConfig()
config.load_json("D:/internship/tts/assets/tts_configs/config.json")

# Step 2: Initialize the model
model = Xtts.init_from_config(config)

# Step 3: Load the pre-trained weights
model.load_checkpoint(config, checkpoint_dir="D:/internship/tts/assets/tts_configs", eval=True)

# Optional: If you have CUDA installed and want to use GPU, uncomment the line below
# model.cuda()


def createTTS(text, input_audio, output_audio):
    # Step 4: Synthesize the output
    outputs = model.synthesize(
        text,
        config,
        speaker_wav=input_audio,  # Path to the reference recording of the speaker
        gpt_cond_len=3,
        language="en",
    )

    # Step 5: Save the synthesized speech to a wav file
    output_wav = outputs['wav']
    # XTTS generates audio at output_sample_rate (24 kHz by default),
    # which differs from the input sample_rate (22.05 kHz)
    sf.write(output_audio, output_wav, config.audio.output_sample_rate)

    print(f"Speech synthesis complete and saved to {output_audio}")

def getData():
    # Read the whole input file; the with block closes it even if reading fails
    with open('data.txt', 'r') as f:
        return f.read()

import re

def break_text_by_punctuation(text, max_chunk_size=250):
    # Split on punctuation marks; the capturing group keeps each mark in the result
    sentences = re.split(r'([.,!?])', text)

    chunks = []
    current_chunk = ""

    # Walk the list in (text, punctuation) pairs. re.split returns an odd number
    # of items, so the text after the last mark has no punctuation partner.
    for i in range(0, len(sentences), 2):
        punctuation = sentences[i+1] if i + 1 < len(sentences) else ""
        sentence = sentences[i] + punctuation
        # If adding the sentence to the current chunk exceeds the limit, store the chunk and start a new one
        if len(current_chunk) + len(sentence) > max_chunk_size:
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            current_chunk = sentence
        else:
            current_chunk += sentence

    # Add the last chunk, skipping leftovers that are only whitespace
    if current_chunk.strip():
        chunks.append(current_chunk.strip())

    return chunks

data = break_text_by_punctuation(getData())

# Write one numbered file per chunk: 1.wav, 2.wav, ...
for count, d in enumerate(data, start=1):
    print(d)
    createTTS(d, "input.wav", str(count) + ".wav")

Here, the getData function reads the text from a file, break_text_by_punctuation splits it into manageable chunks, and the loop generates a numbered .wav file for each chunk using the createTTS function.

Conclusion

With this step-by-step guide, you can now build a powerful Text-to-Speech system using Python and the Xtts model. You can further enhance this system by exploring other language models or optimizing performance with GPU acceleration. This TTS system can be integrated into various applications, from voice assistants to content creation tools, offering flexibility and customization options tailored to your project’s needs.
