How to create text-to-speech with a neural network

走远了吗. Submitted on 2021-02-07 10:59:33

Question


I am creating a Text to Speech system for a phonetic language called "Kannada" and I plan to train it with a Neural Network. The input is a word/phrase while the output is the corresponding audio.

While implementing the network, I was thinking the input should be the segmented characters of the word/phrase, since the pronunciation depends only on the characters that make up the word, unlike English, where we have silent letters and parts of speech to consider. However, I do not know how I should train the output.

Since my dataset is a collection of words/phrases and the corresponding MP3 files, I thought of converting all the audio files from MP3 to WAV using pydub.

from pydub import AudioSegment

# Convert one MP3 file to WAV (MP3 decoding requires ffmpeg/libav)
sound = AudioSegment.from_mp3("audio/file1.mp3")
sound.export("wav/file1.wav", format="wav")
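Since the dataset contains many files, the single-file snippet above can be wrapped in a loop. This is a sketch under assumptions: the directory names `audio/` and `wav/` and the helper `wav_path_for` are hypothetical, and pydub plus ffmpeg are assumed to be installed.

```python
from pathlib import Path

def wav_path_for(mp3_path, out_dir="wav"):
    """Map an MP3 path to its WAV output path, e.g. 'audio/file1.mp3' -> 'wav/file1.wav'."""
    return str(Path(out_dir) / (Path(mp3_path).stem + ".wav"))

def convert_all(mp3_dir="audio", out_dir="wav"):
    # Assumes pydub (and ffmpeg for MP3 decoding) are available,
    # as in the snippet above.
    from pydub import AudioSegment
    Path(out_dir).mkdir(exist_ok=True)
    for mp3 in Path(mp3_dir).glob("*.mp3"):
        sound = AudioSegment.from_mp3(str(mp3))
        sound.export(wav_path_for(str(mp3), out_dir), format="wav")
```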

Next, I open the wav file and convert it to a normalized byte array with values between 0 and 1.

import numpy as np
import wave

f = wave.open('wav/kn3.wav', 'rb')
frames = f.readframes(-1)
f.close()

# Array of integers in the range [0, 255]
# (assumes 8-bit audio; a 16-bit WAV would need dtype='int16')
data = np.frombuffer(frames, dtype='uint8')

# Normalized samples in [0, 1]
arr = data / 255
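The snippet above hard-codes `uint8`, which only holds for 8-bit WAV files; pydub's default export is usually 16-bit PCM, where the samples are signed. A sketch that picks the dtype from the file's sample width (it writes a small synthetic 16-bit WAV first, `example.wav`, purely so the example is self-contained; in practice you would open one of the converted files):

```python
import wave
import numpy as np

# Write a short synthetic 16-bit mono WAV so the example is self-contained.
sr = 16000
t = np.linspace(0, 0.1, int(sr * 0.1), endpoint=False)
samples = (np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)

with wave.open('example.wav', 'wb') as w:
    w.setnchannels(1)
    w.setsampwidth(2)      # 2 bytes per sample -> 16-bit PCM
    w.setframerate(sr)
    w.writeframes(samples.tobytes())

# Read it back and normalize based on the actual sample width.
with wave.open('example.wav', 'rb') as f:
    sampwidth = f.getsampwidth()
    frames = f.readframes(f.getnframes())

# Pick the dtype from the sample width instead of assuming uint8.
dtype = {1: np.uint8, 2: np.int16}[sampwidth]
data = np.frombuffer(frames, dtype=dtype)

# Normalize to [-1, 1] (8-bit WAV is unsigned, 16-bit is signed).
if sampwidth == 1:
    arr = (data.astype(np.float32) - 128) / 128.0
else:
    arr = data.astype(np.float32) / 32768.0
```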

How should I train this?

From here, I am not sure how to train this against the input text. I would seemingly need a variable number of input and output neurons in the first and last layers, as the number of characters (first layer) and the number of bytes in the corresponding WAV (last layer) change for every input.

Since RNNs deal with such variable-length data, I thought they would come in handy here.
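For reference, the usual way variable-length inputs are fed to an RNN is to map characters to integer IDs and pad each batch to its longest sequence, keeping the true lengths so the padding can be masked out. A framework-free sketch (the tiny Kannada vocabulary here is a made-up illustration, not part of the question's dataset):

```python
import numpy as np

def pad_batch(seqs, pad_id=0):
    """Pack variable-length ID sequences into a [batch, max_len] array,
    plus the true lengths so an RNN can ignore the padding."""
    max_len = max(len(s) for s in seqs)
    batch = np.full((len(seqs), max_len), pad_id, dtype=np.int64)
    for i, s in enumerate(seqs):
        batch[i, :len(s)] = s
    lengths = np.array([len(s) for s in seqs])
    return batch, lengths

# Hypothetical character vocabulary: each Kannada character gets an integer ID.
vocab = {ch: i + 1 for i, ch in enumerate("ಕನಡ")}  # 0 is reserved for padding
words = ["ಕನ", "ಕನಡ"]
ids = [[vocab[ch] for ch in w] for w in words]
batch, lengths = pad_batch(ids)
```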

Correct me if I am wrong, but the outputs of neural networks are usually probability values between 0 and 1. However, we are not dealing with a classification problem here. The audio can be anything, right? In my case, the "output" should be a vector of bytes corresponding to the WAV file. So there will be around 40,000 of these per word, with values between 0 and 255 (without the normalization step). How do I train on this speech data? Any suggestions are appreciated.

EDIT 1: In response to Aaron's comment

From what I understand, phonemes are the basic sounds of a language. So why do I need a neural network to map phoneme labels to speech? Can't I just say, "whenever you see this character, pronounce it like this"? After all, Kannada is phonetic: there are no silent letters, and all words are pronounced the way they are spelled. How would a neural network help here, then?

On input of a new text, I just need to break it down into its constituent characters (which are also the phonemes), retrieve each one's audio file (converted from WAV to raw byte data), merge the bytes together, and convert the result back to a WAV file.
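The lookup-and-concatenate idea above can be sketched as follows. The per-character sample arrays here are placeholders (`unit_db` and `synthesize` are hypothetical names); in practice each entry would hold the raw samples of that character's recorded WAV. Note that naive concatenation tends to sound choppy, since adjacent sounds in real speech blend into each other.

```python
import numpy as np

# Hypothetical lookup table: per-character audio as raw sample arrays.
# Placeholder data stands in for the per-character WAV recordings.
unit_db = {
    "ಕ": np.zeros(800, dtype=np.int16),
    "ನ": np.ones(900, dtype=np.int16),
}

def synthesize(word, unit_db):
    """Naive concatenative synthesis: look up each character's samples
    and join them end to end."""
    return np.concatenate([unit_db[ch] for ch in word])

audio = synthesize("ಕನ", unit_db)
```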

Is this too simplistic? Am I missing something here? What would be the point of a neural network for this particular language (Kannada)?


Answer 1:


This is not trivial and requires a special architecture. You can read descriptions of such architectures in publications from DeepMind and Baidu.

You might also want to study existing implementations of WaveNet training.

Overall, pure end-to-end speech synthesis still does not work well. If you are serious about text-to-speech, it is better to study conventional systems like Merlin.



Source: https://stackoverflow.com/questions/43053969/how-to-create-text-to-speech-with-neural-network
