After creating audio with Google Text-to-Speech, I need to get the timestamp of each word so I can align the translation with the audio output. I can see that it is done by the other