I am using a microphone which records sound through a browser, converts it into a file and sends the file to a java server. Then, my java server sends the file to the cloud
Google "Speech-to-Text typically processes audio faster than real-time, processing 30 seconds of audio in 15 seconds on average" [1]. You can use Google APIs Explorer to test exactly how long your each request would take [2].
To speed up the transcribing you may try to add recognition metadata to your request [3]. You can provide phrase hints if you are aware of the context of the speech [4]. Or use enhanced models to use special set of machine learning models [5]. All these suggestions would improve the accuracy and might have effects on transcribing speed.
When using the streaming recognition, in config you can set singleUtterance
option to True
. This will detect if user pause speaking and cease the recognition. If not streaming request will continue until to the content limit, which is 1 minute of audio length for streaming request [6].