How to extract closed caption transcript from YouTube video?

囚心锁ツ 2020-12-22 19:32

Is it possible to extract the closed caption transcript from YouTube videos?

We have over 200 webcasts on YouTube and each is at least one hour long. YouTube has clo

10 answers

感情败类 (OP)

2020-12-22 19:40

Another option is to use youtube-dl:

youtube-dl --skip-download --write-auto-sub -- "$youtube_url"

The default format is vtt and the other available format is ttml (--sub-format ttml).

--write-sub
       Write subtitle file

--write-auto-sub
       Write automatically generated subtitle file (YouTube only)

--all-subs
       Download all the available subtitles of the video

--list-subs
       List all available subtitles for the video

--sub-format FORMAT
       Subtitle format, accepts formats preference, for example: "srt" or "ass/srt/best"

--sub-lang LANGS
       Languages of the subtitles to download (optional) separated by commas, use --list-subs for available language tags
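
As a rough illustration of how the options above combine, here is a minimal sketch of a helper that downloads only the manually written subtitles for one language. The function name fetch_subs and the YTDL override variable are hypothetical and exist only for this example; the flags themselves are the ones listed above.

```shell
# Hypothetical helper (not part of youtube-dl): fetch the subtitles for one
# language in VTT format without downloading the video.
# YTDL is an assumed override so the command can be dry-run, e.g. YTDL=echo.
fetch_subs() {
  lang=$1   # language tag as shown by --list-subs, e.g. en
  url=$2    # video URL
  "${YTDL:-youtube-dl}" --skip-download --write-sub \
    --sub-lang "$lang" --sub-format vtt -- "$url"
}
```

A typical session would first run youtube-dl --list-subs -- "$youtube_url" to see the available language tags, then fetch_subs en "$youtube_url". Use --write-auto-sub instead of --write-sub when only automatically generated captions exist.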

You can use ffmpeg to convert the subtitle file to another format:

ffmpeg -i input.vtt output.srt

This is what the VTT subtitles look like:

WEBVTT
Kind: captions
Language: en

00:00:01.429 --> 00:00:04.249 align:start position:0%

ladies<00:00:02.429> and<00:00:02.580> gentlemen<00:00:02.879> I'd<00:00:03.870> like<00:00:04.020> to<00:00:04.110> thank

00:00:04.249 --> 00:00:04.259 align:start position:0%
ladies and gentlemen I'd like to thank
 

00:00:04.259 --> 00:00:05.930 align:start position:0%
ladies and gentlemen I'd like to thank
you<00:00:04.440> for<00:00:04.620> coming<00:00:05.069> tonight<00:00:05.190> especially<00:00:05.609> at

00:00:05.930 --> 00:00:05.940 align:start position:0%
you for coming tonight especially at
 

00:00:05.940 --> 00:00:07.730 align:start position:0%
you for coming tonight especially at
such<00:00:06.180> short<00:00:06.690> notice

00:00:07.730 --> 00:00:07.740 align:start position:0%
such short notice


00:00:07.740 --> 00:00:09.620 align:start position:0%
such short notice
I'm<00:00:08.370> sure<00:00:08.580> mr.<00:00:08.820> Irving<00:00:09.000> will<00:00:09.120> fill<00:00:09.300> you<00:00:09.389> in<00:00:09.420> on

00:00:09.620 --> 00:00:09.630 align:start position:0%
I'm sure mr. Irving will fill you in on
 

00:00:09.630 --> 00:00:11.030 align:start position:0%
I'm sure mr. Irving will fill you in on
the<00:00:09.750> circumstances<00:00:10.440> that's<00:00:10.620> brought<00:00:10.920> us

00:00:11.030 --> 00:00:11.040 align:start position:0%
the circumstances that's brought us

Here are the same subtitles without the part at the top of the file and without tags:

00:00:01.429 --> 00:00:04.249 align:start position:0%

ladies and gentlemen I'd like to thank

00:00:04.249 --> 00:00:04.259 align:start position:0%
ladies and gentlemen I'd like to thank


00:00:04.259 --> 00:00:05.930 align:start position:0%
ladies and gentlemen I'd like to thank
you for coming tonight especially at

00:00:05.930 --> 00:00:05.940 align:start position:0%
you for coming tonight especially at


00:00:05.940 --> 00:00:07.730 align:start position:0%
you for coming tonight especially at
such short notice

00:00:07.730 --> 00:00:07.740 align:start position:0%
such short notice


00:00:07.740 --> 00:00:09.620 align:start position:0%
such short notice
I'm sure mr. Irving will fill you in on

00:00:09.620 --> 00:00:09.630 align:start position:0%
I'm sure mr. Irving will fill you in on


00:00:09.630 --> 00:00:11.030 align:start position:0%
I'm sure mr. Irving will fill you in on
the circumstances that's brought us

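You can see that each subtitle text is repeated three times. Once the header is removed, a new subtitle text appears on every eighth line (the 3rd, 11th, 19th, and 27th), and the timing line of its cue sits two lines earlier (the 1st, 9th, 17th, and 25th).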

This converts the VTT subtitles to a simpler format:

sed '1,/^$/d' *.vtt| # remove the part at the top
sed 's/<[^>]*>//g'| # remove tags
awk -F. 'NR%8==1{printf"%s ",$1}NR%8==3' # print each new subtitle text and its start time without milliseconds

This is what the output of the command above looks like:

00:00:01 ladies and gentlemen I'd like to thank
00:00:04 you for coming tonight especially at
00:00:05 such short notice
00:00:07 I'm sure mr. Irving will fill you in on
00:00:09 the circumstances that's brought us
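
The pipeline above depends on the fixed eight-line layout. If a file does not follow that layout exactly, a different technique may be more forgiving: remember the start time of the current cue and print a text line only when it differs from the last one printed. This is a sketch under that assumption, not the method used above; the name vtt_to_text is made up for the example, and a caption that legitimately repeats the previous line word for word would be dropped.

```shell
# Hypothetical helper: simplify YouTube auto-generated VTT without relying
# on line positions. Reads the named files, or stdin if none are given.
vtt_to_text() {
  sed 's/<[^>]*>//g' -- "$@" |   # remove inline tags such as <00:00:02.429>
  awk '
    / --> / { split($1, t, "."); start = t[1]; next }  # timing line: keep start time without milliseconds
    start == "" { next }                                # header lines before the first cue
    /^[[:space:]]*$/ { next }                           # blank or whitespace-only lines
    ($0 "") != last { print start, $0; last = $0 }      # text not printed immediately before
  '
}
```

On the sample shown earlier, vtt_to_text file.vtt should produce the same five lines as the output above.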

This prints the closed captions of a video in the simplified format:

cap()(cd /tmp;rm -f -- *.vtt;youtube-dl --skip-download --write-auto-sub -- "$1";sed '1,/^$/d' -- *.vtt|sed 's/<[^>]*>//g'|awk -F. 'NR%8==1{printf"%s ",$1}NR%8==3')

The command below downloads the captions of all videos on a channel. When there is an error like Unable to extract video data, -i (--ignore-errors) causes youtube-dl to skip the video instead of exiting with an error.

youtube-dl -i --skip-download --write-auto-sub -o '%(upload_date)s.%(title)s.%(id)s.%(ext)s' "https://www.youtube.com/channel/$channelid"
for f in *.vtt; do
  sed '1,/^$/d' "$f" | sed 's/<[^>]*>//g' | awk -F. 'NR%8==1{printf"%s ",$1}NR%8==3' > "${f%.vtt}"
done
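
Each file written by the loop above holds one "HH:MM:SS text" line per caption. If a plain transcript without timestamps is what you need, something along these lines could fold a file into wrapped paragraphs. The function name plain_transcript and the 80-column width are assumptions made for this sketch.

```shell
# Hypothetical helper: turn "HH:MM:SS text" lines into plain wrapped text.
# Reads the named files, or stdin if none are given.
plain_transcript() {
  cut -d' ' -f2- -- "$@" |   # drop the leading timestamp field
    paste -sd' ' - |          # join all lines with single spaces
    fold -s -w 80             # wrap at blanks to at most 80 columns
}
```

For example, plain_transcript 20201222.Some-title.abc123.en > transcript.txt, where the file name is only a placeholder for one of the files produced above.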
