How to extract closed caption transcript from YouTube video?

囚心锁ツ 2020-12-22 19:32

Is it possible to extract the closed caption transcript from YouTube videos?

We have over 200 webcasts on YouTube and each is at least one hour long. YouTube has clo

10 Answers
  •  感情败类
    2020-12-22 19:40

    Another option is to use youtube-dl:

    youtube-dl --skip-download --write-auto-sub "$youtube_url"
    

    The default format is vtt and the other available format is ttml (--sub-format ttml).

    --write-sub
           Write subtitle file
    
    --write-auto-sub
           Write automatically generated subtitle file (YouTube only)
    
    --all-subs
           Download all the available subtitles of the video
    
    --list-subs
           List all available subtitles for the video
    
    --sub-format FORMAT
           Subtitle format, accepts formats preference, for example: "srt" or "ass/srt/best"
    
    --sub-lang LANGS
           Languages of the subtitles to download (optional) separated by commas, use --list-subs for available language tags
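
    The options listed above can be combined into one small helper. This is only a sketch: the function name fetch_captions and the YTDL override variable are hypothetical (YTDL exists so the command can be dry-run with echo), and the default language tag en is an assumption. As far as I know, when both --write-sub and --write-auto-sub are given, youtube-dl prefers uploaded subtitles and falls back to auto-generated ones, but check that against your version.

```shell
# fetch_captions URL [LANG]: download captions only (no video) as VTT.
# YTDL is a hypothetical override for this sketch; it defaults to youtube-dl.
fetch_captions() {
  url=$1
  lang=${2:-en}                      # assumed default language tag
  "${YTDL:-youtube-dl}" --skip-download \
    --write-sub --write-auto-sub \
    --sub-lang "$lang" --sub-format vtt \
    -- "$url"
}
```

    Running youtube-dl --list-subs "$youtube_url" first shows which language tags are available for a given video.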
    

    You can use ffmpeg to convert the subtitle file to another format:

    ffmpeg -i input.vtt output.srt
    

    This is what the VTT subtitles look like:

    WEBVTT
    Kind: captions
    Language: en
    
    00:00:01.429 --> 00:00:04.249 align:start position:0%
    
    ladies<00:00:02.429> and<00:00:02.580> gentlemen<00:00:02.879> I'd<00:00:03.870> like<00:00:04.020> to<00:00:04.110> thank
    
    00:00:04.249 --> 00:00:04.259 align:start position:0%
    ladies and gentlemen I'd like to thank
     
    
    00:00:04.259 --> 00:00:05.930 align:start position:0%
    ladies and gentlemen I'd like to thank
    you<00:00:04.440> for<00:00:04.620> coming<00:00:05.069> tonight<00:00:05.190> especially<00:00:05.609> at
    
    00:00:05.930 --> 00:00:05.940 align:start position:0%
    you for coming tonight especially at
     
    
    00:00:05.940 --> 00:00:07.730 align:start position:0%
    you for coming tonight especially at
    such<00:00:06.180> short<00:00:06.690> notice
    
    00:00:07.730 --> 00:00:07.740 align:start position:0%
    such short notice
    
    
    00:00:07.740 --> 00:00:09.620 align:start position:0%
    such short notice
    I'm<00:00:08.370> sure<00:00:08.580> mr.<00:00:08.820> Irving<00:00:09.000> will<00:00:09.120> fill<00:00:09.300> you<00:00:09.389> in<00:00:09.420> on
    
    00:00:09.620 --> 00:00:09.630 align:start position:0%
    I'm sure mr. Irving will fill you in on
     
    
    00:00:09.630 --> 00:00:11.030 align:start position:0%
    I'm sure mr. Irving will fill you in on
    the<00:00:09.750> circumstances<00:00:10.440> that's<00:00:10.620> brought<00:00:10.920> us
    
    00:00:11.030 --> 00:00:11.040 align:start position:0%
    the circumstances that's brought us
     
    

    Here are the same subtitles without the part at the top of the file and without tags:

    00:00:01.429 --> 00:00:04.249 align:start position:0%
    
    ladies and gentlemen I'd like to thank
    
    00:00:04.249 --> 00:00:04.259 align:start position:0%
    ladies and gentlemen I'd like to thank
    
    
    00:00:04.259 --> 00:00:05.930 align:start position:0%
    ladies and gentlemen I'd like to thank
    you for coming tonight especially at
    
    00:00:05.930 --> 00:00:05.940 align:start position:0%
    you for coming tonight especially at
    
    
    00:00:05.940 --> 00:00:07.730 align:start position:0%
    you for coming tonight especially at
    such short notice
    
    00:00:07.730 --> 00:00:07.740 align:start position:0%
    such short notice
    
    
    00:00:07.740 --> 00:00:09.620 align:start position:0%
    such short notice
    I'm sure mr. Irving will fill you in on
    
    00:00:09.620 --> 00:00:09.630 align:start position:0%
    I'm sure mr. Irving will fill you in on
    
    
    00:00:09.630 --> 00:00:11.030 align:start position:0%
    I'm sure mr. Irving will fill you in on
    the circumstances that's brought us
    

    You can see that each caption line appears three times in a row. A new caption line appears every eighth line (lines 3, 11, 19, and 27), and its start time is two lines above it (lines 1, 9, 17, and 25).

    This converts the VTT subtitles to a simpler format:

    sed '1,/^$/d' *.vtt| # remove the part at the top
    sed 's/<[^>]*>//g'| # remove tags
    awk -F. 'NR%8==1{printf"%s ",$1}NR%8==3' # print each new subtitle text and its start time without milliseconds
    

    This is what the output of the command above looks like:

    00:00:01 ladies and gentlemen I'd like to thank
    00:00:04 you for coming tonight especially at
    00:00:05 such short notice
    00:00:07 I'm sure mr. Irving will fill you in on
    00:00:09 the circumstances that's brought us
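
    The NR%8 arithmetic above depends on the exact eight-line rhythm of YouTube's auto-generated VTT files. A sketch of an alternative that tracks cue boundaries instead and prints a caption line only when it differs from the previous one; the function name vtt_dedupe is made up for this example, and it assumes the same VTT layout as the sample above:

```shell
# vtt_dedupe: read a VTT file on stdin and print "HH:MM:SS text" once per
# caption line, without relying on fixed line positions.
vtt_dedupe() {
  sed 's/<[^>]*>//g' |               # remove inline timing tags
  awk '
    / --> / {                        # cue timing line: remember the start time
      split($1, t, ".")              # drop milliseconds
      start = t[1]
      next
    }
    start == "" { next }             # still inside the file header
    { sub(/^[ \t]+/, ""); sub(/[ \t\r]+$/, "") }   # trim surrounding whitespace
    $0 == "" { next }                # skip blank lines
    $0 != last { print start, $0; last = $0 }      # print only lines not already shown
  '
}
```

    Usage: vtt_dedupe < video.en.vtt. One trade-off: if a speaker really does say the same line twice in a row, the second occurrence is dropped as well.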
    

    This prints the closed captions of a video in the simplified format:

    cap()(cd /tmp||exit;rm -f -- *.vtt;youtube-dl --skip-download --write-auto-sub -- "$1";sed '1,/^$/d' -- *.vtt|sed 's/<[^>]*>//g'|awk -F. 'NR%8==1{printf"%s ",$1}NR%8==3')
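
    For a readable transcript rather than a timed listing, the start times can be dropped and the short caption lines rejoined. A minimal sketch, assuming input in the simplified "HH:MM:SS text" format shown above; the function name strip_times and the 80-column default width are illustrative choices, not part of youtube-dl:

```shell
# strip_times [WIDTH]: read "HH:MM:SS text" lines on stdin, drop the time
# field, join the text and re-wrap it at WIDTH columns (default 80).
strip_times() {
  { cut -d' ' -f2- | tr '\n' ' '; echo; } |   # keep the text, join into one line
    fold -s -w "${1:-80}" |                   # wrap at blanks, not mid-word
    sed 's/ *$//'                             # strip trailing spaces
}
```

    For example, cap "$youtube_url" | strip_times 72 prints the captions of one video as wrapped plain text.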

    The command below downloads the captions of all videos on a channel. When there is an error like Unable to extract video data, -i (--ignore-errors) causes youtube-dl to skip the video instead of exiting with an error.

    youtube-dl -i --skip-download --write-auto-sub -o '%(upload_date)s.%(title)s.%(id)s.%(ext)s' https://www.youtube.com/channel/$channelid;for f in *.vtt;do sed '1,/^$/d' "$f"|sed 's/<[^>]*>//g'|awk -F. 'NR%8==1{printf"%s ",$1}NR%8==3'>"${f%.vtt}";done
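
    The post-processing loop at the end of that one-liner is easier to rerun and adjust when unrolled into a function. A sketch under the same assumptions as the one-liner (auto-generated VTT files with the eight-line pattern); the function name simplify_dir and the .txt output suffix are choices made for this example, whereas the original loop writes to the file name with .vtt removed:

```shell
# simplify_dir [DIR]: for each DIR/*.vtt write DIR/<name>.txt in the
# "HH:MM:SS text" format, using the same sed/awk steps as the one-liner.
simplify_dir() {
  dir=${1:-.}
  for f in "$dir"/*.vtt; do
    [ -e "$f" ] || continue          # no .vtt files: the glob stayed literal
    sed '1,/^$/d' "$f" |             # remove the header at the top
      sed 's/<[^>]*>//g' |           # remove inline timing tags
      awk -F. 'NR%8==1{printf "%s ",$1} NR%8==3' > "${f%.vtt}.txt"
  done
}
```

    Since the download and the conversion are separate steps here, the conversion can be repeated without fetching the captions again.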
