How to extract closed caption transcript from YouTube video?

混江龙づ霸主 提交于 2019-12-02 16:43:27

Following document says only the owner of the channel can do this via standard youtube interface: https://developers.google.com/youtube/2.0/developers_guide_protocol_captions?hl=en

Cheap fix: You can click on the "interactive transscript" button - and copy the content this way. Of course you lose the milliseconds this way.

Extremely cheap fix: A shared youtube account - so that multiple people can edit and upload caption files.

Challenging solution: The youtube API allows downloading and uploading of caption files via HTTP... You may write a youtube API application to provide a browser user interface for uploading or downloading for ANY user or particular users.

Here is an example project for this in java http://apiblog.youtube.com/2011/01/youtube-captions-uploader-web-app.html

Here is very simple example of a working upload for everybody: http://yt-captions-uploader.appspot.com/

Will

Here's how to get the transcript of a YouTube video (when available):

  • Go to YouTube and open the video of your choice.
  • Click on the "More actions" button (3 horizontal dots) located next to the Share button.
  • Click "Open transcript"

Although the syntax may be a little goofy this is a pretty good solution.

Source: http://ccm.net/faq/40644-youtube-how-to-get-the-transcript-of-a-video

You can view/copy/download a timecoded xml file of a youtube's closed captions file by accessing

http://video.google.com/timedtext?lang=[LANGUAGE]&v=[YOUTUBE VIDEO IDENTIFIER]

For example http://video.google.com/timedtext?lang=pt&v=WSVKbw7LC2w

NOTE: this method does not download autogenerated closed captions, even if you get the language right (maybe there's a special code for autogenerated languages).

Another option is to use youtube-dl:

youtube-dl --skip-download --write-auto-sub $youtube_url

The default format is vtt and the other available format is ttml (--sub-format ttml).

--write-sub
       Write subtitle file

--write-auto-sub
       Write automatically generated subtitle file (YouTube only)

--all-subs
       Download all the available subtitles of the video

--list-subs
       List all available subtitles for the video

--sub-format FORMAT
       Subtitle format, accepts formats preference, for example: "srt" or "ass/srt/best"

--sub-lang LANGS
       Languages of the subtitles to download (optional) separated by commas, use --list-subs for available language tags

You can use ffmpeg to convert the subtitle file to another format:

ffmpeg -i input.vtt output.srt

This is what the VTT subtitles look like:

WEBVTT
Kind: captions
Language: en

00:00:01.429 --> 00:00:04.249 align:start position:0%

ladies<00:00:02.429><c> and</c><00:00:02.580><c> gentlemen</c><c.colorE5E5E5><00:00:02.879><c> I'd</c></c><c.colorCCCCCC><00:00:03.870><c> like</c></c><c.colorE5E5E5><00:00:04.020><c> to</c><00:00:04.110><c> thank</c></c>

00:00:04.249 --> 00:00:04.259 align:start position:0%
ladies and gentlemen<c.colorE5E5E5> I'd</c><c.colorCCCCCC> like</c><c.colorE5E5E5> to thank
 </c>

00:00:04.259 --> 00:00:05.930 align:start position:0%
ladies and gentlemen<c.colorE5E5E5> I'd</c><c.colorCCCCCC> like</c><c.colorE5E5E5> to thank
you<00:00:04.440><c> for</c><00:00:04.620><c> coming</c><00:00:05.069><c> tonight</c><00:00:05.190><c> especially</c></c><c.colorCCCCCC><00:00:05.609><c> at</c></c>

00:00:05.930 --> 00:00:05.940 align:start position:0%
you<c.colorE5E5E5> for coming tonight especially</c><c.colorCCCCCC> at
 </c>

00:00:05.940 --> 00:00:07.730 align:start position:0%
you<c.colorE5E5E5> for coming tonight especially</c><c.colorCCCCCC> at
such<00:00:06.180><c> short</c><00:00:06.690><c> notice</c></c>

00:00:07.730 --> 00:00:07.740 align:start position:0%
such short notice


00:00:07.740 --> 00:00:09.620 align:start position:0%
such short notice
I'm<00:00:08.370><c> sure</c><c.colorE5E5E5><00:00:08.580><c> mr.</c><00:00:08.820><c> Irving</c><00:00:09.000><c> will</c><00:00:09.120><c> fill</c><00:00:09.300><c> you</c><00:00:09.389><c> in</c><00:00:09.420><c> on</c></c>

00:00:09.620 --> 00:00:09.630 align:start position:0%
I'm sure<c.colorE5E5E5> mr. Irving will fill you in on
 </c>

00:00:09.630 --> 00:00:11.030 align:start position:0%
I'm sure<c.colorE5E5E5> mr. Irving will fill you in on
the<00:00:09.750><c> circumstances</c><00:00:10.440><c> that's</c><00:00:10.620><c> brought</c><00:00:10.920><c> us</c></c>

00:00:11.030 --> 00:00:11.040 align:start position:0%
<c.colorE5E5E5>the circumstances that's brought us
 </c>

Here are the same subtitles without the part at the top of the file and without tags:

00:00:01.429 --> 00:00:04.249 align:start position:0%

ladies and gentlemen I'd like to thank

00:00:04.249 --> 00:00:04.259 align:start position:0%
ladies and gentlemen I'd like to thank


00:00:04.259 --> 00:00:05.930 align:start position:0%
ladies and gentlemen I'd like to thank
you for coming tonight especially at

00:00:05.930 --> 00:00:05.940 align:start position:0%
you for coming tonight especially at


00:00:05.940 --> 00:00:07.730 align:start position:0%
you for coming tonight especially at
such short notice

00:00:07.730 --> 00:00:07.740 align:start position:0%
such short notice


00:00:07.740 --> 00:00:09.620 align:start position:0%
such short notice
I'm sure mr. Irving will fill you in on

00:00:09.620 --> 00:00:09.630 align:start position:0%
I'm sure mr. Irving will fill you in on


00:00:09.630 --> 00:00:11.030 align:start position:0%
I'm sure mr. Irving will fill you in on
the circumstances that's brought us

You can see that each subtitle text is repeated three times. There is a new subtitle text every eighth line (3rd, 11th, 19th, and 27th).

This converts the VTT subtitles to a simpler format:

sed '1,/^$/d' *.vtt| # remove the part at the top
sed 's/<[^>]*>//g'| # remove tags
awk -F. 'NR%8==1{printf"%s ",$1}NR%8==3' # print each new subtitle text and its start time without milliseconds

This is what the output of the command above looks like:

00:00:01 ladies and gentlemen I'd like to thank
00:00:04 you for coming tonight especially at
00:00:05 such short notice
00:00:07 I'm sure mr. Irving will fill you in on
00:00:09 the circumstances that's brought us

This prints the closed captions of a video in the simplified format:

cap()(cd /tmp;rm -f *.vtt;youtube-dl --skip-download --write-auto-sub "$1";sed '1,/^$/d' *.vtt|sed 's/<[^>]*>//g'|awk -F. 'NR%8==1{printf"%s ",$1}NR%8==3'|tee cap)

(Obligatory 'this is probably an internal youtube.com interface and might break at any time')

Instead of linking to another tool that does this, here's an answer to the question of "how to do this"

I used fiddler to inspect the youtube.com HTTP traffic, and there's a response from /api/timedtext that contains the closed caption info as XML.

It seems that a response like this:

    <p t="0" d="5430" w="1">
        <s p="2" ac="136">we&#39;ve</s>
        <s t="780" ac="252"> got</s>
    </p>
    <p t="2280" d="7170" w="1">
        <s ac="243">we&#39;re</s>
        <s t="810" ac="233"> going</s>
    </p>

means at time 0 is the word we've and at time 0+780 is the word got and at time 2280+810 is the word going, etc. This time is in milliseconds so for time 3090 you'd want to append &t=3 to the URL.

You can use any tool to stitch together the XML into something readable, but here's my Power BI Desktop script to find words like "privilege":

let
    Source = Xml.Tables(File.Contents("C:\Download\body.xml")),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Attribute:format", Int64.Type}}),
    body = #"Changed Type"{0}[body],
    p = body{0}[p],
    #"Changed Type1" = Table.TransformColumnTypes(p,{{"Attribute:t", Int64.Type}, {"Attribute:d", Int64.Type}, {"Attribute:w", Int64.Type}, {"Attribute:a", Int64.Type}, {"Attribute:p", Int64.Type}}),
    #"Expanded s" = Table.ExpandTableColumn(#"Changed Type1", "s", {"Attribute:ac", "Attribute:p", "Attribute:t", "Element:Text"}, {"s.Attribute:ac", "s.Attribute:p", "s.Attribute:t", "s.Element:Text"}),
    #"Changed Type2" = Table.TransformColumnTypes(#"Expanded s",{{"s.Attribute:t", Int64.Type}}),
    #"Removed Other Columns" = Table.SelectColumns(#"Changed Type2",{"s.Attribute:t", "s.Element:Text", "Attribute:t"}),
    #"Replaced Value" = Table.ReplaceValue(#"Removed Other Columns",null,0,Replacer.ReplaceValue,{"s.Attribute:t"}),
    #"Filtered Rows" = Table.SelectRows(#"Replaced Value", each [#"s.Element:Text"] <> null),
    #"Added Custom" = Table.AddColumn(#"Filtered Rows", "Time", each [#"Attribute:t"] + [#"s.Attribute:t"]),
    #"Filtered Rows1" = Table.SelectRows(#"Added Custom", each ([#"s.Element:Text"] = " privilege" or [#"s.Element:Text"] = " privileged" or [#"s.Element:Text"] = " privileges" or [#"s.Element:Text"] = "privilege" or [#"s.Element:Text"] = "privileges"))
in
    #"Filtered Rows1"

You can download the streaming subtitles from YouTube with KeepSubs DownSub.

You can choose from the Automatic Transcript or author supplied close captions. It also offers the possibility to automatically translate the English subtitles into other languages using Google Translate.

Choose Open Transcript from the ... dropdown to the right of the vote up/down and share links.

This will open a Transcript scrolling div on the right side.

You can then use Copy. Note that you cannot use Select All but need to click the top line, then scroll to the bottom using the scroll thumb, and then shift-click on the last line.

Note that you can also search within this text using the normal web page search.

There is a free python tool called YouTube transcript API

You can use it in scripts or as a command line tool:

pip install youtube_transcript_api

I just got this easily done manually by opening the transcript at the beginning of the video and left-clicking and dragging at the time 00:00 marker with the shift key pressed over a few lines at the beginning.

I then advanced the video to near the end. When the video stopped, I clicked the end of the last sentence whilst holding down the shift key once more. With CTRL-C I copied the text to the clipboard and pasted it into an editor.

Done!

Caveat: Be sure to have no RDP-Windows sharing the clipboard or Software such as Teamviewer is running at the same time as this procedure will overflow their buffers where a large amount of text is copied.

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!