How to extract closed caption transcript from YouTube video?

后端 未结 10 592
囚心锁ツ
囚心锁ツ 2020-12-22 19:32

Is it possible to extract the closed caption transcript from YouTube videos?

We have over 200 webcasts on YouTube and each is at least one hour long. YouTube has clo

10条回答
  •  温柔的废话
    2020-12-22 19:51

    (Obligatory 'this is probably an internal youtube.com interface and might break at any time')

    Instead of linking to another tool that does this, here's an answer to the question of "how to do this"

    Use fiddler or your browser devtools (e.g. Chrome) to inspect the youtube.com HTTP traffic, and there's a response from /api/timedtext that contains the closed caption info as XML.

    It seems that a response like this:

        

    we've got

    we're going

    means at time 0 is the word we've and at time 0+780 is the word got and at time 2280+810 is the word going, etc. This time is in milliseconds so for time 3090 you'd want to append &t=3 to the URL.

    You can use any tool to stitch together the XML into something readable, but here's my Power BI Desktop script to find words like "privilege":

    let
        Source = Xml.Tables(File.Contents("C:\Download\body.xml")),
        #"Changed Type" = Table.TransformColumnTypes(Source,{{"Attribute:format", Int64.Type}}),
        body = #"Changed Type"{0}[body],
        p = body{0}[p],
        #"Changed Type1" = Table.TransformColumnTypes(p,{{"Attribute:t", Int64.Type}, {"Attribute:d", Int64.Type}, {"Attribute:w", Int64.Type}, {"Attribute:a", Int64.Type}, {"Attribute:p", Int64.Type}}),
        #"Expanded s" = Table.ExpandTableColumn(#"Changed Type1", "s", {"Attribute:ac", "Attribute:p", "Attribute:t", "Element:Text"}, {"s.Attribute:ac", "s.Attribute:p", "s.Attribute:t", "s.Element:Text"}),
        #"Changed Type2" = Table.TransformColumnTypes(#"Expanded s",{{"s.Attribute:t", Int64.Type}}),
        #"Removed Other Columns" = Table.SelectColumns(#"Changed Type2",{"s.Attribute:t", "s.Element:Text", "Attribute:t"}),
        #"Replaced Value" = Table.ReplaceValue(#"Removed Other Columns",null,0,Replacer.ReplaceValue,{"s.Attribute:t"}),
        #"Filtered Rows" = Table.SelectRows(#"Replaced Value", each [#"s.Element:Text"] <> null),
        #"Added Custom" = Table.AddColumn(#"Filtered Rows", "Time", each [#"Attribute:t"] + [#"s.Attribute:t"]),
        #"Filtered Rows1" = Table.SelectRows(#"Added Custom", each ([#"s.Element:Text"] = " privilege" or [#"s.Element:Text"] = " privileged" or [#"s.Element:Text"] = " privileges" or [#"s.Element:Text"] = "privilege" or [#"s.Element:Text"] = "privileges"))
    in
        #"Filtered Rows1"
    

提交回复
热议问题