Remove words from a subtitle file that aren't in a wordlist (of common words)

Posted by 青春壹個敷衍的年華 on 2021-02-10 14:51:16

Question


I have some subtitle files, and I don't intend to learn every single word in them. There is no need to learn hard terms such as: cleidocranial, dysplasia...

I found this script here: Remove words from a cell that aren't in a list. But I have no idea how to modify or run it. (I'm using Linux.)

Here is our example:

subtitle file (.srt):

2
00:00:13,000 --> 00:00:15,000
People with cleidocranial dysplasia are good.

wordlist of 3000 common words (.txt):

...
people
with
are
good
...

Output we need (.srt):

2
00:00:13,000 --> 00:00:15,000
People with * * are good.

Or just mark them if it's possible (.srt):

2
00:00:13,000 --> 00:00:15,000
People with cleidocranial* dysplasia* are good.

If there is a solution that works only on plain text (without timecodes), that's OK; just explain how to run it.
Thank you.


Answer 1:


The following processes only the 3rd line of every '.srt' file in the current directory (in the example, that is the subtitle text). It can easily be adapted to process other lines and/or other files.

import os
import re
from glob import glob

with open('words.txt') as f:
    keep_words = {line.strip().lower() for line in f}

for filename_in in glob('*.srt'):
    filename_out = f'{os.path.splitext(filename_in)[0]}_new.srt'
    with open(filename_in) as fin, open(filename_out, 'w') as fout:
        for i, line in enumerate(fin):
            if i == 2:
                parts = re.split(r"([\w']+)", line.strip())
                parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
                line = ''.join(parts) + '\n'
            fout.write(line)

Result (for the subtitle.srt you gave as an example):

! cat subtitle_new.srt
2
00:00:13,000 --> 00:00:15,000
People with * * are good.

Alternative: just add a '*' next to out-of-vocabulary words:

# replace:
#                 parts[1::2] = [w if w.lower() in keep_words else '*' for w in parts[1::2]]
                parts[1::2] = [w if w.lower() in keep_words else f'{w}*' for w in parts[1::2]]

The output is then:

2
00:00:13,000 --> 00:00:15,000
People with cleidocranial* dysplasia* are good.

Explanation:

  • The first open is used to read in all wanted words, make sure they are in lowercase, and put them into a set (for fast membership test).
  • We use glob to find all filenames ending in '.srt'.
  • For each such file, we construct a new filename derived from it as '..._new.srt'.
  • We read in all lines, but modify only line i == 2 (i.e. the 3rd line, since enumerate by default starts at 0).
  • line.strip() removes the trailing newline (along with any other leading or trailing whitespace).
  • We could have used line.strip().split() to split the line into words, but it would have left 'good.' as the last word; not good. The regex used is often used to split words (in particular, it leaves in single quotes such as "don't"; it may or may not be what you want, adapt at will of course).
  • We use a capturing group split r"([\w']+)" instead of splitting on non-word chars, so that we have both words and what separates them in parts. For example, 'People, who are good.' becomes ['', 'People', ', ', 'who', ' ', 'are', ' ', 'good', '.'].
  • The words themselves are every other element of parts, starting at index 1.
  • We replace the words by '*' if their lowercase form is not in keep_words.
  • Finally we re-assemble that line, and generally output all lines to the new file.
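The splitting and replacement steps described above can be tried on their own, without any files. The following is a minimal sketch under stated assumptions: `mask_line` and its `mark` flag are hypothetical names introduced here for illustration, and the tiny `keep_words` set stands in for the real wordlist.

```python
import re

# Same pattern as in the answer: runs of word characters and apostrophes.
WORD_RE = re.compile(r"([\w']+)")


def mask_line(line, keep_words, mark=False):
    """Replace (or, with mark=True, flag) words that are not in keep_words.

    Hypothetical helper, not part of the original answer.
    """
    # Because the pattern has a capturing group, re.split keeps the matches:
    # even indices hold separators, odd indices hold the words.
    parts = WORD_RE.split(line)
    for i in range(1, len(parts), 2):
        word = parts[i]
        if word.lower() not in keep_words:
            parts[i] = f"{word}*" if mark else "*"
    return "".join(parts)


keep_words = {"people", "with", "are", "good"}  # stand-in for words.txt
sentence = "People with cleidocranial dysplasia are good."

print(WORD_RE.split("People, who are good."))
print(mask_line(sentence, keep_words))
print(mask_line(sentence, keep_words, mark=True))
```

Keeping the separators in `parts` is what lets the line be re-joined with its original punctuation and spacing intact.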



Answer 2:


You could simply run a Python script like this:

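with open("words.txt", "rt") as words:
    # build a set of lowercase common words for fast lookup
    common_words = {w.strip().lower() for w in words if w.strip()}

with open("subtitle.srt", "rt") as subtitles:
    with open("subtitle_output.srt", "wt") as out:
        for line in subtitles:
            if line[0].isdigit():
                # index and timecode lines start with a digit; copy them unchanged
                out.write(line)
                continue
            marked = []
            for word in line.split():
                # ignore case and surrounding punctuation when looking the word up
                core = word.strip(".,!?;:\"'()").lower()
                marked.append(word if not core or core in common_words else f"*{word}*")
            # write every line exactly once; blank separator lines stay blank
            out.write(" ".join(marked) + "\n")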

This script wraps every word that is not in the common-words file in asterisks (*word*). It leaves the original file untouched and writes the result to a new output file.
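For a subtitle file with many cues, one possible way to combine the two answers is to recognise index and timecode lines by their shape and apply the regex split to every other line. This is only a sketch under assumptions: `mark_srt` and `TIMECODE_RE` are hypothetical names, the timecode pattern covers only the plain `HH:MM:SS,mmm --> HH:MM:SS,mmm` form shown in the question, and a subtitle text line that consists solely of digits would be mistaken for a cue index.

```python
import re

WORD_RE = re.compile(r"([\w']+)")
# Timecode line as shown in the question, e.g. "00:00:13,000 --> 00:00:15,000".
TIMECODE_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}"
)


def mark_srt(text, keep_words):
    """Wrap out-of-vocabulary words in asterisks on subtitle text lines only.

    Hypothetical helper: works on the file contents as one string.
    """
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        # Copy blank lines, cue indices and timecodes through unchanged.
        if not stripped or stripped.isdigit() or TIMECODE_RE.match(stripped):
            out.append(line)
            continue
        parts = WORD_RE.split(line)
        parts[1::2] = [
            w if w.lower() in keep_words else f"*{w}*" for w in parts[1::2]
        ]
        out.append("".join(parts))
    return "\n".join(out) + "\n"


sample = (
    "1\n"
    "00:00:01,000 --> 00:00:03,000\n"
    "Hello there.\n"
    "\n"
    "2\n"
    "00:00:13,000 --> 00:00:15,000\n"
    "People with cleidocranial dysplasia are good.\n"
)
keep_words = {"hello", "people", "with", "are", "good"}  # stand-in wordlist
print(mark_srt(sample, keep_words), end="")
```

To use it on real files, read the subtitle file into a string, pass it through `mark_srt` together with the set built from the wordlist, and write the returned string to a new file.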



Source: https://stackoverflow.com/questions/65550885/remove-words-from-a-subtitle-file-that-arent-in-a-wordlist-of-common-words
