Creating a list of every word from a text file without spaces, punctuation

Submitted by 佐手、 on 2019-12-05 00:16:01


I have a long text file (a screenplay). I want to turn this text file into a list (where every word is separated) so that I can search through it later on.

The code I have at the moment is:

file = open('screenplay.txt', 'r')
words = list(file.read().split())
print words

I think this works to split up all the words into a list, however I'm having trouble removing all the extra stuff like commas and periods at the end of words. I also want to make capital letters lower case (because I want to be able to search in lower case and have both capitalized and lower case words show up). Any help would be fantastic :)


Try this approach: split the text on whitespace, then trim punctuation from each word. This removes punctuation from the edges of words without harming apostrophes inside words such as we're.

>>> text
"'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'"

>>> text.split()
["'Oh,", 'you', "can't", 'help', "that,'", 'said', 'the', 'Cat:', "'we're", 'all', 'mad', 'here.', "I'm", 'mad.', "You're", "mad.'"]

>>> import string
>>> [word.strip(string.punctuation) for word in text.split()]
['Oh', 'you', "can't", 'help', 'that', 'said', 'the', 'Cat', "we're", 'all', 'mad', 'here', "I'm", 'mad', "You're", 'mad']

You might also want to add a .lower() call on each word, so that searches are case-insensitive.
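Putting the splitting, trimming and lowercasing together, a minimal sketch in Python 3 syntax might look like the following. The function name words_from_text is only for illustration, and dropping tokens that consist entirely of punctuation is an assumption about what you want.

```python
import string

def words_from_text(text):
    """Split on whitespace, trim punctuation from both ends, lowercase."""
    stripped = (token.strip(string.punctuation).lower() for token in text.split())
    # A token such as "--" strips down to an empty string, so skip those.
    return [word for word in stripped if word]

sample = "'Oh, you can't help that,' said the Cat: 'we're all mad here.'"
print(words_from_text(sample))
```

Because strip only touches the ends of each token, the apostrophes inside can't and we're are left alone.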


This is a job for regular expressions!

For example:

import re
file = open('screenplay.txt', 'r')
# .lower() returns a version with all upper case characters replaced with lower case characters.
text = file.read().lower()
file.close()
# replaces anything that is not a lowercase letter, a space, or an apostrophe with a space:
text = re.sub(r"[^a-z ']+", " ", text)
words = text.split()
print words
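One thing to watch with this pattern: because apostrophes are kept, single quotes used as quotation marks survive at the edges of words ('oh stays 'oh). A possible refinement, sketched in Python 3 syntax with a hypothetical helper name regex_words, strips them after splitting:

```python
import re

def regex_words(text):
    """Lowercase, blank out everything except a-z, space and apostrophe, split."""
    cleaned = re.sub(r"[^a-z ']+", " ", text.lower())
    # Apostrophes inside words (can't) are kept; ones at the edges are quotes.
    words = (token.strip("'") for token in cleaned.split())
    return [word for word in words if word]

print(regex_words("'Oh, you can't help that,' said the Cat."))
```

Note that digits and accented letters are also replaced by spaces here, since the character class only allows a-z.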


A screenplay should be short enough to be read into memory in one fell swoop. If so, you could then remove all punctuation using the translate method. Finally, you can produce your list simply by splitting on whitespace using str.split:

import string

with open('screenplay.txt', 'rb') as f:
    content = f.read()
    content = content.translate(None, string.punctuation).lower()
    words = content.split()

print words

Note that this will change Mr.Smith into mrsmith. If you'd like it to become ['mr', 'smith'] then you could replace all punctuation with spaces, and then use str.split:

def using_translate(content):
    table = string.maketrans(
        string.punctuation,
        ' '*len(string.punctuation))
    content = content.translate(table).lower()
    words = content.split()
    return words
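The translate(None, ...) call and string.maketrans used above are Python 2 idioms. On Python 3, a rough equivalent, assuming content is a text (str) object rather than bytes, can be built with str.maketrans. The function names below are made up for illustration.

```python
import string

# Table that deletes every punctuation character (third argument = chars to drop).
DELETE_PUNCT = str.maketrans("", "", string.punctuation)
# Table that maps every punctuation character to a space instead.
SPACE_PUNCT = str.maketrans(string.punctuation, " " * len(string.punctuation))

def delete_punctuation(content):
    return content.translate(DELETE_PUNCT).lower().split()

def space_punctuation(content):
    return content.translate(SPACE_PUNCT).lower().split()

print(delete_punctuation("Mr.Smith said: hello!"))
print(space_punctuation("Mr.Smith said: hello!"))
```

The first call glues Mr.Smith into a single word, while the second splits it in two, which matches the behaviour described above.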

One problem you might encounter using a positive regex pattern such as [a-z]+ is that it will only match ASCII letters. If the file has accented characters, the words would get split apart: Gruyère would become ['Gruy','re'].

You could fix that by using re.split to split on punctuation. For example,

import re

def using_re(content):
    # split on runs of spaces, tabs, newlines and punctuation characters
    words = re.split(r"[ %s\t\n]+" % (string.punctuation,), content.lower())
    return words

However, using str.translate is faster:

In [72]: %timeit using_re(content)
100000 loops, best of 3: 9.97 us per loop

In [73]: %timeit using_translate(content)
100000 loops, best of 3: 3.05 us per loop
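On the accented-character point, another option on Python 3 is a positive pattern that is Unicode-aware. This is a sketch, not something measured in the timings above: for str patterns \w matches word characters in any script, so [^\W\d_]+ matches, roughly, runs of letters. The name unicode_words is hypothetical.

```python
import re

# \w covers letters, digits and underscore; excluding \d and _ leaves letters.
LETTERS = re.compile(r"[^\W\d_]+")

def unicode_words(content):
    return LETTERS.findall(content.lower())

print(unicode_words("Gruyère, s'il vous plaît!"))
```

This keeps gruyère in one piece, but it does split on apostrophes (s'il becomes s and il), so it trades one edge case for another.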


Use the replace method.

mystring = mystring.replace(",", "")

If you want a more elegant solution that you will use many times over, read up on regular expressions. Most languages support them, and they are extremely useful for more complicated replacements.


You could use a dictionary to specify what characters you don't want, and format the current string based on your choices.

replaceChars = {'.':'',',':'', ' ':''}
print reduce(lambda x, y: x.replace(y, replaceChars[y]), replaceChars, "ABC3.2,1,\nCda1,2,3....".lower())
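On Python 3, reduce lives in functools and print is a function. A sketch of the same idea follows. It leaves spaces alone so that the result can still be split into words. That choice, the sample sentence, and the names replace_chars and apply_replacements are assumptions for illustration.

```python
from functools import reduce

replace_chars = {".": "", ",": "", "!": "", "?": ""}

def apply_replacements(text, table):
    # Iterating over a dict yields its keys; each step replaces one character.
    return reduce(lambda acc, ch: acc.replace(ch, table[ch]), table, text)

cleaned = apply_replacements("Hello, world. How are you?".lower(), replace_chars)
print(cleaned.split())
```

Each key costs one full pass over the string, so for a long list of characters a translation table or a single regular expression is likely to scale better.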




You can use a simple regexp to create a set of all words (sequences of one or more alphabetic characters):

import re
# text is assumed to hold the contents of the file
words = set(re.findall("[a-z]+", text.lower()))

With a set, each word is included just once.

Just using findall will instead give you all the words in order.
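To make the difference concrete, here is a small sketch in Python 3 syntax, with a made-up sample string standing in for the file contents:

```python
import re

text = "The cat saw the other cat."

# Every word, in order of appearance, repeats included.
ordered = re.findall("[a-z]+", text.lower())
# Each distinct word once; a set has no defined order.
unique = set(ordered)

print(ordered)
print("cat" in unique)
```

If you only need to ask whether a word occurs at all, the set is the better fit. If you care about position or frequency, keep the list.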


You can try something like this. The regexp probably needs some work, though.

import re
text = open('screenplay.txt').read()  # read the whole file into a string
words = map(lambda x: re.sub("[,.!?]", "", x).lower(), text.split())
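One way the regexp could be extended, sketched here in Python 3 syntax under the assumption that you want to drop every ASCII punctuation mark except the apostrophe, is to build the character class from string.punctuation. The names PUNCT and clean_words are hypothetical. A list comprehension is used because map returns a lazy iterator on Python 3.

```python
import re
import string

# Every ASCII punctuation mark except the apostrophe, escaped for use in a class.
PUNCT = re.compile("[%s]" % re.escape(string.punctuation.replace("'", "")))

def clean_words(text):
    cleaned = (PUNCT.sub("", token).lower() for token in text.split())
    # Tokens made only of punctuation (such as "--") end up empty; drop them.
    return [word for word in cleaned if word]

print(clean_words("Wait -- you can't be serious?!"))
```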