Creating a list of every word from a text file without spaces, punctuation

佐手、 submitted on 2019-12-05 00:16:01

Question


I have a long text file (a screenplay). I want to turn this text file into a list (where every word is separated) so that I can search through it later on.

The code I have at the moment is:

file = open('screenplay.txt', 'r')
words = list(file.read().split())
print words

I think this splits all the words into a list, but I'm having trouble removing the extra characters, like commas and periods, at the end of words. I also want to make capital letters lowercase (so that a lowercase search matches both capitalized and lowercase words). Any help would be fantastic :)


Answer 1:


Try the algorithm from https://stackoverflow.com/a/17951315/284795, i.e. split the text on whitespace, then trim punctuation. This removes punctuation from the edges of words without harming apostrophes inside words such as we're.

>>> text
"'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'"

>>> text.split()
["'Oh,", 'you', "can't", 'help', "that,'", 'said', 'the', 'Cat:', "'we're", 'all', 'mad', 'here.', "I'm", 'mad.', "You're", "mad.'"]

>>> import string
>>> [word.strip(string.punctuation) for word in text.split()]
['Oh', 'you', "can't", 'help', 'that', 'said', 'the', 'Cat', "we're", 'all', 'mad', 'here', "I'm", 'mad', "You're", 'mad']

You might also want to add a .lower() call on each word so that everything ends up lowercase.
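A minimal sketch of the split-then-strip approach with the lowercasing folded in, written for Python 3. The function name words_from_text and the sample sentence are my own, not from the linked answer. It also drops tokens that consist of punctuation only, which would otherwise survive as empty strings.

```python
import string

def words_from_text(text):
    """Split on whitespace, trim edge punctuation, lowercase, drop empties."""
    stripped = (token.strip(string.punctuation).lower() for token in text.split())
    # A token such as "--" strips down to "", so filter those out.
    return [word for word in stripped if word]

sample = "'Oh, you can't help that,' said the Cat: 'we're all mad here.' -- Alice"
print(words_from_text(sample))
```

For the screenplay you would pass in the result of reading the file, e.g. words_from_text(open('screenplay.txt').read()).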




Answer 2:


This is a job for regular expressions!

For example:

import re
file = open('screenplay.txt', 'r')
# .lower() returns a version with all upper case characters replaced with lower case characters.
text = file.read().lower()
file.close()
# replaces anything that is not a lowercase letter, a space, or an apostrophe with a space:
text = re.sub(r"[^a-z ']+", " ", text)
words = text.split()
print words
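One thing to watch for: because the substitution keeps every apostrophe, an apostrophe used as a quotation mark stays glued to its word ('oh instead of oh). A possible variation, sketched here for Python 3, is to match the words directly and allow an apostrophe only between letters. The pattern and the name regex_words are illustrative assumptions, not part of the answer above.

```python
import re

# One or more letters, optionally followed by apostrophe-plus-letters groups,
# so "can't" and "we're" stay whole while edge quotes are left out.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def regex_words(text):
    return WORD_RE.findall(text.lower())

print(regex_words("'Oh, you can't help that,' said the Cat."))
```

Like the original pattern, this only recognizes the ASCII letters a to z.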



Answer 3:


A screenplay should be short enough to be read into memory in one fell swoop. If so, you can remove all punctuation using the translate method, then produce your list simply by splitting on whitespace with str.split:

import string

with open('screenplay.txt', 'rb') as f:
    content = f.read()
    content = content.translate(None, string.punctuation).lower()
    words = content.split()

print words

Note that this will change Mr.Smith into mrsmith. If you'd like it to become ['mr', 'smith'], you could replace all punctuation with spaces and then use str.split:

def using_translate(content):
    table = string.maketrans(
        string.punctuation,
        ' '*len(string.punctuation))
    content = content.translate(table).lower()
    words = content.split()
    return words
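The two snippets above use Python 2 spellings: string.maketrans and the translate(None, chars) form are not available for str objects in Python 3. A rough Python 3 equivalent, assuming the text is read as str rather than bytes, could look like this (the names PUNCT_TO_SPACE, PUNCT_DELETE and using_translate_py3 are mine):

```python
import string

# Table that maps every punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

# Table that deletes punctuation outright (third argument = characters to drop).
PUNCT_DELETE = str.maketrans("", "", string.punctuation)

def using_translate_py3(content):
    return content.translate(PUNCT_TO_SPACE).lower().split()

print(using_translate_py3("Mr.Smith went to Washington, D.C."))
print("Mr.Smith".translate(PUNCT_DELETE).lower())
```

Keep in mind that replacing punctuation with spaces also splits contractions, so can't becomes ['can', 't'].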

One problem you might encounter with a positive regex pattern such as [a-z]+ is that it only matches ASCII letters. If the file has accented characters, the words get split apart: Gruyère would become ['Gruy','re'].

You could fix that by using re.split to split on punctuation. For example,

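import re

def using_re(content):
    # re.escape keeps characters such as ] and \ literal inside the character class
    pattern = r"[ %s\t\n]+" % re.escape(string.punctuation)
    return re.split(pattern, content.lower())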

However, using str.translate is faster:

In [72]: %timeit using_re(content)
100000 loops, best of 3: 9.97 us per loop

In [73]: %timeit using_translate(content)
100000 loops, best of 3: 3.05 us per loop
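On the accented-character point above: in Python 3, str patterns are Unicode-aware by default, so a positive pattern can also keep words like Gruyère intact. The sketch below is an assumption-laden illustration, not something from this answer. It uses [^\W\d_], which matches word characters other than decimal digits and the underscore (mostly letters), and the name unicode_words is hypothetical.

```python
import re

# Word characters minus decimal digits and underscore: roughly "any letter".
LETTERS_RE = re.compile(r"[^\W\d_]+")

def unicode_words(text):
    return LETTERS_RE.findall(text.lower())

print(unicode_words("Gruy\u00e8re and caf\u00e9 au lait, 2 cups"))
```

This assumes the file was decoded to str with the right encoding, for example open('screenplay.txt', encoding='utf-8').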



Answer 4:


Use the replace method.

mystring = mystring.replace(",", "")

If you want a more elegant solution that you will use many times over, read up on regular expressions. Most languages support them, and they are extremely useful for more complicated replacements.
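To cover more than commas with replace, one option is to call it once per punctuation character. This is only a sketch in Python 3, and remove_punctuation is a name I made up for it:

```python
import string

def remove_punctuation(text):
    """Call str.replace once for each character in string.punctuation."""
    for char in string.punctuation:
        text = text.replace(char, "")
    return text

print(remove_punctuation("Wait, what?! No...").lower().split())
```

This makes one pass over the text per punctuation character, and it also removes apostrophes inside words (can't becomes cant).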




Answer 5:


You could use a dictionary to specify what characters you don't want, and format the current string based on your choices.

replaceChars = {'.':'',',':'', ' ':''}
print reduce(lambda x, y: x.replace(y, replaceChars[y]), replaceChars, "ABC3.2,1,\nCda1,2,3....".lower())

Output:

abc321
cda123
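The snippet above relies on Python 2, where reduce is a builtin and print is a statement. A Python 3 version of the same idea might look like the following, with reduce imported from functools. The helper name apply_replacements is my own.

```python
from functools import reduce

replace_chars = {".": "", ",": "", " ": ""}

def apply_replacements(text, replacements):
    # Iterating over a dict yields its keys, so each reduce step
    # replaces one key with its mapped value.
    return reduce(lambda s, old: s.replace(old, replacements[old]), replacements, text)

print(apply_replacements("ABC3.2,1,\nCda1,2,3....".lower(), replace_chars))
```

Note that mapping ' ' to '' also joins neighbouring words together, so for building a word list you would probably leave the space out of the dictionary.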



Answer 6:


You can use a simple regexp to create a set of all words (sequences of one or more alphabetic characters):

import re
with open('screenplay.txt') as f:
    words = set(re.findall("[a-z]+", f.read().lower()))

With a set, each word is included just once.

Using findall on its own (without set) instead gives you all the words in order, duplicates included.
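To make the difference concrete, here is a small Python 3 sketch on an in-memory string. The function names and the sample sentence are illustrative only.

```python
import re

def word_list(text):
    """All words in order of appearance, duplicates kept."""
    return re.findall(r"[a-z]+", text.lower())

def word_set(text):
    """Each distinct word once; order is not preserved."""
    return set(word_list(text))

sample = "The cat saw the other cat."
print(word_list(sample))
print("cat" in word_set(sample))
```

If you only need to check whether a word occurs, the set is the natural fit; if you need positions or counts, keep the list.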




Answer 7:


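You can try something like this. The regexp probably needs some work, though.

import re
with open('screenplay.txt') as f:
    text = f.read()
# strip a few common punctuation marks from each word, then lowercase it
words = [re.sub("[,.!?]", "", word).lower() for word in text.split()]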


Source: https://stackoverflow.com/questions/18135967/creating-a-list-of-every-word-from-a-text-file-without-spaces-punctuation
