Question
I have a long text file (a screenplay). I want to turn this text file into a list (where every word is separated) so that I can search through it later on.
The code I have at the moment is:
file = open('screenplay.txt', 'r')
words = list(file.read().split())
print words
I think this works to split up all the words into a list, however I'm having trouble removing all the extra stuff like commas and periods at the end of words. I also want to make capital letters lower case (because I want to be able to search in lower case and have both capitalized and lower case words show up). Any help would be fantastic :)
Answer 1:
Try the algorithm from https://stackoverflow.com/a/17951315/284795, i.e. split the text on whitespace, then trim the punctuation. This carefully removes punctuation from the edges of words without harming apostrophes inside words such as we're.
>>> text
"'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'"
>>> text.split()
["'Oh,", 'you', "can't", 'help', "that,'", 'said', 'the', 'Cat:', "'we're", 'all', 'mad', 'here.', "I'm", 'mad.', "You're", "mad.'"]
>>> [word.strip(string.punctuation) for word in text.split()]
['Oh', 'you', "can't", 'help', 'that', 'said', 'the', 'Cat', "we're", 'all', 'mad', 'here', "I'm", 'mad', "You're", 'mad']
You might also want to add a call to .lower() so that capitalized and lower-case words compare equal.
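Putting the two steps together, a minimal sketch might look like the following. It assumes Python 3 syntax, the function name tokenize is made up for illustration, and dropping tokens that consist only of punctuation is an assumption about what is wanted:

```python
import string


def tokenize(text):
    """Split on whitespace, trim punctuation from word edges, lowercase."""
    words = []
    for token in text.split():
        word = token.strip(string.punctuation).lower()
        if word:  # skip tokens that were punctuation only, e.g. "--"
            words.append(word)
    return words


text = "'Oh, you can't help that,' said the Cat: 'we're all mad here.'"
print(tokenize(text))
```

Because strip only touches the ends of each token, the apostrophes inside can't and we're are left alone.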
Answer 2:
This is a job for regular expressions!
For example:
import re
file = open('screenplay.txt', 'r')
# .lower() returns a version with all upper case characters replaced with lower case characters.
text = file.read().lower()
file.close()
# replaces anything that is not a lowercase letter, a space, or an apostrophe with a space:
text = re.sub('[^a-z\ \']+', " ", text)
words = list(text.split())
print words
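For comparison, here is a sketch of the same regex idea written as a function on a string, assuming Python 3 (where print is a function). The name words_from_text is hypothetical and not part of the answer above:

```python
import re


def words_from_text(text):
    # Lowercase first so that the character class only needs a-z.
    text = text.lower()
    # Replace every run of characters that is not a lowercase ASCII
    # letter, a space, or an apostrophe with a single space.
    text = re.sub(r"[^a-z ']+", " ", text)
    return text.split()


print(words_from_text("Hello, World! It's 3 p.m."))
```

Note that digits are discarded and that apostrophes used as quotation marks stay attached to the word, so a follow-up word.strip("'") may be worth considering.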
Answer 3:
A screenplay should be short enough to be read into memory in one fell swoop. If so, you could then remove all punctuation using the translate method. Finally, you can produce your list simply by splitting on whitespace using str.split:
import string
with open('screenplay.txt', 'rb') as f:
    content = f.read()
content = content.translate(None, string.punctuation).lower()
words = content.split()
print words
Note that this will change Mr.Smith into mrsmith. If you'd like it to become ['mr', 'smith'], then you could replace all punctuation with spaces and then use str.split:
def using_translate(content):
    table = string.maketrans(
        string.punctuation,
        ' '*len(string.punctuation))
    content = content.translate(table).lower()
    words = content.split()
    return words
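The snippets above use the Python 2 forms of translate and string.maketrans. On Python 3, a roughly equivalent sketch would use str.maketrans instead. The names PUNCT_TO_SPACE and using_translate_py3 are hypothetical:

```python
import string

# Map every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def using_translate_py3(content):
    # Replace punctuation with spaces, lowercase, then split on whitespace.
    return content.translate(PUNCT_TO_SPACE).lower().split()


print(using_translate_py3("Mr.Smith went to Washington, D.C."))
```

Since only ASCII punctuation is mapped, accented letters pass through untouched.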
One problem you might encounter using a positive regex pattern such as [a-z]+ is that it will only match ASCII characters. If the file has accented characters, the words would get split apart. Gruyère would become ['Gruy','re'].
You could fix that by using re.split to split on punctuation. For example,
def using_re(content):
    words = re.split(r"[ %s\t\n]+" % (string.punctuation,), content.lower())
    return words
However, using str.translate is faster:
In [72]: %timeit using_re(content)
100000 loops, best of 3: 9.97 us per loop
In [73]: %timeit using_translate(content)
100000 loops, best of 3: 3.05 us per loop
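As another way around the accented-character problem described above, a positive pattern can also be made Unicode-aware. The following is only a sketch, assuming Python 3, where str patterns match Unicode word characters by default. The names WORD_RE and find_words_unicode are hypothetical, and keeping apostrophes inside words is an assumption about the desired output:

```python
import re
import unicodedata

# One or more Unicode letters, optionally followed by apostrophe + letters
# (so "can't" stays in one piece). [^\W\d_] means "a word character that is
# neither a digit nor an underscore", i.e. a letter.
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def find_words_unicode(text):
    # Compose "e" + combining accent into a single letter so it matches \w.
    text = unicodedata.normalize("NFC", text)
    return WORD_RE.findall(text.lower())


print(find_words_unicode("Gruyère can't be split, said Mr.Smith."))
```

No timing comparison against the functions above is claimed here.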
Answer 4:
Use the replace method.
mystring = mystring.replace(",", "")
If you want a more elegant solution that you will use many times over, read up on regular expressions. Most languages support them, and they are extremely useful for more complicated replacements.
Answer 5:
You could use a dictionary to specify what characters you don't want, and format the current string based on your choices.
replaceChars = {'.':'',',':'', ' ':''}
print reduce(lambda x, y: x.replace(y, replaceChars[y]), replaceChars, "ABC3.2,1,\nCda1,2,3....".lower())
Output:
abc321
cda123
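The snippet above relies on Python 2, where reduce is a builtin and print is a statement. A sketch of the same idea for Python 3, where reduce has to be imported from functools, could look like this. The names replace_chars and strip_chars are hypothetical:

```python
from functools import reduce

# Characters to replace, mapped to their replacements (all deleted here).
replace_chars = {".": "", ",": "", " ": ""}


def strip_chars(text, replacements=replace_chars):
    # Apply str.replace once per key, feeding each result into the next call.
    return reduce(lambda s, ch: s.replace(ch, replacements[ch]), replacements, text)


print(strip_chars("ABC3.2,1,\nCda1,2,3....".lower()))
```

Because the space character is also in the dictionary, this particular mapping joins words together, so it would need adjusting before splitting a screenplay into words.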
Answer 6:
You can use a simple regexp to create a set of all words (sequences of one or more alphabetic characters):
import re
words = set(re.findall("[a-z]+", f.read().lower()))
Using a set, each word will be included just once. Using findall on its own will instead give you all the words in order.
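The difference between the two can be seen in a small sketch, assuming Python 3. The sample text and the names word_list and word_set are made up for illustration:

```python
import re

text = "The Cat said: the cat is mad. Mad!"

# All words in order of appearance, duplicates included.
word_list = re.findall(r"[a-z]+", text.lower())

# Unique words only; order is not preserved, but membership tests are fast.
word_set = set(word_list)

print(word_list)
print("cat" in word_set)
```

Keep in mind that [a-z]+ also splits on apostrophes, so can't would come out as can and t.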
Answer 7:
You can try something like this. The regexp probably needs some work, though.
import re
text = file.read()
words = map(lambda x: re.sub("[,.!?]", "", x).lower(), text.split())
Source: https://stackoverflow.com/questions/18135967/creating-a-list-of-every-word-from-a-text-file-without-spaces-punctuation