Extract English words from a string in Python

Submitted by 做~自己de王妃 on 2020-05-08 14:41:02

Question


I have a document in which each line is a string. A line may contain digits, non-English letters and words, and symbols (such as ! and *). I want to extract the English words from each line (English words are separated by spaces). My code below is the map function of my map-reduce job. However, judging by the final result, this mapper only produces frequency counts for single letters (such as a, b, c). Can anyone help me find the bug? Thanks.

import sys
import re

for line in sys.stdin:
    line = re.sub("[^A-Za-z]", "", line.strip())
    line = line.lower()
    words = ' '.join(line.split())
    for word in words:
        print '%s\t%s' % (word, 1)

Answer 1:


You've actually got two problems.

First, this:

line = re.sub("[^A-Za-z]", "", line.strip())

This removes every non-letter from the line, including the spaces, which means you no longer have anything to split on and therefore no way to separate the line into words.
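To make the effect concrete, here is a minimal sketch in Python 3 (the question itself uses Python 2 syntax). The sample line is made up for illustration; only the re.sub call comes from the question.

```python
import re

# Hypothetical sample input, not taken from the question
line = "Hello, world! 123 foo*bar"

# The substitution from the question: every non-letter is deleted,
# and that includes the spaces between the words
stripped = re.sub("[^A-Za-z]", "", line.strip())

# With no whitespace left, split() can only return one big "word"
pieces = stripped.split()

print(stripped)
print(pieces)
```

Here stripped ends up as one unbroken run of letters, so split() has nothing to divide it on.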

Next, even if you didn't do that, you do this:

words = ' '.join(line.split())

This doesn't give you a list of words; it gives you a single string with all those words joined back together by single spaces. (Basically, the original line with every run of whitespace collapsed into a single space.)

So, in the next line, when you do this:

for word in words:

You're iterating over a string, which means each word is a single character, because that's what strings are: iterables of characters.
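A short Python 3 sketch of the difference, using a made-up sample line. It only illustrates the behaviour described above; the variable names joined and chars are introduced here for the example.

```python
# Hypothetical sample input with irregular spacing
line = "the  quick   fox"

# What the question's code builds: one string, whitespace collapsed
joined = ' '.join(line.split())

# Iterating over a string yields one character at a time
chars = [w for w in joined]

# Iterating over the list from split() yields whole words
words = [w for w in line.split()]

print(chars)
print(words)
```

The loop over joined visits letters (and the spaces between them), while the loop over line.split() visits whole words.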

If you want each word (as your variable names imply), you already had them from line.split(); the problem is that you joined them back into a string. Just skip the join:

words = line.split()
for word in words:

Or, if you want to strip out everything besides letters and whitespace, use a regular expression that does exactly that, rather than one that strips the whitespace along with everything else:

line = re.sub(r"[^A-Za-z\s]", "", line.strip())
words = line.split()
for word in words:

However, that pattern is still probably not what you want. Do you really want to turn 'abc1def' into the single string 'abcdef', or into the two strings 'abc' and 'def'? You probably want either this:

line = re.sub(r"[^A-Za-z]", " ", line.strip())
words = line.split()
for word in words:

… or just:

words = [w for w in re.split(r"[^A-Za-z]+", line.strip()) if w]  # "+" merges runs of separators; "if w" drops empty strings at the edges
for word in words:
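Putting the pieces together, a complete mapper might look like the sketch below. It is written for Python 3 and is only one way to arrange it: the helper names extract_words and run_mapper are hypothetical, and the sample input is invented so the block can run without a real input stream.

```python
import io
import re

def extract_words(line):
    """Return the lowercase runs of ASCII letters found in line."""
    # Turn every non-letter into a space so 'abc1def' becomes two words
    cleaned = re.sub(r"[^A-Za-z]", " ", line)
    return cleaned.lower().split()

def run_mapper(stream, out):
    """Write one 'word<TAB>1' record per extracted word."""
    for line in stream:
        for word in extract_words(line):
            out.write("%s\t%s\n" % (word, 1))

# Hypothetical input standing in for sys.stdin
sample = io.StringIO("Hello, World! abc1def\n*** 42 ***\n")
result = io.StringIO()
run_mapper(sample, result)
output = result.getvalue()
print(output, end="")
```

In an actual streaming job you would presumably call run_mapper(sys.stdin, sys.stdout) instead of passing StringIO objects. Note that non-ASCII letters also count as separators here, so a word such as 'naïve' would be split in two.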



Answer 2:


There are two issues here:

  1. line = re.sub("[^A-Za-z]", "", line.strip()) removes all non-letter characters, including spaces, which makes it impossible to split the line into words afterwards. One alternative is to extract the words directly: words = re.findall('[A-Za-z]+', line)

  2. As mentioned by @abarnert, in the existing code words is a string, so for word in words iterates over each character. To get words as a list of words, use the approach in 1.
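As a quick illustration of the re.findall approach, the Python 3 sketch below compares the pattern with and without the + quantifier on a made-up sample line. The quantifier matters: without it, findall matches one letter at a time.

```python
import re

# Hypothetical sample input
line = "Hello, World! abc1def 42"

# Without "+", each match is a single letter
letters = re.findall('[A-Za-z]', line)

# With "+", each match is a maximal run of letters, i.e. a word
words = re.findall('[A-Za-z]+', line)

print(letters[:5])
print(words)
```

Since findall returns only the matches, there are no empty strings to filter out, unlike with re.split.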



Source: https://stackoverflow.com/questions/25716221/extract-english-words-from-string-in-python
