How do I split a list of phrases into words so I can use counter on them?

Submitted by ↘锁芯ラ on 2019-12-14 02:58:36

Question


My data are conversation threads from a web forum. I wrote a function that cleans a post by removing stop words, punctuation, and the like. I then wrote a loop that cleans every post in my CSV file and appends the result to a list, and finally I ran a word count on that list. My problem is that the list contains unicode phrases rather than individual words. How can I split the phrases into individual words that I can count? Here is my code:

def post_to_words(raw_post):
    HTML_text = BeautifulSoup(raw_post).get_text()
    letters_only = re.sub("[^a-zA-Z]", " ", HTML_text)
    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    meaningful_words = [w for w in words if w not in stops]
    return " ".join(meaningful_words)

clean_Post_Text = post_to_words(fiance_forum["Post_Text"][0])
clean_Post_Text_split = clean_Post_Text.lower().split()
num_Post_Text = fiance_forum["Post_Text"].size
clean_posts_list = [] 

for i in range(0, num_Post_Text):
    clean_posts_list.append( post_to_words( fiance_forum["Post_Text"][i]))

from collections import Counter
counts = Counter(clean_posts_list)
print(counts)

My output looks like this: u'please follow instructions notice move receiver': 1. I want it to look like this:

please: 1

follow: 1

instructions: 1

and so on. Thanks so much!


Answer 1:


You already have a list of words, so you don't need to split anything. Drop the str.join call, i.e. " ".join(meaningful_words), create a single Counter, and update it on each call to post_to_words. You are also doing far too much work: all you need to do is iterate over fiance_forum["Post_Text"], passing each element to the function. You also only need to create the set of stop words once, not on every iteration:

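from collections import Counter
import re

from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def post_to_words(raw_post, st):
    # Strip HTML, keep letters only, and lower-case before splitting into words.
    HTML_text = BeautifulSoup(raw_post, "html.parser").get_text()
    letters_only = re.sub("[^a-zA-Z]", " ", HTML_text)
    words = letters_only.lower().split()
    # Return a generator of the words that are not stop words.
    return (w for w in words if w not in st)

cn = Counter()
st = set(stopwords.words("english"))  # build the stop word set once
for post in fiance_forum["Post_Text"]:
    cn.update(post_to_words(post, st))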

That also avoids the need to create a huge list of words by doing the counting as you go.
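To see the count-as-you-go idea without needing the forum data, here is a minimal, self-contained sketch in Python 3. The stop word set and the sample posts are made up for illustration (they stand in for the NLTK list and for fiance_forum["Post_Text"]), and the HTML stripping step is left out to keep it short:

```python
import re
from collections import Counter

# Hypothetical stand-in for set(stopwords.words("english")); not the NLTK list.
STOP_WORDS = {"the", "a", "to", "and", "of"}

def post_to_words(raw_post, stop_words):
    """Return a generator of lower-cased words in one post, minus stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", raw_post)
    return (w for w in letters_only.lower().split() if w not in stop_words)

# Made-up sample posts standing in for fiance_forum["Post_Text"].
posts = ["Please follow the instructions!", "Follow the notice, please."]

word_counts = Counter()
for post in posts:
    # update() consumes the generator, so no intermediate word list is built.
    word_counts.update(post_to_words(post, STOP_WORDS))

print(word_counts.most_common(2))
```

Because Counter.update accepts any iterable, the generator can be passed in directly and each post's words are discarded once counted.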




Answer 2:


You were almost there; all you need to do is split the string into words:

>>> from collections import Counter
>>> Counter('please follow instructions notice move receiver'.split())
Counter({'follow': 1,
         'instructions': 1,
         'move': 1,
         'notice': 1,
         'please': 1,
         'receiver': 1})
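If the list of cleaned phrases already exists, as with clean_posts_list in the question, the same split can be applied to every phrase in one pass. The following is a sketch in Python 3 with made-up phrases, not the asker's actual data:

```python
from collections import Counter

# Made-up phrases standing in for the question's clean_posts_list.
clean_posts_list = [
    "please follow instructions notice move receiver",
    "please move receiver",
]

# Split every phrase and feed the resulting words into a single Counter.
counts = Counter(word for phrase in clean_posts_list for word in phrase.split())

# Print one "word: count" line per word, most frequent first.
for word, count in counts.most_common():
    print(f"{word}: {count}")
```

This keeps the existing cleaning loop unchanged and only alters how the Counter is built.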


Source: https://stackoverflow.com/questions/37423684/how-do-i-split-a-list-of-phrases-into-words-so-i-can-use-counter-on-them
