How to treat numbers with decimals or commas as one word in CountVectorizer

試著忘記壹切 submitted on 2021-01-28 18:24:03

Question


I am cleaning text and then passing it to the CountVectorizer function to give me a count of how many times each word appears in the text. The problem is that it is treating 10,000x as two words (10 and 000x). Similarly for 5.00 it is treating 5 and 00 as two different words.

I have tried the following:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Adjacent string literals are concatenated into one document
corpus = [
    "userna lightning strike megawaysnew release there's many "
    "ways win lightning strike megaways. start epic adventure today, seek "
    "mystery symbols, re-spins wild multipliers, mega spins gamble lead wins "
    "10,000x bet!"
]
vectorizer = CountVectorizer()

# Document-term counts as a dense array
result = vectorizer.fit_transform(corpus).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
cols = vectorizer.get_feature_names_out()

res_df45 = pd.DataFrame(result, columns=cols)

In the data frame, both "10" and "000x" are given a count of 1 but I need them to be treated as one word (10,000x). How can I do this?


Answer 1:


The default regex pattern the tokenizer is using for the token_pattern parameter is:

token_pattern='(?u)\\b\\w\\w+\\b'

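So a token starts and ends at a \b word boundary and consists of \w\w+, i.e. one word character (letter, digit or underscore) followed by one or more further word characters. Tokens therefore have at least two characters and never contain punctuation such as . or ,. The backslashes are doubled (\\) only because the pattern is written as a regular Python string literal; the raw string r'(?u)\b\w\w+\b' is the same pattern.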
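To see what the default pattern does to the strings from the question, it can be applied directly with the re module. This is only a minimal sketch: it assumes the vectorizer lowercases the text and then collects tokens with findall, and the sample sentence is made up for illustration.

```python
import re

# Default token pattern, written once with doubled backslashes
# and once as a raw string; both spell the same regex.
ESCAPED = '(?u)\\b\\w\\w+\\b'
RAW = r'(?u)\b\w\w+\b'

# Illustrative sentence (not taken from the original corpus)
text = "Lead wins 10,000x bet for 5.00"
default_tokens = re.findall(RAW, text.lower())

print(ESCAPED == RAW)
print(default_tokens)
```

Note that the single-character piece "5" is dropped entirely, because the default pattern requires at least two word characters.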

So you could change the token pattern to:

token_pattern='\\b(\\w+[\\.,]?\\w+)\\b'

Explanation: [\\.,]? allows for the optional appearance of a . or , inside a token. The leading \w of the default pattern is extended to \w+ so that numbers with more than one digit before the punctuation are matched.
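The effect of the modified pattern can be checked with plain re before handing it to the vectorizer. The sketch below assumes the same findall-based tokenization as above; with a single capturing group, findall returns the group's content. The sample text is illustrative.

```python
import re

# Pattern from the answer, written as a raw string
PATTERN = r"\b(\w+[\.,]?\w+)\b"

# Illustrative text with numbers and ordinary punctuation
text = "mega spins gamble lead wins 10,000x bet! pay 5.00 today, seek more."
tokens = re.findall(PATTERN, text.lower())

print(tokens)
```

A comma or period followed by a space (as in "today, seek") is not absorbed, because the optional punctuation must be followed directly by a word character.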

For your slightly adjusted example:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = ["I am userna lightning strike 2.5 release re-spins there's many 10,000x bet in NA!"]
# Allow one optional . or , inside a token
vectorizer = CountVectorizer(token_pattern='\\b(\\w+[\\.,]?\\w+)\\b')
result = vectorizer.fit_transform(corpus).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
cols = vectorizer.get_feature_names_out()
print(pd.DataFrame(result, columns=cols))

Output:

   10,000x  2.5  am  bet  in  lightning  many  na  re  release  spins  strike  there  userna  
0        1    1   1    1   1          1     1   1   1        1      1       1      1       1  
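One caveat: the pattern allows only a single . or , per token, so a number with several separators is still split. A possible variant, which is my own assumption and not part of the original answer, repeats the separator group. The name MULTI_PATTERN and the sample text are hypothetical.

```python
import re

# Pattern from the answer: at most one separator per token
ANSWER_PATTERN = r"\b(\w+[\.,]?\w+)\b"
# Hypothetical variant: any number of ".chars" or ",chars" groups
MULTI_PATTERN = r"\b\w+(?:[\.,]\w+)*\b"

text = "jackpot 1,234,567 at 2.5"
answer_tokens = re.findall(ANSWER_PATTERN, text)
multi_tokens = re.findall(MULTI_PATTERN, text)

print(answer_tokens)
print(multi_tokens)
```

The trade-off is that this variant also keeps single-character tokens such as "i", which both the default pattern and the answer's pattern drop.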

Alternatively, you could modify your input text, e.g. by replacing the decimal point . with an underscore _ and removing commas that stand between digits.

import re
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = ["I am userna lightning strike 2.5 release re-spins there's many 10,000x bet in NA!"]
for i in range(len(corpus)):
    # 2.5 -> 2_5 (the underscore counts as a word character)
    corpus[i] = re.sub(r"(\d+)\.(\d+)", r"\1_\2", corpus[i])
    # 10,000x -> 10000x
    corpus[i] = re.sub(r"(\d+),(\d+)", r"\1\2", corpus[i])
vectorizer = CountVectorizer()
result = vectorizer.fit_transform(corpus).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
cols = vectorizer.get_feature_names_out()
print(pd.DataFrame(result, columns=cols))

Output:

   10000x  2_5  am  bet  in  lightning  many  na  re  release  spins  strike  there  userna
0       1    1   1    1   1          1     1   1   1        1      1       1      1       1   
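The substitutions above consume the digits on both sides of the separator, so in a number like 1,234,567 only the first comma is removed. A sketch of one way around this uses lookarounds, which match the separator without consuming the neighbouring digits. The helper name normalize_numbers is hypothetical.

```python
import re

def normalize_numbers(text):
    """Rewrite numbers so the default token pattern keeps them whole."""
    # Decimal point between two digits -> underscore
    text = re.sub(r"(?<=\d)\.(?=\d)", "_", text)
    # Comma between two digits -> removed
    text = re.sub(r"(?<=\d),(?=\d)", "", text)
    return text

sample = "wins 1,234,567x at 2.5."
normalized = normalize_numbers(sample)
# For comparison: the group-based substitution on the same number
group_based = re.sub(r"(\d+),(\d+)", r"\1\2", "1,234,567")

print(normalized)
print(group_based)
```

The sentence-final period is left alone because it is not followed by a digit.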


Source: https://stackoverflow.com/questions/57325870/how-to-treat-number-with-decimals-or-with-commas-as-one-word-in-countvectorizer
