Question
I have a one-dimensional array with a large string in each element. I am trying to use CountVectorizer
to convert the text data into numerical vectors, but I get the following error:
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
mealarray contains a large string in each element, and there are 5000 such samples. I am trying to vectorize it as shown below:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    stop_words='english',
    ngram_range=(1, 1),  # (1, 1) is the default
    dtype='double',
)
data = vectorizer.fit_transform(mealarray)
The full stack trace:
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
self.fixed_vocabulary_)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 748, in _count_vocab
for feature in analyze(doc):
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 234, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 200, in <lambda>
return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
Answer 1:
Check the shape of mealarray. If the argument to fit_transform is an array of strings, it must be a one-dimensional array. (That is, mealarray.shape must be of the form (n,).) For example, you'll get the "no attribute" error if mealarray has a shape such as (n, 1).
You could try something like:
data = vectorizer.fit_transform(mealarray.ravel())
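To make the shape issue concrete, here is a minimal sketch. The three sample strings standing in for mealarray are made up for illustration; only the shapes matter.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for mealarray: a column vector of strings.
mealarray = np.array([
    ["grilled chicken with rice"],
    ["rice and beans"],
    ["chicken soup"],
])
print(mealarray.shape)  # (3, 1): iterating yields 1-element arrays, not strings

# Flatten to one dimension so that each element is a plain string.
flat = mealarray.ravel()
print(flat.shape)  # (3,)

vectorizer = CountVectorizer(stop_words="english")
data = vectorizer.fit_transform(flat)  # one row per document
```

For an (n, 1) array, mealarray.reshape(-1) or mealarray[:, 0] should give the same one-dimensional result as ravel().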
Answer 2:
I found the answer to my question. Basically, CountVectorizer expected a list of strings as its argument rather than the array I was passing. Passing a list solved my problem.
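As far as the scikit-learn documentation describes it, fit_transform takes any iterable of strings, so a list and a one-dimensional array should behave the same way. The sketch below checks this on made-up sample strings (an assumption, not the original data).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample documents.
docs_list = ["grilled chicken with rice", "rice and beans", "chicken soup"]
docs_array = np.array(docs_list)  # one-dimensional array, shape (3,)

from_list = CountVectorizer().fit_transform(docs_list)
from_array = CountVectorizer().fit_transform(docs_array)

# Both inputs produce the same document-term counts.
same = np.array_equal(from_list.toarray(), from_array.toarray())
print(same)  # True
```

Note that calling tolist() on an (n, 1) array gives a list of one-element lists, so a two-dimensional array still needs flattening first, for example with mealarray.ravel().tolist().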
Answer 3:
A better solution is to explicitly select a pandas Series and pass it to CountVectorizer():
>>> tex = df4['Text']
>>> type(tex)
<class 'pandas.core.series.Series'>
>>> X_train_counts = count_vect.fit_transform(tex)
The next one won't work, because it is a DataFrame and not a Series:
>>> tex2 = (df4.ix[0:,[11]])
>>> type(tex2)
<class 'pandas.core.frame.DataFrame'>
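The difference between the two selections can be sketched as follows. The small df4 here is a hypothetical stand-in with a single "Text" column; the real frame in this answer is not shown.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for df4.
df4 = pd.DataFrame({
    "Text": ["grilled chicken with rice", "rice and beans", "chicken soup"],
})

tex = df4["Text"]     # single brackets: a pandas Series
tex2 = df4[["Text"]]  # double brackets: a one-column DataFrame

# Iterating a Series yields its values, but iterating a DataFrame
# yields the column labels, so the vectorizer never sees the rows.
print(list(tex2))  # ['Text']

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(tex)
print(X_train_counts.shape[0])  # 3
```

If only a DataFrame is at hand, selecting the column by name, or using tex2.iloc[:, 0], should give back a Series that can be passed directly.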
Answer 4:
The error message should be enough to track down the bug. Check whether your DataFrame or Series contains any non-string elements, and check specifically for NaN values.
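One way to run that check is sketched below. The sample Series and the cleaning policy (replace NaN with an empty string, cast everything else to str) are assumptions for illustration; dropping the offending rows may suit other data better.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical column containing a NaN and a non-string value.
text = pd.Series(["chicken soup", np.nan, 42, "rice and beans"])

# Flag every element that is not a Python string.
is_str = text.map(lambda v: isinstance(v, str))
print(text[~is_str])  # shows the NaN and the integer

# Assumed cleaning policy: empty string for NaN, str() for the rest.
cleaned = text.fillna("").astype(str)

counts = CountVectorizer().fit_transform(cleaned)
print(counts.shape[0])  # 4
```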
Source: https://stackoverflow.com/questions/26367075/countvectorizer-attributeerror-numpy-ndarray-object-has-no-attribute-lower