CountVectorizer: AttributeError: 'numpy.ndarray' object has no attribute 'lower'

问题

I have a one-dimensional array with large strings in each of the elements. I am trying to use a CountVectorizer to convert text data into numerical vectors. However, I am getting an error saying:

AttributeError: 'numpy.ndarray' object has no attribute 'lower'

mealarray contains large strings in each of the elements. There are 5000 such samples. I am trying to vectorize this as given below:

vectorizer = CountVectorizer(
    stop_words='english',
    ngram_range=(1, 1),  #ngram_range=(1, 1) is the default
    dtype='double',
)
data = vectorizer.fit_transform(mealarray)

The full stacktrace :

File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
    self.fixed_vocabulary_)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 748, in _count_vocab
    for feature in analyze(doc):
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 234, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 200, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'

回答1:

Check the shape of mealarray. If the argument to fit_transform is an array of strings, it must be a one-dimensional array. (That is, mealarray.shape must be of the form (n,).) For example, you'll get the "no attribute" error if mealarray has a shape such as (n, 1).

You could try something like

data = vectorizer.fit_transform(mealarray.ravel())

回答2:

Got the answer to my question. Basically, CountVectorizer is taking lists (with string contents) as an argument rather than array. That solved my problem.

回答3:

A better solution is explicit call pandas series and pass it CountVectorizer():

>>> tex = df4['Text']
>>> type(tex)
<class 'pandas.core.series.Series'>
X_train_counts = count_vect.fit_transform(tex)

Next one won't work, cause its a frame and NOT series

>>> tex2 = (df4.ix[0:,[11]])
>>> type(tex2)
<class 'pandas.core.frame.DataFrame'>

回答4:

The error should be sufficient to get rid of the bug. Check if your dataframe or series has non string type element. Also, do check specifically if there are any nan values.

来源：https://stackoverflow.com/questions/26367075/countvectorizer-attributeerror-numpy-ndarray-object-has-no-attribute-lower

标签

python

numpy

scikit-learn

text-classification