How to prepare training data for image classification

Question

I'm new to machine learning and have some problems with image classification. Using a simple classifier, K-Nearest Neighbours, I'm trying to distinguish cats from dogs.

My code so far:

import os
import cv2
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

DATADIR = "/Users/me/Desktop/ds2/ML_image_classification/kagglecatsanddogs_3367a/PetImages"
CATEGORIES = ['Dog', 'Cat']

IMG_SIZE = 30
data = []
categories = []

for category in CATEGORIES:
    path = os.path.join(DATADIR, category) 
    categ_id = CATEGORIES.index(category)
    for img in os.listdir(path):
        try:
            img_array = cv2.imread(os.path.join(path,img), 0)
            new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
            data.append(new_array)
            categories.append(categ_id)
        except Exception:
            # Skip files that OpenCV cannot read or resize (e.g. corrupt images)
            pass

print(data[0])


s1 = pd.Series(data)
s2 = pd.Series(categories)
frame = {'Img array': s1, 'category': s2}
df = pd.DataFrame(frame) 


from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

And here I get an error when trying to fit the data:

   ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-76-9d98d7b11202> in <module>
      2 from sklearn.neighbors import KNeighborsClassifier
      3 
----> 4 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
      5 
      6 print(X_train)

~/opt/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py in train_test_split(*arrays, **options)
   2094         raise TypeError("Invalid parameters passed: %s" % str(options))
   2095 
-> 2096     arrays = indexable(*arrays)
   2097 
   2098     n_samples = _num_samples(arrays[0])

~/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in indexable(*iterables)
    228         else:
    229             result.append(np.array(X))
--> 230     check_consistent_length(*result)
    231     return result
    232 

~/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    203     if len(uniques) > 1:
    204         raise ValueError("Found input variables with inconsistent numbers of"
--> 205                          " samples: %r" % [int(l) for l in lengths])
    206 
    207 

ValueError: Found input variables with inconsistent numbers of samples: [24946, 22451400]

How do I prepare the training data properly? By the way, I don't want to use deep learning yet; that will be my next step.

I would appreciate any help here.

Answer 1:

If you don't use deep learning for image classification, you have to prepare your data in a form that fits a classical supervised learning classifier.

Steps:

1) Resize all images to the same size. You can loop over each image, resize it, and save it.

2) Get the pixel vector of each image and create the dataset. For example, if your cat images are in a "Cat" folder and your dog images are in a "Dog" folder, iterate over all the images in each folder and read their pixel values. At the same time, label the data as "cat" (cat=1) and "non-cat" (non-cat=0).

import os
import imageio
import pandas as pd

catimages = os.listdir("Cat")
dogimages = os.listdir("Dog")
catVec = []
dogVec = []

# Assumes every image was already resized to the same size (step 1),
# so that all flattened vectors have the same length.
for img in catimages:
    img = imageio.imread(f"Cat/{img}")
    ar = img.flatten()  # pixel array -> 1-D vector
    catVec.append(ar)
catdf = pd.DataFrame(catVec)
catdf.insert(loc=0, column="label", value=1)  # cat = 1

for img in dogimages:
    img = imageio.imread(f"Dog/{img}")
    ar = img.flatten()  # pixel array -> 1-D vector
    dogVec.append(ar)
dogdf = pd.DataFrame(dogVec)
dogdf.insert(loc=0, column="label", value=0)  # non-cat = 0

3) Concatenate catdf and dogdf and shuffle the DataFrame.

data = pd.concat([catdf,dogdf])      
data = data.sample(frac=1)

Now you have a labelled dataset for your images.

4) Split the dataset into training and test sets and fit the model.
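
Step 4 could look roughly like the sketch below. It assumes a DataFrame laid out as in steps 2 and 3: a "label" column first, followed by one column per pixel. The random stand-in data, the 30x30 image size, n_neighbors=5 and random_state=0 are illustrative assumptions, not values from this answer.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for the shuffled DataFrame from step 3: 200 fake 30x30 grayscale
# images (900 pixel columns) plus a "label" column. Replace with real data.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(200, 900))
data = pd.DataFrame(pixels)
data.insert(loc=0, column="label", value=rng.integers(0, 2, size=200))

# Separate features from labels and scale pixel values to [0, 1].
X = data.drop(columns="label").to_numpy(dtype=np.float32) / 255.0
y = data["label"].to_numpy()

# X has one row per image, y has one label per image, so the lengths match.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print(X_train.shape, y_train.shape, accuracy)
```

With random pixels the accuracy is meaningless; the point is only that X and y go into train_test_split with the same number of rows.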

Answer 2:

To use classical machine learning for image classification, as mentioned earlier, you need to transform the raw images into vectors or NumPy arrays and extract features from them.

As suggested, the preprocessing steps often include:

Rescaling the images and normalizing them
Converting the images to grayscale if colour does not play a vital role in classification
Doing feature extraction, such as creating feature vectors by applying various computer vision filters for edge detection, pixel density detection, etc.
Finally, splitting the data into training and test sets before feeding it into the model.
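
As a rough illustration of the normalization and feature extraction steps, here is a minimal NumPy-only sketch. The helper name extract_features, the choice of gradient magnitude as a crude edge feature, and the random 30x30 stand-in images are all assumptions made for the example; real projects often use richer descriptors.

```python
import numpy as np

def extract_features(gray_image):
    """Return a 1-D feature vector: normalized pixels plus gradient magnitudes."""
    img = gray_image.astype(np.float32) / 255.0  # scale pixel values to [0, 1]
    gy, gx = np.gradient(img)                    # finite differences along rows, columns
    edges = np.hypot(gx, gy)                     # gradient magnitude as a crude edge map
    return np.concatenate([img.ravel(), edges.ravel()])

# Stand-in for already resized grayscale images (hypothetical data).
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(30, 30), dtype=np.uint8) for _ in range(10)]

# One row per image: 900 pixel features followed by 900 edge features.
X = np.stack([extract_features(im) for im in images])
print(X.shape)  # (10, 1800)
```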

I found the following link, which might be helpful to you: https://medium.com/@dataturks/understanding-svms-for-image-classification-cf4f01232700

From the issue you posted, I think you should check the dimensions of your features and labels: X and y before the split, and X_train, y_train and X_test, y_test after it. The number of samples in the features probably does not match the number of labels.

Do a quick X.shape and y.shape (or X_train.shape and y_train.shape after the split) to see what the dimensions are.
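
The numbers in the error message fit this reading: 24946 x 900 = 22451400, and 900 = 30 x 30 is the pixel count of one resized image. The sketch below assumes X was built by flattening all images into one long 1-D array; the question does not show how X was defined, so that is a guess. The names data, categories and IMG_SIZE mirror the question's code, and the five blank images are stand-ins.

```python
import numpy as np

IMG_SIZE = 30
n_images = 5  # small stand-in for the 24946 images in the question

# Stand-ins for the `data` and `categories` lists built in the question's loop.
data = [np.zeros((IMG_SIZE, IMG_SIZE), dtype=np.uint8) for _ in range(n_images)]
categories = [0, 1, 0, 1, 0]

# Flattening everything gives one entry per pixel, not one row per image.
X_wrong = np.array(data).flatten()
print(X_wrong.shape)  # (4500,)

# What scikit-learn expects: one row per image, one column per pixel.
X = np.array(data).reshape(len(data), -1)
y = np.array(categories)
print(X.shape, y.shape)  # (5, 900) (5,)
```

Once X.shape[0] equals y.shape[0], train_test_split no longer raises the "inconsistent numbers of samples" error.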

Source: https://stackoverflow.com/questions/59294900/how-to-prepare-training-data-for-image-classification

Tags

python

pandas

machine-learning

scikit-learn

data-science