Question
I'm new to machine learning and am having some problems with image classification. Using a simple classification technique, K Nearest Neighbours, I'm trying to distinguish cats from dogs.
My code so far:
import os
import cv2
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
DATADIR = "/Users/me/Desktop/ds2/ML_image_classification/kagglecatsanddogs_3367a/PetImages"
CATEGORIES = ['Dog', 'Cat']
IMG_SIZE = 30
data = []
categories = []
for category in CATEGORIES:
    path = os.path.join(DATADIR, category)
    categ_id = CATEGORIES.index(category)
    for img in os.listdir(path):
        try:
            img_array = cv2.imread(os.path.join(path,img), 0)
            new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
            data.append(new_array)
            categories.append(categ_id)
        except Exception as e:
            # print(e)
            pass
print(data[0])
s1 = pd.Series(data)
s2 = pd.Series(categories)
frame = {'Img array': s1, 'category': s2}
df = pd.DataFrame(frame)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
And here I get an error when trying to fit the data:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-76-9d98d7b11202> in <module>
2 from sklearn.neighbors import KNeighborsClassifier
3
----> 4 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
5
6 print(X_train)
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py in train_test_split(*arrays, **options)
2094 raise TypeError("Invalid parameters passed: %s" % str(options))
2095
-> 2096 arrays = indexable(*arrays)
2097
2098 n_samples = _num_samples(arrays[0])
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in indexable(*iterables)
228 else:
229 result.append(np.array(X))
--> 230 check_consistent_length(*result)
231 return result
232
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
203 if len(uniques) > 1:
204 raise ValueError("Found input variables with inconsistent numbers of"
--> 205 " samples: %r" % [int(l) for l in lengths])
206
207
ValueError: Found input variables with inconsistent numbers of samples: [24946, 22451400]
How do I prepare the training data properly? By the way, I don't want to use deep learning yet. That will be the next step for me.
I would appreciate any help here.
Answer 1:
If you don't use deep learning for image classification, you have to prepare your data so that it fits a supervised learning classifier.
Steps:
1) Resize all images to the same size. You can loop over each image, resize it, and save it.
2) Get the pixel vector of each image and create the dataset. For example, if your cat images are in a "Cat" folder and your dog images are in a "Dog" folder, iterate over all images inside each folder and get the pixel values. At the same time, label the data as "cat" (cat=1) or "non-cat" (non-cat=0).
import os
import imageio
import pandas as pd

catimages = os.listdir("Cat")
dogimages = os.listdir("Dog")
catVec = []
dogVec = []

# Flatten every cat image into a 1-D pixel vector (images must already share one size)
for img in catimages:
    img = imageio.imread(f"Cat/{img}")
    ar = img.flatten()
    catVec.append(ar)
catdf = pd.DataFrame(catVec)
catdf.insert(loc=0,column ="label",value=1)  # cat = 1

# Same for the dog images, labelled as non-cat
for img in dogimages:
    img = imageio.imread(f"Dog/{img}")
    ar = img.flatten()
    dogVec.append(ar)
dogdf = pd.DataFrame(dogVec)
dogdf.insert(loc=0,column ="label",value=0)  # non-cat = 0
3) Concatenate catdf and dogdf and shuffle the DataFrame.
data = pd.concat([catdf,dogdf])
data = data.sample(frac=1)
Now you have a dataset with a label for each of your images.
4) Split the dataset into training and test sets and fit the model.
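Step 4 could look roughly like the sketch below. It is not the answerer's code: the random pixel values only stand in for the DataFrame built in step 3 so that the example runs without the image folders, and n_neighbors=5, random_state=42 and stratify=y are illustrative choices rather than recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for the shuffled DataFrame from step 3: a "label" column followed
# by one column per pixel. Random pixels are an assumption used only so the
# sketch runs without the Cat/Dog folders.
rng = np.random.default_rng(0)
n_images, n_pixels = 200, 30 * 30
pixels = rng.integers(0, 256, size=(n_images, n_pixels))
data = pd.DataFrame(pixels)
data.insert(loc=0, column="label", value=rng.integers(0, 2, size=n_images))

# Features are every column except the label; scale pixel values to [0, 1].
X = data.drop(columns="label").to_numpy(dtype=np.float32) / 255.0
y = data["label"].to_numpy()

# Hold out 30% of the rows for testing, keeping the class ratio similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

With random pixels the accuracy will hover around chance; the point is only that X has one row per image and y has one label per row.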
Answer 2:
To use classical machine learning for image classification, as mentioned earlier, you need to transform the raw images into vectors or NumPy arrays and extract features from them.
As suggested, the preprocessing steps often include:
- Rescaling the images and normalizing them
- Converting the images to grayscale if colour does not play a vital role in classification
- Doing feature extraction, such as creating feature vectors by applying various computer vision filters for edge detection, pixel density detection, etc.
- Finally, splitting the data into training and test sets before feeding it into the model.
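The preprocessing steps listed above might be sketched like this, using only NumPy. The helper names to_grayscale and extract_features are hypothetical, the finite-difference gradient is just one simple stand-in for an edge-detection filter, and the random images are an assumption so that the example runs without any files.

```python
import numpy as np

def to_grayscale(image):
    """Collapse an (H, W, 3) RGB array to (H, W) using ITU-R BT.601 luma weights."""
    if image.ndim == 2:  # already grayscale
        return image.astype(np.float32)
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return image[..., :3].astype(np.float32) @ weights

def extract_features(image):
    """Return a 1-D feature vector: normalized pixels plus edge magnitudes."""
    gray = to_grayscale(image) / 255.0        # normalize to [0, 1]
    grad_rows, grad_cols = np.gradient(gray)  # finite-difference gradients
    edges = np.hypot(grad_rows, grad_cols)    # simple edge-strength map
    return np.concatenate([gray.ravel(), edges.ravel()])

# Random stand-in images (an assumption; real images would be loaded and
# resized to a common size first).
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(30, 30, 3), dtype=np.uint8) for _ in range(4)]

# One row per image: 900 pixel features followed by 900 edge features.
X = np.stack([extract_features(img) for img in images])
print(X.shape)  # (4, 1800)
```

The resulting X can then be split into training and test sets and passed to a classifier such as KNeighborsClassifier.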
I found the following link, which might be helpful to you: https://medium.com/@dataturks/understanding-svms-for-image-classification-cf4f01232700
From the error you posted, I think you should check the dimensions of X_train, y_train and X_test, y_test. The training data probably does not match your training labels.
Run a quick X_train.shape and y_train.shape to see what the dimensions are.
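As a rough illustration of that check: the two lengths in the error message are related by 24946 × 30 × 30 = 22451400, which suggests that one of the arrays holds every pixel of every image as a separate sample. The sketch below assumes `data` is a list of 30×30 grayscale arrays and `categories` a list of integer labels, as in the question; the random arrays are stand-ins so it runs without the dataset.

```python
import numpy as np

IMG_SIZE = 30

# Stand-ins for the `data` and `categories` lists from the question:
# a list of (30, 30) grayscale arrays and a list of integer labels.
rng = np.random.default_rng(0)
data = [
    rng.integers(0, 256, size=(IMG_SIZE, IMG_SIZE), dtype=np.uint8)
    for _ in range(10)
]
categories = [i % 2 for i in range(10)]

# One row per image, one column per pixel: shape (n_samples, 900).
X = np.array(data).reshape(len(data), -1)
y = np.array(categories)

print(X.shape, y.shape)  # (10, 900) (10,)

# train_test_split requires the first dimensions to agree.
assert X.shape[0] == y.shape[0], "X and y must have the same number of samples"
```

If the first dimensions of X and y agree, train_test_split(X, y, test_size=0.3) should no longer raise the inconsistent-samples error.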
Source: https://stackoverflow.com/questions/59294900/how-to-prepare-training-data-for-image-classification