Scikit-learn: Loading images from folder to create a labelled dataset for KNN classification

Question

I want to do handwritten digit recognition using K-Nearest Neighbours classification with scikit-learn. I have a folder that has 5001 images of handwritten digits (500 images for each digit from 0-9).

I am trying to find a way to create a dataset based on these images, so that I can then create a training and testing set. I have read a lot of online tutorials about how to do K-Nearest Neighbours classification using scikit-learn but most of the tutorials load existing datasets such as the MNIST dataset of handwritten digits.

Is there any way to create your own dataset by reading images from a folder and then assigning a label to each image? I am not sure what methods I can use to do this. Any insights are appreciated.

Answer 1:

To read the data, you could do something like this:

from os import listdir
from os.path import isfile, join
import matplotlib.pyplot as plt

mypath = '.'  # edit with the path to your data
files = [f for f in listdir(mypath) if isfile(join(mypath, f))]

x = []
y = []

for file in files:
    # assuming each image is named like "eight_1.png", the label is "eight"
    label = file.split('_')[0]
    y.append(label)
    # join with mypath so this also works when mypath is not the current directory
    img = plt.imread(join(mypath, file))
    x.append(img)

You will then need to reshape x and y a little before passing them to scikit-learn (each image has to become one row of a 2-D array), but you should be fine.
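The reshaping mentioned above could look roughly like the following sketch. It assumes every image has the same shape, and it uses small random arrays as stand-ins for the loaded images; `to_feature_matrix` is a hypothetical helper name, not part of scikit-learn.

```python
import numpy as np


def to_feature_matrix(images, labels):
    """Stack equally sized images into a (n_samples, n_features) array."""
    # scikit-learn estimators expect a 2-D array: one flattened image per row.
    X = np.stack([np.asarray(img).reshape(-1) for img in images])
    # Labels become a 1-D array with one entry per row of X.
    y = np.asarray(labels)
    return X, y


# Stand-in data (assumption): three 28x28 grayscale "images" with string labels.
rng = np.random.default_rng(0)
x = [rng.random((28, 28)) for _ in range(3)]
y = ["eight", "one", "eight"]

X_arr, y_arr = to_feature_matrix(x, y)
print(X_arr.shape, y_arr.shape)  # (3, 784) (3,)
```

scikit-learn classifiers accept string labels such as "eight" directly, so converting them to integers is optional.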

Answer 2:

Does this help?

import os
import imageio


# Map the word prefix of each file name to its integer label.
WORD_TO_LABEL = {
    'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4,
    'five': 5, 'six': 6, 'seven': 7, 'eight': 8, 'nine': 9,
}


def convert_word_to_label(word):
    # Returns None if the word is not a known digit name.
    return WORD_TO_LABEL.get(word)



def create_dataset(path):
    X = []
    y = []

    # Walk the folder tree and pick up every .jpg file.
    for root, _dirs, files in os.walk(path):
        for image in files:
            if image.endswith('.jpg'):
                image_path = os.path.join(root, image)
                img = imageio.imread(image_path)
                X.append(img)
                # File names are assumed to look like "eight_1.jpg".
                word = image.split('_')[0]
                y.append(convert_word_to_label(word))
    return X, y

if __name__ == '__main__':
    X, y = create_dataset('path/to/image_folder/')
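
Once X and y are built, a train/test split and a KNN fit could look like the sketch below. Random arrays stand in for the loaded images (an assumption, so the printed accuracy is meaningless here), and `n_neighbors=3` and `test_size=0.2` are arbitrary illustrative choices rather than recommended values.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for the output of create_dataset (assumption): 100 random 8x8
# grayscale images with 10 samples per digit.
rng = np.random.default_rng(0)
X = [rng.random((8, 8)) for _ in range(100)]
y = [i % 10 for i in range(100)]

# Flatten each image into one row of a 2-D feature matrix.
X_flat = np.asarray(X).reshape(len(X), -1)
y_arr = np.asarray(y)

# Hold out 20% of the images for testing, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X_flat, y_arr, test_size=0.2, stratify=y_arr, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # accuracy on the held-out images
```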

Answer 3:

You can use the Pillow or OpenCV libraries to read your images.

For Pillow:

from PIL import Image
import numpy as np

img = Image.open("image_location/image_name")  # this returns an image object
img = np.asarray(img)  # convert it to an ndarray

For OpenCV:

import cv2

img = cv2.imread("image_location/image_name", cv2.IMREAD_GRAYSCALE)

To convert all of your images you can use, for example, the os library:

import os

Create a list of your image file names:

loc = os.listdir('your_images_folder')

To store grayscale images with one color channel you can preallocate an array:

n_pixels = 28 * 28  # width * height of one image; 28x28 is only an example, use your own size
data = np.ones((len(loc), n_pixels))


for i, name in enumerate(loc):

    # Full image path
    path = os.path.join("your_images_folder", name)

    img = np.asarray(Image.open(path))

    # Make a vector from the image
    img = img.reshape(-1)

    # Store this vector
    data[i, :] = img

As a result, you will get a NumPy array "data" for your classification project. The "y" vector can also be built in the same loop from the name of each image.
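
Deriving a label from each file name could be sketched as follows. This assumes files are named like "eight_1.png", as in the other answers; `WORD_TO_DIGIT` and `label_from_filename` are hypothetical names chosen for illustration.

```python
import os

# Hypothetical mapping from the word prefix of a file name to its digit.
WORD_TO_DIGIT = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
}


def label_from_filename(filename):
    """Return the digit encoded in a name such as 'eight_1.png'."""
    # Drop any directory part and the extension, then keep the text before '_'.
    stem = os.path.splitext(os.path.basename(filename))[0]
    word = stem.split("_")[0]
    return WORD_TO_DIGIT[word]


names = ["eight_1.png", "zero_17.png", "three_2.png"]
y = [label_from_filename(n) for n in names]
print(y)  # [8, 0, 3]
```

Inside the loop above, the same call can be made once per file so that row i of "data" and entry i of "y" stay aligned.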

To track progress in the loop, the tqdm library can be a good fit for showing a progress bar. You can store RGB images in the same way; for RGB images, img.reshape(-1) will return a longer vector.
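
The difference in vector length between grayscale and RGB images can be checked with NumPy alone. The 28x28 image size below is an assumption, and the arrays are stand-ins rather than images read from disk.

```python
import numpy as np

# Stand-in images (assumption): 28x28 pixels, not read from disk.
gray = np.zeros((28, 28), dtype=np.uint8)    # one channel
rgb = np.zeros((28, 28, 3), dtype=np.uint8)  # three channels

gray_vec = gray.reshape(-1)
rgb_vec = rgb.reshape(-1)

print(gray_vec.shape)  # (784,)
print(rgb_vec.shape)   # (2352,)
```

The preallocated array therefore needs three times as many columns when the images are stored as RGB.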

Source: https://stackoverflow.com/questions/56848253/scikit-learn-loading-images-from-folder-to-create-a-labelled-dataset-for-knn-cl

Tags

python

file

scikit-learn