Associating region index with true labels

不问归期 提交于 2019-12-11 17:49:43

问题


The documentation is somewhat vague about this whereas I would've thought it'd be a pretty straight-forward thing to implement.

The k_mean algorithm applied to the MNIST digit dataset outputs 10 regions with a certain number associated with it, though it isn't the number represented by most of the digits contained within that region.

I do have my ground_truth label table.

How do I make it so that each region generated by the k_mean algorithm ends up being labeled as the digit which has the highest probability of being covered?

I've spent hours yesterday making up this code to do that, but it's still incomplete:

# TODO: for centroid-average method, see   https://stackoverflow.com/a/25831425/9768291
def most_probable_digit(indices, data):
    """
    Avec un tableau d'indices (d'un label spécifique assigné par scikit, obtenu avec "get_indices_of_label")
    où se situent les vrais labels dans 'data', cette fonction calcule combien de fois chaque vrai label
    apparaît et retourne celui qui est apparu le plus souvent (et donc qui a la plus grande probabilité
    d'être le ground_truth_label désigné par la région délimitée par scikit).
    :param indices: tableau des indices dans 'data' qui font parti d'une région du k_mean
    :param data: toutes les données réparties dans les régions du k_mean
    :return: la valeur (le digit) le plus probable associé à cette région
    """
    actual_labels = []
    for i in indices:
        actual_labels.append(data[i])
    if verbose: print("The actual labels for each of those digits are:", actual_labels)
    counts = count_labels("actual labels", actual_labels)
    probable = counts.index(max(counts))
    if verbose: print("Most probable digit:", probable)
    return probable


def get_list_of_indices(data, label):
    """
    Retourne une liste d'indices correspondant à tous les endroits
    où on peut trouver dans 'data' le 'label' spécifié
    :param data:
    :param label: le numéro associé à une région générée par k_mean
    :return:
    """
    return (np.where(data == label))[0].tolist()


# TODO: reassign in case of doubles
def obtain_corresponding_labels(data, real_labels):
    """
    Assign the most probable label to each region.
    :param data: list of regions associated with x_train or x_test (the order is preserved!)
    :param real_labels: actual labels to assign to the region numbers
    :return: the list of corresponding actual labels to region numbers
    """
    switches_to_make = []

    for i in range(10):
        list_of_indices = get_list_of_indices(data, i)  # indices in 'data' which are associated with region "i"
        probable_label = most_probable_digit(list_of_indices, real_labels)
        print("The assigned region", i, "should be considered as representing the digit ", probable_label)
        switches_to_make.append(probable_label)

    return switches_to_make


def rearrange_labels(switches_to_make, to_change):
    """
    Takes region numbers and assigns the most probable digit (label) to it.
    For example, if switches_to_make[3] = 5, it means that the 4th region (index 3 of the list)
    should be considered as representing the digit "5".
    :param switches_to_make: list of changes to make
    :param to_change: this table will be changed according to 'switches_to_make'
    :return: nothing, the change is made in-situ
    """
    for region in range(len(to_change)):
        for label in range(len(switches_to_make)):
            if to_change[region] == label:                    # if it corresponds to the "wrong" label given by scikit
                to_change[region] = switches_to_make[label]   # assign the "most probable" label
                break


def count_error_rate(found, truth):
    wrong = 0
    for i in range(len(found)):
        if found[i] != truth[i]:
            wrong += 1
    print("Error rate =     ", wrong / len(found) * 100, "%\n\n")


def treat_data(switches_to_make, predictions, truth):
    rearrange_labels(switches_to_make, predictions)    # Rearranging the training labels
    count_error_rate(predictions, truth)               # Counting error rate

For now, the problem with my code is that it can generate duplicates (if two regions have the same highest probability digit, that digit is associated with both regions).

Here is how I use the code:

kmeans = KMeans(n_clusters=10)  # TODO: eventually use "init=ndarray" to be able to use custom centroids for init ?
kmeans.fit(x_train)
training_labels = kmeans.labels_
print("Done with calculating the k-mean.\n")

switches_to_make = utils.obtain_corresponding_labels(training_labels, y_train)  # Obtaining the most probable labels

utils.treat_data(switches_to_make, training_labels, y_train)
print("Assigned labels:   ", training_labels)
print("Real labels:       ", y_train)


print("\n####################################################\nMoving on to predictions")
predictions = kmeans.predict(x_test)
utils.treat_data(switches_to_make, predictions, y_test)

I obtain approximately a 50% error rate with my code.


回答1:


If I understand you correctly, you want to assign the actual digit value as a cluster label that matches that cluster, correct? If that is the case, I don't think it is possible.

K-Means is an unsupervised learning algorithm. It does not understand what it is looking at and the labels it assigns are arbitrary. Instead of 0, 1, 2, ... it could have called them 'apple', 'orange', 'grape' ... . All K-Means can ever do, is to tell you that a bunch of data points are similar to each other based on some metric, that is all. It is great for data exploration or pattern finding. But not for telling you "What" it actually is.

It does not matter what post processing you do, because the computer can never, programmatically, know what the true labels are, unless you, the human, tell it. In which case you might as well use a supervised learning algorithm.

If you want to train a model, that, when it see's a number, it can then assign the correct label to it, you must use a supervised learning method (where labels are a thing). Look into Random Forest instead, for instance. Here is a similar endeavor.




回答2:


Here is the code to use my solution:

from sklearn.cluster import KMeans

import utils

# Extraction du dataset
x_train, y_train = utils.get_train_data()
x_test,  y_test  = utils.get_test_data()

kmeans = KMeans(n_clusters=10)
kmeans.fit(x_train)
training_labels = kmeans.labels_

switches_to_make = utils.find_closest_digit_to_centroids(kmeans, x_train, y_train)  # Obtaining the most probable labels (digits) for each region

utils.treat_data(switches_to_make, training_labels, y_train)

predictions = kmeans.predict(x_test)
utils.treat_data(switches_to_make, predictions, y_test)

And utils.py:

import csv
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import pairwise_distances_argmin_min


use_reduced = True  # Flag variable to use the reduced datasets (generated by 'pre_process.py')
verbose = False  # Should debugging prints be shown


def get_data(reduced_path, path):
    """
    Pour obtenir le dataset désiré.
    :param reduced_path: path vers la version réduite (générée par 'pre_process.py')
    :param path: path vers la version complète
    :return: numpy arrays (data, labels)
    """
    if use_reduced:
        data = open(reduced_path)
    else:
        data = open(path)
    csv_file = csv.reader(data)
    data_points = []
    for row in csv_file:
        data_points.append(row)
    data_points.pop(0)  # On enlève la première ligne, soit les "headers" de nos colonnes
    data.close()

    # Pour passer de String à int
    for i in range(len(data_points)):  # for each image
        for j in range(len(data_points[0])):  # for each pixel
            data_points[i][j] = int(data_points[i][j])
            # # Pour obtenir des valeurs en FLOAT normalisées entre 0 et 1:
            # data_points[i][j] =  np.divide(float(data_points[i][j]), 255)

    # Pour séparer les labels du data
    y_train = []  # labels
    for row in data_points:
        y_train.append(row[0])  # first column is the label
    x_train = []  # data
    for row in data_points:
        x_train.append(row[1:785])  # other columns are the pixels

    x_train = np.array(x_train)
    y_train = np.array(y_train)
    print("Done with loading the dataset.")

    return x_train, y_train


def get_test_data():
    """
    Retourne le dataset de test désiré.
    :return: numpy arrays (data, labels)
    """
    return get_data('../data/reduced_mnist_test.csv', '../data/mnist_test.csv')


def get_train_data():
    """
    Retourne le dataset de training désiré.
    :return: numpy arrays (data, labels)
    """
    return get_data('../data/reduced_mnist_train.csv', '../data/mnist_train.csv')


def display_data(x_train, y_train):
    """
    Affiche le digit voulu.
    :param x_train: le data (784D)
    :param y_train: le label associé
    :return:
    """
    # Exemple pour afficher: conversion de notre vecteur d'une dimension en 2 dimensions
    matrix = np.reshape(x_train, (28, 28))
    plt.imshow(matrix, cmap='gray')
    plt.title("Voici un " + str(y_train))
    plt.show()


def generate_mean_images(x_train, y_train):
    """
    Retourne le tableau des images moyennes pour chaque classe
    :param x_train:
    :param y_train:
    :return:
    """
    counts = np.zeros(10).astype(int)

    for label in y_train:
        counts[label] += 1

    sum_pixel_values = np.zeros((10, 784)).astype(int)

    for img in range(len(y_train)):
        for pixel in range(len(x_train[0])):
            sum_pixel_values[y_train[img]][pixel] += x_train[img][pixel]

    pixel_probability = np.zeros((len(counts), len(x_train[0])))  # (10, 784)

    for classe in range(len(counts)):
        for pixel in range(len(x_train[0])):
            pixel_probability[classe][pixel] = np.divide(sum_pixel_values[classe][pixel] + 1, counts[classe] + 2)

    mean_images = []

    if verbose:
        plt.figure(figsize=(20, 4))  # values of the size of the plot: (x,y) in INCHES
        plt.suptitle("Such wow, much impress !")

        for classe in range(len(counts)):
            class_mean = np.reshape(pixel_probability[classe], (28, 28))
            mean_images.append(class_mean)

            # Aesthetics
            plt.subplot(1, 10, classe + 1)
            plt.title(str(classe))
            plt.imshow(class_mean, cmap='gray')
            plt.xticks([])
            plt.yticks([])

        plt.show()

    return mean_images


#########
# used for "k_mean" (for now)


def count_labels(name, data):
    """
    S'occupe de compter le nombre de data associé à chacun des labels.
    :param name: nom de ce que l'on compte
    :param data: doit être 1D
    :return: counts = le nombre pour chaque label
    """
    header = "-- " + str(name) + " -- "  # making sure it's a String
    counts = [0]*10  # initializing the counting array

    for label in data:
        counts[label] += 1
    if verbose: print(header, "Amounts for each label:", counts)

    return counts


def get_list_of_indices(data, label):
    """
    Retourne une liste d'indices correspondant à tous les endroits
    où on peut trouver dans 'data' le 'label' spécifié
    :param data:
    :param label: le numéro associé à une région générée par k_mean
    :return:
    """
    return (np.where(data == label))[0].tolist()


def rearrange_labels(switches_to_make, to_change):
    """
    Takes region numbers and assigns the most probable digit (label) to it.
    For example, if switches_to_make[3] = 5, it means that the 4th region (index 3 of the list)
    should be considered as representing the digit "5".
    :param switches_to_make: list of changes to make
    :param to_change: this table will be changed according to 'switches_to_make'
    :return: nothing, the change is made in-situ
    """
    for region in range(len(to_change)):
        for label in range(len(switches_to_make)):
            if to_change[region] == label:                    # if it corresponds to the "wrong" label given by scikit
                to_change[region] = switches_to_make[label]   # assign the "most probable" label
                break


def count_error_rate(found, truth):
    wrong = 0
    for i in range(len(found)):
        if found[i] != truth[i]:
            wrong += 1
    percent = wrong / len(found) * 100

    print("Error rate =     ", percent, "%")
    return percent


def treat_data(switches_to_make, predictions, truth):
    rearrange_labels(switches_to_make, predictions)    # Rearranging the training labels
    count_error_rate(predictions, truth)               # Counting error rate


# TODO: reassign in case of doubles
# adapted from  https://stackoverflow.com/a/45275056/9768291
def find_closest_digit_to_centroids(kmean, data, labels):
    """
    The array 'closest' will contain the index of the point in 'data' that is closest to each centroid.
    Let's say the 'closest' gave output as array([0,8,5]) for the three clusters. So data[0] is the
    closest point in 'data' to centroid 0, and data[8] is the closest to centroid 1 and so on.
    If the returned list is [9,4,2,1,3] it would mean that the region #0 (index 0) represents the digit 9 the best.
    :param kmean: the variable where the 'fit' data has been stored
    :param data: the actual data that was used with 'fit' (x_train)
    :param labels: the true labels associated with 'data' (y_train)
    :return: list where each region is at its index and the value at that index is the digit it represents
    """
    closest, _ = pairwise_distances_argmin_min(kmean.cluster_centers_,
                                               data,
                                               metric="euclidean")

    switches_to_make = []
    for region in range(len(closest)):
        truth = labels[closest[region]]
        print("The assigned region", region, "should be considered as representing the digit ", truth)
        switches_to_make.append(truth)

    print("Digits associated to each region (switches_to_make):", switches_to_make)
    return switches_to_make

Essentially, here is the function that solved my problems:

# adapted from  https://stackoverflow.com/a/45275056/9768291
def find_closest_digit_to_centroids(kmean, data, labels):
    """
    The array 'closest' will contain the index of the point in 'data' that is closest to each centroid.
    Let's say the 'closest' gave output as array([0,8,5]) for the three clusters. So data[0] is the
    closest point in 'data' to centroid 0, and data[8] is the closest to centroid 1 and so on.
    If the returned list is [9,4,2,1,3] it would mean that the region #0 (index 0) represents the digit 9 the best.
    :param kmean: the variable where the 'fit' data has been stored
    :param data: the actual data that was used with 'fit' (x_train)
    :param labels: the true labels associated with 'data' (y_train)
    :return: list where each region is at its index and the value at that index is the digit it represents
    """
    closest, _ = pairwise_distances_argmin_min(kmean.cluster_centers_,
                                               data,
                                               metric="euclidean")

    switches_to_make = []
    for region in range(len(closest)):
        truth = labels[closest[region]]
        print("The assigned region", region, "should be considered as representing the digit ", truth)
        switches_to_make.append(truth)

    print("Digits associated to each region (switches_to_make):", switches_to_make)
    return switches_to_make


来源:https://stackoverflow.com/questions/52764453/associating-region-index-with-true-labels

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!