Question
I found this code and it works perfectly. The idea: split my data and train KMeansClustering on it. So I create an init hook and an iterator and use them for training.
import tensorflow as tf


class _IteratorInitHook(tf.train.SessionRunHook):
  """Hook to initialize data iterator after session is created."""

  def __init__(self):
    super(_IteratorInitHook, self).__init__()
    self.iterator_initializer_fn = None

  def after_create_session(self, session, coord):
    """Initialize the iterator after the session has been created."""
    del coord
    self.iterator_initializer_fn(session)


# Run K-means clustering.
def _get_input_fn():
  """Helper function to create input function and hook for training.

  Returns:
    input_fn: Input function for k-means Estimator training.
    init_hook: Hook used to load data during training.
  """
  init_hook = _IteratorInitHook()

  def _input_fn():
    """Produces tf.data.Dataset object for k-means training.

    Returns:
      Tensor with the data for training.
    """
    # Placeholder is fed with the actual data when the iterator is initialized.
    features_placeholder = tf.placeholder(tf.float32, my_data.shape)
    delf_dataset = tf.data.Dataset.from_tensor_slices((features_placeholder))
    # Full-batch training: one batch containing the whole dataset.
    delf_dataset = delf_dataset.shuffle(1000).batch(my_data.shape[0])
    iterator = delf_dataset.make_initializable_iterator()

    def _initializer_fn(sess):
      """Initialize dataset iterator, feed in the data."""
      sess.run(iterator.initializer,
               feed_dict={features_placeholder: my_data})

    init_hook.iterator_initializer_fn = _initializer_fn
    return iterator.get_next()

  return _input_fn, init_hook


input_fn, init_hook = _get_input_fn()

output_cluster_dir = 'parameters/clusters'
kmeans = tf.contrib.factorization.KMeansClustering(
    num_clusters=1024,
    model_dir=output_cluster_dir,
    use_mini_batch=False,
)
print('Starting K-means clustering...')
kmeans.train(input_fn, hooks=[init_hook])
But if I change num_clusters to 512 or 256, I get the following error:
InvalidArgumentError: segment_ids[0] = 600 is out of range [0, 256)
[[node UnsortedSegmentSum (defined at /home/mikhail/.conda/envs/tf2/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py:1112) ]] [[node Squeeze (defined at /home/mikhail/.conda/envs/tf2/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py:1112) ]]
It looks like I either have some problem splitting the data into batches, OR my KMeansClustering uses 1024 clusters regardless of the value I set!
I can't figure out what to change to make it work correctly. The traceback is huge; if it's needed I can attach it as a file.
Answer 1:
I found the problem: as you can see, I save the codebook to parameters/clusters. When it was created, TensorFlow saved the graph there too. TensorFlow's default behaviour is to NOT build a new graph if a checkpoint already exists in model_dir, so every time I ran KMeansClustering it kept using the graph loaded from the old codebook (including its old num_clusters). I solved the issue by deleting the clusters folder each time before running KMeansClustering.
I still have one issue: after I create a new codebook and start two scripts in parallel to extract features with it, one of them uses the old codebook and the other the new one! I'm still working around this by brute force, but my recommendation is to restart everything after you create a new codebook (some state may still be loaded in TensorFlow).
Source: https://stackoverflow.com/questions/56337848/cannot-change-number-of-clusters-in-kmeansclustering-tensorflow