Question
I found this code and it works perfectly. The idea: split my data and train KMeansClustering on it. So I create an init hook and an iterator and use them for training.
import tensorflow as tf


class _IteratorInitHook(tf.train.SessionRunHook):
  """Hook to initialize data iterator after session is created."""

  def __init__(self):
    super(_IteratorInitHook, self).__init__()
    self.iterator_initializer_fn = None

  def after_create_session(self, session, coord):
    """Initialize the iterator after the session has been created."""
    del coord
    self.iterator_initializer_fn(session)


# Run K-means clustering.
def _get_input_fn():
  """Helper function to create input function and hook for training.

  Returns:
    input_fn: Input function for k-means Estimator training.
    init_hook: Hook used to load data during training.
  """
  init_hook = _IteratorInitHook()

  def _input_fn():
    """Produces tf.data.Dataset object for k-means training.

    Returns:
      Tensor with the data for training.
    """
    # Placeholder is fed with the actual data when the iterator is initialized.
    features_placeholder = tf.placeholder(tf.float32, my_data.shape)
    delf_dataset = tf.data.Dataset.from_tensor_slices((features_placeholder))
    # Full-batch training: one batch containing the whole dataset.
    delf_dataset = delf_dataset.shuffle(1000).batch(my_data.shape[0])
    iterator = delf_dataset.make_initializable_iterator()

    def _initializer_fn(sess):
      """Initialize dataset iterator, feed in the data."""
      sess.run(iterator.initializer,
               feed_dict={features_placeholder: my_data})

    init_hook.iterator_initializer_fn = _initializer_fn
    return iterator.get_next()

  return _input_fn, init_hook


input_fn, init_hook = _get_input_fn()

output_cluster_dir = 'parameters/clusters'
kmeans = tf.contrib.factorization.KMeansClustering(
    num_clusters=1024,
    model_dir=output_cluster_dir,
    use_mini_batch=False,
)
print('Starting K-means clustering...')
kmeans.train(input_fn, hooks=[init_hook])
But if I change num_clusters to 512 or 256, I get the following error:
InvalidArgumentError: segment_ids[0] = 600 is out of range [0, 256)
[[node UnsortedSegmentSum (defined at /home/mikhail/.conda/envs/tf2/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py:1112) ]] [[node Squeeze (defined at /home/mikhail/.conda/envs/tf2/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py:1112) ]]
It looks like I either have some problem splitting the data into batches, OR my KMeansClustering uses 1024 clusters regardless of the value I set!
I can't figure out what to change to make it work correctly. The traceback is huge; if it's needed I can attach it as a file.
Answer 1:
I found the problem: as you can see, I save the codebook to parameters/clusters. When it was created, TensorFlow saved the graph there too. TensorFlow's default behaviour is to NOT build a new graph if a checkpoint already exists in model_dir, so every time I ran KMeansClustering it kept using the graph loaded from the old codebook (including its old num_clusters). I solved the issue by deleting the clusters folder each time before running KMeansClustering.
I still have one issue: after I create a new codebook and start two scripts in parallel to extract features with it, one of them uses the old codebook and the other the new one! I'm still working around this by brute force, but my recommendation is to restart everything after you create a new codebook (some state may still be loaded in TensorFlow).
Source: https://stackoverflow.com/questions/56337848/cannot-change-number-of-clusters-in-kmeansclustering-tensorflow