How to use tf.data.Dataset with kedro?

烈酒焚心 submitted on 2021-02-11 14:58:23


I am using tf.data.Dataset to prepare a streaming dataset which is used to train a tf.keras model. With kedro, is there a way to create a node that returns the created tf.data.Dataset so it can be used in the next training node?

The MemoryDataset will probably not work because a tf.data.Dataset cannot be pickled (deepcopy isn't possible), see also this SO question. According to issue #91, the deep copy in MemoryDataset is done to prevent the data from being modified by some other node. Can someone please elaborate a bit more on why/how this concurrent modification could happen?
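To make the concern concrete, here is a minimal stdlib-only sketch of how one node could change data that another node also reads when both receive the same object. It does not use kedro; `node_a`, `node_b` and the list data are hypothetical names chosen for illustration, and this may well be only part of what issue #91 has in mind.

```python
import copy


def node_a(data):
    # Hypothetical node that (carelessly) mutates its input in place.
    data.append(99)
    return sum(data)


def node_b(data):
    # Hypothetical node that only reads its input.
    return len(data)


# Case 1: both nodes receive the very same object.
shared = [1, 2, 3]
a_result = node_a(shared)
b_sees_shared = node_b(shared)  # node_b now sees the extra element

# Case 2: each node receives its own deep copy.
original = [1, 2, 3]
node_a(copy.deepcopy(original))
b_sees_copy = node_b(copy.deepcopy(original))  # unaffected by node_a
```

In this sketch no threads are involved: the "modification by another node" is simply a side effect through a shared reference, which a deep copy rules out.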

From the docs, there seems to be a copy_mode = "assign" option. Would it be possible to use this option when the data is not picklable?
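As a rough illustration of the difference between copying and assigning, the sketch below uses a hypothetical stand-in class, `TinyMemoryStore`, that is not kedro's MemoryDataset and only mimics the idea of a copy mode. A plain Python generator stands in for the non-picklable streaming dataset; all names here are assumptions.

```python
import copy


class TinyMemoryStore:
    """Hypothetical stand-in for an in-memory dataset; NOT kedro's class."""

    def __init__(self, copy_mode="deepcopy"):
        self._copy_mode = copy_mode
        self._data = None

    def _copy(self, data):
        if self._copy_mode == "assign":
            return data  # hand over the same object, no copy made
        return copy.deepcopy(data)

    def save(self, data):
        self._data = self._copy(data)

    def load(self):
        return self._copy(self._data)


def make_stream():
    # A generator cannot be deep-copied, much like the streaming dataset.
    return (i * i for i in range(4))


# Deep-copy mode fails on the non-picklable object.
try:
    TinyMemoryStore(copy_mode="deepcopy").save(make_stream())
    deepcopy_ok = True
except TypeError:
    deepcopy_ok = False

# Assign mode passes the object through untouched.
store = TinyMemoryStore(copy_mode="assign")
stream = make_stream()
store.save(stream)
same_object = store.load() is stream
values = list(store.load())
```

The trade-off in this sketch is that assign mode gives up the isolation a deep copy provides, so any node that consumes or mutates the object affects every later reader.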

Another solution (also mentioned in issue #91) is to use a plain function that generates the streaming dataset inside the training node, without a preceding dataset generation node. However, I am not sure what the drawbacks of this approach would be (if any). It would be great if someone could give some examples.
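The function-based approach could look roughly like the sketch below. It is stdlib-only: a generator stands in for the streaming dataset, and `make_stream`, `train_node` and the parameter names are hypothetical. Only the picklable parameters would cross the node boundary, while the stream itself is built and consumed inside the training node.

```python
def make_stream(params):
    # Hypothetical helper: yields batches lazily instead of storing them.
    n_samples = params["n_samples"]
    batch_size = params["batch_size"]
    for start in range(0, n_samples, batch_size):
        stop = min(start + batch_size, n_samples)
        yield list(range(start, stop))


def train_node(params):
    # The stream is created here, so it never has to be copied or saved.
    stream = make_stream(params)
    n_batches = 0
    n_seen = 0
    for batch in stream:
        n_batches += 1  # a real node would run a training step here
        n_seen += len(batch)
    return {"n_batches": n_batches, "n_seen": n_seen}


summary = train_node({"n_samples": 10, "batch_size": 4})
```

One likely drawback, under these assumptions, is that the stream is no longer a named dataset in the pipeline, so other nodes cannot reuse or inspect it without calling the helper themselves.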

Also, I would like to avoid storing the complete output of the streaming dataset, for example as tfrecords or a comparable on-disk format, as these options would use a lot of disk storage.

Is there a way to pass just the created object to use it for the training node?

