What is the best way to handle large data with TensorFlow.js and tf.Tensor?

南方客 2021-01-06 05:13

Question

I am using tf.Tensor and tf.concat() to handle large training data, and I found that calling tf.concat() repeatedly gets slower and slower. What is the best way to load large data into a tf.Tensor?

2 Answers
    遥遥无期 2021-01-06 05:47

    While the tf.concat and Array.push functions look and behave similarly, there is one big difference:

    • tf.concat creates a new tensor from its inputs
    • Array.push appends the input to the existing array in place

    Examples

    tf.concat

    const a = tf.tensor1d([1, 2]);
    const b = tf.tensor1d([3]);
    const c = tf.concat([a, b]);
    
    a.print(); // Result: Tensor [1, 2]
    b.print(); // Result: Tensor [3]
    c.print(); // Result: Tensor [1, 2, 3]
    

    The resulting variable c is a new Tensor while a and b are not changed.

    Array.push

    const a = [1,2];
    a.push(3);
    
    console.log(a); // Result: [1,2,3]
    

    Here, the variable a is directly changed.

    Impact on the runtime

    For runtime speed, this means that tf.concat copies all existing tensor values into a new tensor before appending the input, so each call takes longer the larger the tensor that has to be copied; building a big tensor with repeated tf.concat calls is therefore quadratic in the total size. In contrast, Array.push does not copy the array, so its runtime stays more or less the same no matter how big the array gets.
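
    As a rough illustration of the slow pattern, a loop that grows a tensor with tf.concat copies everything accumulated so far on every iteration, and the intermediate tensors also have to be disposed by hand (a minimal sketch, not code from the question):

    let acc = tf.tensor1d([0]);
    for (let i = 1; i < 1000; i++) {
      const chunk = tf.tensor1d([i]);
      const next = tf.concat([acc, chunk]); // copies all of `acc` plus `chunk`
      acc.dispose();   // free the old tensor, otherwise it leaks
      chunk.dispose();
      acc = next;
    }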

    Note that this is "by design", as tensors are immutable, so every operation on an existing tensor always creates a new tensor. Quote from the docs:

    Tensors are immutable, so all operations always return new Tensors and never modify input Tensors.

    Therefore, if you need to create a large tensor from input data it is advisable to first read all data from your file and merge it with "vanilla" JavaScript functions before creating a tensor from it.
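
    A small sketch of that advice, where loadChunkAsNumbers is a hypothetical stand-in for however you read one chunk of numbers from your file: collect the values with cheap in-place pushes and create a single tensor at the end.

    const values = [];
    for (let i = 0; i < 100; i++) {
      const chunk = loadChunkAsNumbers(i); // hypothetical: returns a plain number[]
      for (const v of chunk) {
        values.push(v);                    // cheap in-place append
      }
    }
    const data = tf.tensor1d(values);      // one single copy into a tensor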

    Handling data too big for memory

    In case you have a dataset so big that you need to handle it in chunks because of memory restrictions, you have two options:

    1. Use the trainOnBatch function
    2. Use a dataset generator

    Option 1: trainOnBatch

    The trainOnBatch function allows you to train on a single batch of data instead of the full dataset. Therefore, you can split your data into reasonably sized batches before training on them, so you don't have to merge everything into one big tensor at once.
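
    A rough sketch of this approach, assuming loadBatch is a hypothetical helper that returns one batch of tensors (random data here as a placeholder for your real chunks):

    const model = tf.sequential();
    model.add(tf.layers.dense({units: 1, inputShape: [10]}));
    model.compile({optimizer: 'sgd', loss: 'meanSquaredError'});

    // Hypothetical loader; in practice it would read one chunk from disk.
    function loadBatch(i) {
      return {xs: tf.randomNormal([32, 10]), ys: tf.randomNormal([32, 1])};
    }

    async function train() {
      for (let i = 0; i < 100; i++) {
        const {xs, ys} = loadBatch(i);
        await model.trainOnBatch(xs, ys); // train on this one batch only
        xs.dispose();
        ys.dispose();
      }
    }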

    Option 2: Dataset generator

    The other answer already went over the basics. This approach lets you use a JavaScript generator function to prepare the data. I recommend using the generator syntax instead of an iterator factory (used in the other answer), as it is the more modern JavaScript syntax.

    Example (taken from the docs):

    function* dataGenerator() {
      const numElements = 10;
      let index = 0;
      while (index < numElements) {
        const x = index;
        index++;
        yield x;
      }
    }
    
    const ds = tf.data.generator(dataGenerator);
    

    You can then use the fitDataset function to train your model.
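
    A sketch of how that could look end to end, assuming model is a compiled tf.LayersModel like the one above and the random tensors stand in for data read from your files:

    function* batchGenerator() {
      for (let i = 0; i < 100; i++) {
        // One training example per yield; replace the random data
        // with the values you read from your files.
        yield {xs: tf.randomNormal([10]), ys: tf.randomNormal([1])};
      }
    }

    const dataset = tf.data.generator(batchGenerator).batch(32);

    async function trainWithDataset() {
      await model.fitDataset(dataset, {epochs: 5});
    }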
