Question
Does BigQuery ML automatically split the dataset for training and evaluation? Or do we have to manually set aside 80% of the dataset for training, 10% for validation, and 10% for evaluation when using logistic regression in BigQuery ML? If both are possible, which would be better?
Thanks
Answer 1:
Yes, BigQuery ML will automatically split the data for its validation process. It is also fairly common practice to manually split off a holdout set so you can perform additional validation on data that the model has never seen.
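As a rough illustration of the manual 80/10/10 holdout mentioned above, here is a minimal Python sketch that assigns each row to a split by hashing a stable key. The function name assign_split, the row-N keys, and the use of SHA-256 are assumptions made for this example; they are not part of BigQuery ML. Inside BigQuery, a similar effect is often achieved by hashing a key column with FARM_FINGERPRINT and taking it modulo 100.

```python
import hashlib


def assign_split(row_key: str, train_pct: int = 80, validation_pct: int = 10) -> str:
    """Deterministically map a row key to 'train', 'validation', or 'test'.

    Hypothetical helper: the same key always lands in the same split,
    so the holdout set stays stable when the data is reloaded.
    """
    digest = hashlib.sha256(row_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + validation_pct:
        return "validation"
    return "test"


# Made-up keys; in practice this would be a unique ID column of the table.
keys = [f"row-{i}" for i in range(10_000)]
counts = {"train": 0, "validation": 0, "test": 0}
for key in keys:
    counts[assign_split(key)] += 1

print(counts)  # roughly 80% / 10% / 10%, not exact
```

Hashing a key rather than drawing random numbers keeps the assignment reproducible, which matters if the holdout rows must never leak into later training runs.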
You can use the DATA_SPLIT_METHOD argument to tell BigQuery ML how you want to split the data. The default split is AUTO_SPLIT, which is defined as follows:
When there are fewer than 500 rows in the input data, all rows are used as training data. When there are between 500 and 50,000 rows in the input data, 20% of the data is used as evaluation data in a RANDOM split. When there are more than 50,000 rows in the input data, only 10,000 of them are used as evaluation data in a RANDOM split.
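The AUTO_SPLIT rule quoted above can be restated as a small Python function. This is only a sketch of the thresholds as described here, not BigQuery's own implementation; the function name auto_split_eval_rows is hypothetical, and because the real split is random, the 20% figure should be read as approximate.

```python
def auto_split_eval_rows(total_rows: int) -> int:
    """Approximate number of evaluation rows under the AUTO_SPLIT rule as quoted.

    Hypothetical helper that mirrors the three documented cases.
    """
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    if total_rows < 500:
        return 0  # all rows are used as training data
    if total_rows <= 50_000:
        return total_rows // 5  # about 20% goes to evaluation
    return 10_000  # evaluation set is capped at 10,000 rows


for n in (400, 5_000, 200_000):
    print(n, auto_split_eval_rows(n))
# 400 0
# 5000 1000
# 200000 10000
```

Note that the two upper cases agree at the boundary: 20% of 50,000 rows is exactly 10,000.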
For more information, I would recommend reading the official documentation.
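To show where DATA_SPLIT_METHOD fits for a logistic regression model, here is a hedged Python sketch that only builds the text of a CREATE MODEL statement. The helper build_create_model_sql and the names mydataset.my_model, mydataset.training_rows, and label are placeholders invented for this example. The option names (MODEL_TYPE, INPUT_LABEL_COLS, DATA_SPLIT_METHOD, DATA_SPLIT_EVAL_FRACTION) follow the BigQuery ML CREATE MODEL syntax as I understand it, so check them against the documentation before relying on them.

```python
from typing import Optional

# Split methods accepted by DATA_SPLIT_METHOD, per the BigQuery ML docs.
SPLIT_METHODS = {"AUTO_SPLIT", "RANDOM", "CUSTOM", "SEQ", "NO_SPLIT"}


def build_create_model_sql(
    model_name: str,
    source_table: str,
    label_col: str,
    split_method: str = "AUTO_SPLIT",
    eval_fraction: Optional[float] = None,
) -> str:
    """Return a CREATE MODEL statement as text (hypothetical helper).

    Nothing is sent to BigQuery; this only assembles the SQL string.
    """
    if split_method not in SPLIT_METHODS:
        raise ValueError(f"unknown split method: {split_method}")
    options = [
        "MODEL_TYPE = 'logistic_reg'",
        f"INPUT_LABEL_COLS = ['{label_col}']",
        f"DATA_SPLIT_METHOD = '{split_method}'",
    ]
    if eval_fraction is not None:
        # Fraction of rows held out for evaluation (used with RANDOM or SEQ).
        options.append(f"DATA_SPLIT_EVAL_FRACTION = {eval_fraction}")
    option_text = ",\n  ".join(options)
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        f"OPTIONS (\n  {option_text}\n) AS\n"
        f"SELECT * FROM `{source_table}`"
    )


# Placeholder names; replace with a real dataset, table, and label column.
sql = build_create_model_sql(
    "mydataset.my_model",
    "mydataset.training_rows",
    "label",
    split_method="RANDOM",
    eval_fraction=0.1,
)
print(sql)
```

The resulting string could be pasted into the BigQuery console or submitted through a client library. If a manual holdout set is kept as well, the source table here would contain only the non-holdout rows.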
Source: https://stackoverflow.com/questions/58913361/spliting-dataset-for-training-and-evaluation-in-bigquery-ml