Error after running a job in Google Cloud ML

Submitted by 断了今生、忘了曾经 on 2019-12-04 18:45:25

Finally, after submitting 77 jobs to Cloud ML, I was able to run the job. The problem was not with the arguments passed when submitting the job; it was the IO errors raised for the .npy files, which have to be written using file_io.FileIO and read back through a StringIO/BytesIO buffer.

These IO errors are not mentioned anywhere in the documentation, so check for them whenever you hit an error that says "no such file or directory".
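A minimal local sketch of the pattern described above. On Cloud ML Engine you would use file_io.FileIO (from tensorflow.python.lib.io) with a gs:// path; here plain open() stands in for it so the sketch runs locally, and plain bytes stand in for a real .npy payload:

```python
# Sketch of the .npy workaround: read the file's raw bytes through the
# GCS-aware file object, then hand an in-memory buffer to np.load().
# Plain open() stands in for file_io.FileIO('gs://...', mode='rb') here.
import io
import tempfile

def read_into_buffer(path):
    # On Cloud ML: with file_io.FileIO(path, mode='rb') as f: ...
    with open(path, 'rb') as f:
        return io.BytesIO(f.read())  # np.load() accepts this buffer directly

# Round-trip demo with placeholder bytes instead of a real .npy payload.
with tempfile.NamedTemporaryFile(suffix='.npy', delete=False) as tmp:
    tmp.write(b'\x93NUMPY-placeholder')
    path = tmp.name

buf = read_into_buffer(path)
print(buf.getvalue())  # b'\x93NUMPY-placeholder'
```

With a real array you would then call np.load(buf) instead of reading the buffer yourself.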

You will need to modify your train.py to accept a "--job-dir" command-line argument.

When you specify --job-dir in gcloud, the service passes it through to your program as an argument, so your argparser (or tf.flags, depending on which you're using) will need to be modified accordingly.

I had the same issue. It seems Google Cloud passes --job-dir through to your script anyway when loading it (even if you place it before the -- on the gcloud command line).

I fixed it the way the official gcloud census example does it, on line 153 and line 183:

import argparse

parser = argparse.ArgumentParser()
# ... your own arguments here ...
parser.add_argument(
  '--job-dir',
  help='GCS location to write checkpoints and export models',
  required=True
)
args = parser.parse_args()
arguments = args.__dict__
job_dir = arguments.pop('job_dir')  # remove it before forwarding the rest

train_model(**arguments)

Basically, this lets your Python main program accept the --job-dir parameter even if you are not using it.
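A runnable sketch of the same idea: the argument list passed to parse_args() below simulates what the service appends to your program's arguments (the bucket path and --train-files flag are placeholders, not part of any real API):

```python
# Accept --job-dir even when unused, then pop it before forwarding
# the remaining arguments to your own training function.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--train-files', default='data.csv')
parser.add_argument('--job-dir', help='GCS location for checkpoints')

# Simulated command line; in a real job, use parse_args() with no argument.
args = parser.parse_args(['--train-files', 'train.csv',
                          '--job-dir', 'gs://my-bucket/output'])

arguments = vars(args)
job_dir = arguments.pop('job_dir')  # swallow the service-supplied flag
print(job_dir)    # gs://my-bucket/output
print(arguments)  # only your own arguments remain
```

After the pop, the remaining dict can be splatted into your training entry point, e.g. train_model(**arguments), without --job-dir causing an unexpected-keyword error.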

In addition to adding --job-dir as an accepted argument, I think you should also move the flag after the --.

From the getting started:

Run the local train command using the --distributed option. Be sure to place the flag before the -- that separates the user arguments from the command-line arguments.

where, in that case, --distributed was a command-line argument.

EDIT:

--job-dir IS NOT a user argument, so it is correct to place it before the --.
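Putting the two points together, a submission command consistent with the note above might look like this (the job name, bucket, and module names are placeholders, not taken from the question):

```shell
# --job-dir is a service-level flag: it goes BEFORE the `--` separator.
# Everything after `--` is forwarded verbatim to your train.py,
# and the service also forwards --job-dir to it, which is why the
# argparse change above is needed.
gcloud ml-engine jobs submit training my_job \
    --module-name trainer.task \
    --package-path trainer/ \
    --job-dir gs://my-bucket/output \
    -- \
    --train-files gs://my-bucket/data/train.csv
```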
