Training Failed - AWS Machine Learning

浪子不回头ぞ submitted on 2021-01-29 05:45:48

Question


I am working with AWS machine learning from a MERN (MongoDB, Express, React, Node.js) stack application. When I upload a data file (.csv) for training, the training job fails after some time with the following TrainingFailed error:

AlgorithmError: CannotStartContainerError. Please make sure the container can be run with 'docker run train'. Please refer SageMaker documentation for details. It is possible that the Dockerfile's entrypoint is not properly defined, or missing permissions.

I have also configured the following settings in my AWS account.

I have also granted the following permissions in the AWS account:

I have also applied all the keys in the MongoDB configuration settings. Even with all of these settings and permissions in place, I cannot work out what else the machine learning process needs. Training never completes, and no model artifacts appear in the S3 bucket. It looks as though the SageMaker process never starts. Can anyone help me with this?

My Dockerfile, which is stored in the project folder under the name Dockerfile:

FROM ubuntu
RUN apt-get update
RUN apt-get install curl -y
RUN curl -sL https://deb.nodesource.com/setup_10.x -o nodesource_setup.sh
RUN bash nodesource_setup.sh
RUN apt install nodejs -y
WORKDIR /usr/app
COPY . /usr/app/
RUN npm install
EXPOSE 3000
ENTRYPOINT [ "python3.7", "/opt/ml/code/train.py" ]

I have also set up code images on Docker Hub for the SageMaker linear learner and XGBoost, and created repositories in ECR on AWS.

I also copied train.py to opt/ml/code/train.py in AWS and got the output: /home/ec2-user/SageMaker/docker_test_folder, but I still get this error.


Answer 1:


The error means that SageMaker is not able to launch your Docker image, because the entry point is not defined correctly. You can take a look at my repo. Basically, in your Dockerfile you have to install the required packages, create a folder (say /opt/ml/code), and put your training script, which will be called train, in that folder. The train file must follow some conventions that you can read about here.
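As a rough illustration of what such a training script might look like, here is a minimal sketch in Python. It assumes the standard SageMaker container layout (input data under /opt/ml/input/data/<channel>, model artifacts written to /opt/ml/model, failure details written to /opt/ml/output/failure) and an input channel named "train". The function names, the model.json file name, and the row-counting "training" step are placeholders invented for this example, not part of the original answer. Note also that the ENTRYPOINT in the question calls python3.7, while none of the Dockerfile's steps install Python, so the image would need Python installed for a script like this to start at all.

```python
#!/usr/bin/env python3
"""Minimal sketch of a training entry point for a custom SageMaker container."""
import csv
import json
import traceback
from pathlib import Path


def train(base_dir="/opt/ml"):
    """Read CSV files from the training channel and write a model artifact."""
    base = Path(base_dir)
    # Assumption: the training job defines an input channel named "train".
    train_dir = base / "input" / "data" / "train"
    model_dir = base / "model"
    model_dir.mkdir(parents=True, exist_ok=True)

    # Placeholder for real training: count the rows in every CSV file.
    rows = 0
    for csv_path in sorted(train_dir.glob("*.csv")):
        with csv_path.open(newline="") as f:
            rows += sum(1 for _ in csv.reader(f))

    # Files left in the model directory are what SageMaker uploads to S3.
    model_path = model_dir / "model.json"  # hypothetical artifact name
    model_path.write_text(json.dumps({"rows_seen": rows}))
    return model_path


def main(base_dir="/opt/ml"):
    """Run training; return 0 on success and 1 on failure."""
    try:
        train(base_dir)
        return 0
    except Exception:
        # Record the traceback where SageMaker looks for a failure reason.
        failure = Path(base_dir) / "output" / "failure"
        failure.parent.mkdir(parents=True, exist_ok=True)
        failure.write_text(traceback.format_exc())
        return 1


# Inside the container, the script would end with:
#     import sys; sys.exit(main())
# so that a non-zero exit code marks the training job as failed.
```

SageMaker starts the container with 'docker run train', as the error message says, so the image has to either define an ENTRYPOINT that launches this script or provide it as an executable named train on the PATH.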



Source: https://stackoverflow.com/questions/65628085/training-failed-aws-machine-learning
