Cannot load pipeline model from pyspark

Submitted by 强颜欢笑 on 2020-07-06 11:10:12

Question


Hello, I am trying to load a saved pipeline with `PipelineModel` in PySpark.

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer

    selectedDf = reviews\
        .select("reviewerID", "asin", "overall")

    # Make pipeline to build recommendation
    reviewerIndexer = StringIndexer(
        inputCol="reviewerID",
        outputCol="intReviewer"
        )
    productIndexer = StringIndexer(
        inputCol="asin",
        outputCol="intProduct"
        )
    pipeline = Pipeline(stages=[reviewerIndexer, productIndexer])
    pipelineModel = pipeline.fit(selectedDf)
    transformedFeatures = pipelineModel.transform(selectedDf)
    pipeline_model_name = './' + model_name + 'pipeline'
    pipelineModel.save(pipeline_model_name)

This code successfully saves the model to the filesystem, but the problem is that I can't load the pipeline to use it on other data. When I try to load the model with the following code, I get this error:

        pipelineModel = PipelineModel.load(pipeline_model_name)

Traceback (most recent call last):
  File "/app/spark/load_recommendation_model.py", line 12, in <module>
    sa.load_model(pipeline_model_name, recommendation_model_name, user_id)
  File "/app/spark/sparkapp.py", line 142, in load_model
    pipelineModel = PipelineModel.load(pipeline_model_name)
  File "/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 311, in load
  File "/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 240, in load
  File "/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 497, in loadMetadata
  File "/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1379, in first
ValueError: RDD is empty

What is the problem? How can I solve this?


Answer 1:


I had the same issue. The problem was that I was running Spark on a cluster of nodes but wasn't using a shared file system to save my models. As a result, saving the trained model wrote the model's data locally on the Spark workers that held the partitions in memory. When I later tried to load the model, I used the same path I had used when saving. In this situation, the Spark driver looks for the model at the specified path on ITS OWN local filesystem, where the data is incomplete. It therefore reports that the RDD (the data) is empty. (If you look at the directory of the saved model, you will see only _SUCCESS marker files; loading a model also requires the part-0000 data files, which are missing.)
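As a quick diagnostic, you can check whether the saved model's metadata directory on the driver actually contains the part files and not just the _SUCCESS marker. This is a minimal sketch: `has_part_files` is a hypothetical helper, and the `metadata` subdirectory layout assumes Spark's usual on-disk model format.

```python
import os

def has_part_files(model_dir):
    """Return True if the saved model's metadata directory contains
    part-* data files rather than only _SUCCESS markers."""
    meta = os.path.join(model_dir, "metadata")
    if not os.path.isdir(meta):
        return False
    return any(name.startswith("part-") for name in os.listdir(meta))

# Simulate an incomplete save (only a _SUCCESS marker, no part files),
# which is what the driver sees when workers wrote to their local disks.
os.makedirs("demo_model/metadata", exist_ok=True)
open("demo_model/metadata/_SUCCESS", "w").close()
print(has_part_files("demo_model"))  # False: loading would fail with "RDD is empty"
```

If this returns False for your saved model on the machine doing the loading, the data files ended up elsewhere (e.g. on the workers), which matches the "RDD is empty" symptom.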

Using a shared file system such as HDFS fixes the problem.



Source: https://stackoverflow.com/questions/51257956/cannot-load-pipeline-model-from-pyspark
