PySpark throwing error: Method __getnewargs__([]) does not exist


I have a set of files. The paths to the files are saved in a file, say all_files.txt. Using Apache Spark, I need to do an operation on all the files and club the results together.

3 Answers
  • 2020-12-06 10:36

    Using spark inside flatMap, or inside any transformation that runs on the executors, is not allowed: the SparkSession is available on the driver only. It is also not possible to create an RDD of RDDs (see: Is it possible to create nested RDDs in Apache Spark?).
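
    For example, a pattern like the following fails, because the lambda closes over the spark session, which Spark then tries to pickle and ship to the executors (an illustrative sketch, not from the original question):

    >>> paths = spark.sparkContext.textFile('all_files.txt')
    >>> # Raises "Method __getnewargs__([]) does not exist" when evaluated:
    >>> # paths.map(lambda p: spark.read.text(p).count()).collect()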

    But you can achieve this transformation another way: read the contents of all_files.txt into a dataframe on the driver, use a local map to turn each filename into a dataframe, and a local reduce to union them all. For example:

    >>> from functools import reduce  # in Python 3, reduce lives in functools
    >>> filenames = spark.read.text('all_files.txt').collect()
    >>> dataframes = map(lambda r: spark.read.text(r[0]), filenames)
    >>> all_lines_df = reduce(lambda df1, df2: df1.union(df2), dataframes)  # union replaces the deprecated unionAll
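
    As a side note, DataFrameReader.text also accepts a list of paths (assuming Spark 2.x or later), so the local map/reduce can be skipped entirely:

    >>> paths = [r[0] for r in spark.read.text('all_files.txt').collect()]
    >>> all_lines_df = spark.read.text(paths)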
    
  • 2020-12-06 10:36

    I also got this error when trying to log my model with MLflow using mlflow.sklearn.log_model while the model itself was a pyspark.ml.classification model. Using mlflow.spark.log_model solved the issue.
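
    A minimal sketch of the fix (assuming a fitted pyspark.ml model bound to a variable named model; the name is illustrative):

    >>> import mlflow
    >>> import mlflow.spark
    >>> # mlflow.spark.log_model knows how to serialize Spark ML models, while
    >>> # mlflow.sklearn.log_model pickles them, which fails for Spark objects:
    >>> with mlflow.start_run():
    ...     mlflow.spark.log_model(model, "model")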

  • 2020-12-06 10:42

    I met this problem today and finally figured out that I had referenced a Spark DataFrame object inside a pandas_udf, which caused this error.

    The conclusion:

    You can't use the SparkSession object, a Spark DataFrame, or other Spark distributed objects inside a udf or pandas_udf, because Spark cannot pickle them to ship to the executors.

    If you hit this error while using a udf, check the udf carefully; the problem is most likely a reference like that.
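
    A minimal sketch of the failure and the fix (names are illustrative; assumes Spark 2.3+ for the udf decorator):

    >>> from pyspark.sql import functions as F
    >>> lookup_df = spark.range(10)   # a distributed DataFrame
    >>> @F.udf("long")
    ... def bad_udf(x):
    ...     # WRONG: closes over lookup_df; Spark tries to pickle the DataFrame
    ...     # for the executors and fails with "Method __getnewargs__([]) does not exist"
    ...     return lookup_df.count() + x
    ...
    >>> n = lookup_df.count()   # fix: materialize the value on the driver instead
    >>> @F.udf("long")
    ... def good_udf(x):
    ...     return n + x   # closes over a plain int, which pickles fine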
