Question:
I need to convert a column of a Spark DataFrame to a list so I can use it later for matplotlib:
df.toPandas()[col_name].values.tolist()
It looks like this operation has a high performance overhead; it takes around 18 seconds. Is there another way to do it, or can I improve the performance?
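For reference, if most of the time is spent in toPandas() itself, newer Spark versions can use Apache Arrow for the conversion. A minimal sketch, assuming Spark 3.x (on Spark 2.3-2.4 the config key is spark.sql.execution.arrow.enabled) and reusing spark, df and col_name from the snippet above:

# Arrow-backed conversion typically speeds up toPandas() noticeably
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df.toPandas()[col_name].values.tolist()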
Answer 1:
If you really need a local list, there is not much you can do here, but one improvement is to collect only a single column rather than the whole DataFrame:

# on Spark 2.x+ flatMap lives on df.rdd, not on the DataFrame itself
df.select(col_name).rdd.flatMap(lambda x: x).collect()
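If you prefer to stay on the DataFrame API rather than dropping to the RDD layer, a plain comprehension over the collected Row objects gives the same flat list. A small sketch, again assuming the df and col_name from the question:

# each element of collect() is a one-field Row; row[0] unwraps it
[row[0] for row in df.select(col_name).collect()]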
Answer 2:
You can do it this way:
>>> [list(row) for row in df.collect()]
Example:
>>> d = [['Alice', 1], ['Bob', 2]]
>>> df = spark.createDataFrame(d, ['name', 'age'])
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice| 1|
| Bob| 2|
+-----+---+
>>> to_list = [list(row) for row in df.collect()]
>>> print(to_list)
[[u'Alice', 1], [u'Bob', 2]]
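To tie this back to the matplotlib use case from the question, here is a minimal end-to-end sketch; the column name 'age' mirrors the example above, and the histogram call and output filename are illustrative, not from the original post:

from pyspark.sql import SparkSession
import matplotlib.pyplot as plt

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([['Alice', 1], ['Bob', 2]], ['name', 'age'])

# collect a single column into a plain Python list, then plot it
ages = [row['age'] for row in df.select('age').collect()]
plt.hist(ages)
plt.savefig('ages.png')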
Source: https://stackoverflow.com/questions/35364133/spark-converting-dataframe-to-list-improving-performance