PySpark - Split/Filter DataFrame by column's values


Question


I have a DataFrame similar to this example:

Timestamp  | Word      | Count
30/12/2015 | example_1 | 3
29/12/2015 | example_2 | 1
28/12/2015 | example_2 | 9
27/12/2015 | example_3 | 7
...        | ...       | ...

and I want to split this DataFrame by the 'Word' column's values to obtain a "list" of DataFrames (to plot some figures in a next step). For example:

DF1

Timestamp  | Word      | Count
30/12/2015 | example_1 | 3

DF2

Timestamp  | Word      | Count
29/12/2015 | example_2 | 1
28/12/2015 | example_2 | 9

DF3

Timestamp  | Word      | Count
27/12/2015 | example_3 | 7

Is there a way to do this with PySpark (1.6)?


Answer 1:


It won't be efficient, but you can map a filter over the list of unique values:

# Collect the distinct words back to the driver as a plain Python list
words = df.select("Word").distinct().flatMap(lambda x: x).collect()
# Build one lazily filtered DataFrame per distinct word
dfs = [df.where(df["Word"] == word) for word in words]

In Spark 2.0 and later, DataFrame no longer exposes flatMap directly, so go through the underlying RDD:

words = df.select("Word").distinct().rdd.flatMap(lambda x: x).collect()
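For the plotting step the question mentions, here is one rough sketch, not part of the original answer: it assumes matplotlib is available and that each per-word slice is small enough to collect to the driver with toPandas().

import matplotlib.pyplot as plt

for word, word_df in zip(words, dfs):
    # Bring the (small) per-word slice to the driver as a pandas DataFrame
    pdf = word_df.toPandas()
    plt.plot(pdf["Timestamp"], pdf["Count"], label=word)

plt.legend()
plt.show()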



Answer 2:


In addition to what zero323 said, I would add

df.persist()

before the creation of the dfs, so the source DataFrame won't have to be recomputed each time you run an action on one of your "dfs". (The original snippet read word.persist(), but word is just a string from the list comprehension; the DataFrame being re-read on every action is df.)
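Putting the two answers together, a minimal sketch of the whole pattern (persist before fanning out, unpersist once every action has run; the unpersist() call is an addition of this sketch, not part of either answer):

# Cache the source DataFrame so each filter below reuses it
# instead of rescanning the input on every action
df.persist()

words = df.select("Word").distinct().rdd.flatMap(lambda x: x).collect()
dfs = [df.where(df["Word"] == word) for word in words]

# ... run actions on each element of dfs (count(), toPandas(), ...) ...

# Release the cached data when you are done
df.unpersist()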



Source: https://stackoverflow.com/questions/35190109/pyspark-split-filter-dataframe-by-columns-values
