Question
I have a DataFrame similar to this example:
Timestamp | Word | Count
30/12/2015 | example_1 | 3
29/12/2015 | example_2 | 1
28/12/2015 | example_2 | 9
27/12/2015 | example_3 | 7
... | ... | ...
and I want to split this DataFrame by the 'Word' column's values to obtain a "list" of DataFrames (to plot some figures in a later step). For example:
DF1
Timestamp | Word | Count
30/12/2015 | example_1 | 3
DF2
Timestamp | Word | Count
29/12/2015 | example_2 | 1
28/12/2015 | example_2 | 9
DF3
Timestamp | Word | Count
27/12/2015 | example_3 | 7
Is there a way to do this with PySpark (1.6)?
Answer 1:
It won't be efficient, but you can map a filter over the list of unique values:
# Collect the distinct values of the Word column to the driver
words = df.select("Word").distinct().flatMap(lambda x: x).collect()
# Build one filtered DataFrame per distinct value
dfs = [df.where(df["Word"] == word) for word in words]
Post Spark 2.0 (DataFrame.flatMap was removed from the Python API, so go through the underlying RDD):
words = df.select("Word").distinct().rdd.flatMap(lambda x: x).collect()
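Putting it together, here is a minimal end-to-end sketch for Spark 2.x (the sample rows and the SparkSession setup are assumptions for illustration; the column names follow the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("30/12/2015", "example_1", 3),
     ("29/12/2015", "example_2", 1),
     ("28/12/2015", "example_2", 9),
     ("27/12/2015", "example_3", 7)],
    ["Timestamp", "Word", "Count"])

# One DataFrame per distinct word, e.g. for per-word plotting
words = df.select("Word").distinct().rdd.flatMap(lambda x: x).collect()
dfs = [df.where(df["Word"] == word) for word in words]

for sub_df in dfs:
    sub_df.show()  # or sub_df.toPandas() to plot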
Answer 2:
In addition to what zero323 said, I might add
df.persist()
before the creation of the dfs, so the source DataFrame won't need to be recomputed each time you run an action on one of your "dfs".
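A minimal sketch of how that fits together (the unpersist() at the end is my addition, to release the cache once the per-word work is done):

df.persist()  # cache the source DataFrame before fanning out
words = df.select("Word").distinct().rdd.flatMap(lambda x: x).collect()
dfs = [df.where(df["Word"] == word) for word in words]
for sub_df in dfs:
    sub_df.count()  # each action reads the cached data instead of rescanning the source
df.unpersist()  # release the cache when done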
Source: https://stackoverflow.com/questions/35190109/pyspark-split-filter-dataframe-by-columns-values