Question
I'm using the Spark 1.6.2 Java APIs to load some data into a DataFrame DF1 that looks like:
Key Value
A v1
A v2
B v3
A v4
Now I need to partition DF1 based on a subset of values in the "Key" column and dump each partition to a CSV file (using spark-csv).
Desired Output:
A.csv
Key Value
A v1
A v2
A v4
B.csv
Key Value
B v3
At the moment I'm building a HashMap (myList) containing the subset of values I need to filter on, and then iterating through it, filtering on a different key in each iteration. With the following code I get what I want, but I'm wondering if there is a more efficient way to do it:
DF1 = <some operations>.cache();

for (Object filterKey : myList.keySet()) {
    // myList maps each key to its filter condition string, e.g. "Key = 'A'"
    DF2 = DF1.filter((String) myList.get(filterKey));
    DF2.write().format("com.databricks.spark.csv")
        .option("header", "true")
        .save("/" + filterKey + ".csv");
}
Answer 1:
You are almost there; you just need to add partitionBy, which will partition the files the way you want.
import org.apache.spark.sql.functions.col

DF1
  .filter(col("Key").isin(myList.keySet.toSeq: _*)) // keep only rows whose Key is in myList
  .write
  .partitionBy("Key")
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("/my/basepath/")
The files will now be stored under "/my/basepath/Key=A/", "/my/basepath/Key=B/", and so on.
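Since the question uses the Java API, here is a minimal Java sketch of the same approach for Spark 1.6. It assumes myList's keys are the "Key" values you want to keep; the variable wantedKeys and the output path are illustrative:

import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.col;

// Assumption: the keys of myList are the "Key" values to keep.
Object[] wantedKeys = myList.keySet().toArray();

DF1.filter(col("Key").isin(wantedKeys))  // keep only rows whose Key is in myList
   .write()
   .partitionBy("Key")                   // one output directory per distinct Key value
   .format("com.databricks.spark.csv")
   .option("header", "true")
   .save("/my/basepath/");

This runs a single Spark job instead of one job per key. Note that each Key=... directory will contain part-* files rather than a single A.csv; if you need exactly one file per key, you would have to coalesce or merge afterwards.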
Source: https://stackoverflow.com/questions/40691710/partition-a-spark-dataframe-based-on-a-specific-column-and-dump-the-content-of-e