PySpark: Partitioning and hashing multiple dataframes, then joining

Question


Background: I am working with clinical data spread across a lot of different .csv/.txt files. All of these files are keyed on patientID, but they have different fields. I am importing these files into DataFrames, which I will join at a later stage after first processing each DataFrame individually. I have shown examples of two of these DataFrames below (df_A and df_B). In total I have multiple DataFrames - df_A, df_B, df_C .... df_J - and I will join all of them at a later stage.

df_A = spark.read.schema(schema).format("csv").load(...)....            # Just an example
df_A.show(3)
#Example 1:
+----------+-----------------+
| patientID|   diagnosis_code|
+----------+-----------------+
|       A51|             XIII|
|       B22|               VI|
|       B13|               XV|
+----------+-----------------+
df_B.show(3)
#Example 2:
+-----------+----------+-------+-------------+--------+
|  patientID|  hospital|   city|  doctor_name|    Bill|
+-----------+----------+-------+-------------+--------+
|        A51|   Royal H| London|C.Braithwaite|  451.23|
|        B22|Surgery K.|  Leeds|      J.Small|   88.00|
|        B22|Surgery K.|  Leeds|      J.Small|  102.01|
+-----------+----------+-------+-------------+--------+
print("Number of partitions: {}".format(df_A.rdd.getNumPartitions()))# Num of partitions: 1
print("Partitioner: {}".format(df_A.rdd.partitioner))                # Partitioner: None

Number of partitions: 1 #With other DataFrames I get more partitions.
Partitioner: None

After reading all these .csv/.txt files into DataFrames, I can see that for some DataFrames the data sits on just one partition (as above), while other DataFrames end up with more partitions, depending on the size of the corresponding .csv/.txt file, which in turn determines the number of HDFS blocks created (128 MB default block size). None of the DataFrames has a partitioner at the moment.
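For reference, this throwaway snippet is how I check how the rows are spread across partitions (df_B here is just one of the DataFrames above):

# Count how many rows each partition of a DataFrame holds
rows_per_partition = df_B.rdd.glom().map(len).collect()
print("Rows per partition: {}".format(rows_per_partition))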

Question: Now, would it not be a good idea to redistribute these DataFrames across multiple partitions, hashed on patientID, so that we avoid as much shuffling as possible when we later join() them? If that is indeed what is desired, should I repartition each DataFrame on patientID and use the same partitioner for all of them (not sure if that is possible)? I have also read that a DataFrame handles this on its own, but shouldn't we explicitly specify hashing on the patientID column?
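To make the question concrete, this is roughly what I am considering. It is only a sketch; the partition count of 200 is an arbitrary placeholder I have not tuned:

num_partitions = 200   # placeholder value, not tuned

# Hash-partition every DataFrame on patientID with the same number of
# partitions, so rows sharing a patientID land in matching partitions
df_A = df_A.repartition(num_partitions, "patientID")
df_B = df_B.repartition(num_partitions, "patientID")
# ...and likewise for df_C through df_J

# Join on the common key, hoping Spark reuses the existing distribution
# instead of shuffling each DataFrame again for the join
df_joined = df_A.join(df_B, on="patientID", how="inner")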

I would really appreciate it if someone could provide some useful links or cues on what optimization strategy one should employ when dealing with these multiple DataFrames, all keyed on patientID.

Source: https://stackoverflow.com/questions/53431989/pyspark-partitioning-and-hashing-multiple-dataframes-then-joining
