Efficient pyspark join

慢半拍i 2020-11-29 11:35

I've read a lot about how to do efficient joins in pyspark. The ways to achieve efficient joins that I've found are basically:

  • Use a broadcast join if you can. (<
2 answers
  •  感动是毒
    2020-11-29 12:20

    You can also use a two-pass approach if it suits your requirements. First, repartition the data and persist it as partitioned tables (dataframe.write.partitionBy()). Then join the sub-partitions serially in a loop, "appending" to the same final result table. Sim explained this nicely; see the link below.

    two pass approach to join big dataframes in pyspark

    Based on the case explained above, I was able to join the sub-partitions serially in a loop and then persist the joined data to a Hive table.

    Here is the code.

    from pyspark.sql.functions import col

    # Add a bucket id (emp_id modulo 5) and save each DataFrame as an ORC table partitioned by it.
    emp_df_1.withColumn("par_id", col('emp_id') % 5).repartition(5, 'par_id').write.format('orc').partitionBy("par_id").saveAsTable("UDB.temptable_1")
    emp_df_2.withColumn("par_id", col('emp_id') % 5).repartition(5, 'par_id').write.format('orc').partitionBy("par_id").saveAsTable("UDB.temptable_2")
    

    So, if you are joining on an integer emp_id, you can partition by the ID modulo some number. This redistributes the load across the Spark partitions, and records with the same key get the same par_id, so they land in the matching partition of each table. You can then read each sub-partition in a loop, join the two DataFrames, and persist the result.

    partition_count = 5  # par_id takes the values 0..4 because of emp_id % 5
    for par_id in range(partition_count):
        # Read the matching sub-partition from each table.
        query1 = "SELECT * FROM UDB.temptable_1 WHERE par_id = {}".format(par_id)
        query2 = "SELECT * FROM UDB.temptable_2 WHERE par_id = {}".format(par_id)
        df1 = spark.sql(query1).alias('df1')
        df2 = spark.sql(query2).alias('df2')
        # Join only this sub-partition and keep the columns of the first table.
        innerjoin_emp = df1.join(df2, df1.emp_id == df2.emp_id, 'inner').select('df1.*')
        innerjoin_emp.show()  # preview only
        # Append to the final table; insertInto expects UDB.temptable to exist already.
        innerjoin_emp.write.format('orc').insertInto("UDB.temptable")
    

    I have tried this and it works fine. This is just an example to demonstrate the two-pass approach; your join conditions may differ, and so may the number of partitions, depending on your data size.
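
    To see why joining bucket by bucket gives the same rows as a single join, here is a minimal pure-Python sketch of the idea, without Spark. The sample rows, the name and dept fields, and the helpers bucketize and inner_join are made up for illustration; only the emp_id % 5 bucketing comes from the code above.

```python
from collections import defaultdict

NUM_BUCKETS = 5  # mirrors emp_id % 5 in the answer above


def bucketize(rows, num_buckets=NUM_BUCKETS):
    """Group rows by emp_id modulo num_buckets (the role played by par_id)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["emp_id"] % num_buckets].append(row)
    return buckets


def inner_join(left, right):
    """Naive inner join on emp_id, returning (emp_id, name, dept) tuples."""
    return [
        (l["emp_id"], l["name"], r["dept"])
        for l in left
        for r in right
        if l["emp_id"] == r["emp_id"]
    ]


# Hypothetical sample data standing in for the two DataFrames.
employees = [{"emp_id": i, "name": f"emp{i}"} for i in range(1, 13)]
departments = [{"emp_id": i, "dept": f"dept{i % 3}"} for i in range(4, 20, 2)]

left_buckets = bucketize(employees)
right_buckets = bucketize(departments)

# Second pass: join one bucket at a time and append to a single result.
bucketed_result = []
for par_id in range(NUM_BUCKETS):
    bucketed_result.extend(inner_join(left_buckets[par_id], right_buckets[par_id]))

# Reference: one join over all rows at once.
full_result = inner_join(employees, departments)

print(sorted(bucketed_result) == sorted(full_result))
```

    Two rows with equal emp_id always have the same emp_id % 5, so a matching pair can never be split across different buckets. That is why each iteration only needs to look at one sub-partition from each table.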
