What is an optimized way of joining large tables in Spark SQL

不知归路 2020-12-02 21:36

I need to join tables using Spark SQL or the DataFrame API, and I'd like to know what an optimized way of achieving this would be.

The scenario is:

  1. All data is pr
2 Answers
  •  日久生厌
    2020-12-02 22:14

    My default advice on how to optimize joins is:

    1. Use a broadcast join if you can (see this notebook). From your question it seems your tables are large and a broadcast join is not an option; a minimal sketch of the hint is included after this list anyway.

    2. Consider using a very large cluster (it's cheaper than you may think). $250 right now (6/2016) buys about 24 hours of 800 cores with 6TB of RAM and many SSDs on the EC2 spot instance market. When thinking about the total cost of a big data solution, I find that people tend to substantially undervalue their time.

    3. Use the same partitioner. See this question for information on co-grouped joins; a rough DataFrame equivalent is sketched after this list.

    4. If the data is huge and/or your cluster cannot grow enough, so that even (3) above still leads to OOM, use a two-pass approach. First, re-partition the data and persist it using partitioned tables (dataframe.write.partitionBy()). Then, join the sub-partitions serially in a loop, "appending" to the same final result table; a sketch follows the side note below.
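
    A minimal sketch of the broadcast hint from (1), for the case where the smaller side does fit in executor memory. The SparkSession setup and the table/column names (fact, dim, key) are placeholder assumptions, not from the question:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder().appName("broadcast-join-sketch").getOrCreate()

    val fact = spark.table("fact") // the large table
    val dim  = spark.table("dim")  // small enough to fit in each executor's memory

    // broadcast() ships `dim` to every executor, turning the join into a
    // map-side hash join and avoiding a shuffle of the large `fact` table.
    val joined = fact.join(broadcast(dim), Seq("key"))
    ```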

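    A rough DataFrame equivalent of (3): shuffle both sides on the join key once, with the same number of partitions, so the join itself runs co-partitioned. The table names, the "key" column and the partition count (200) are illustrative assumptions; it reuses the spark session from the sketch above:

    ```scala
    import org.apache.spark.sql.functions.col

    // Pre-partition both sides identically on the join key.
    val left  = spark.table("big_left").repartition(200, col("key"))
    val right = spark.table("big_right").repartition(200, col("key"))

    val joined = left.join(right, Seq("key"))

    // With the RDD API the same idea is an explicit partitioner, e.g.
    // rdd.partitionBy(new HashPartitioner(200)) before join/cogroup.
    ```
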
    Side note: I say "appending" above because in production I never use SaveMode.Append. It is not idempotent, and that's a dangerous thing. Instead, I use SaveMode.Overwrite deep into the subtree of a partitioned table's directory structure. Prior to 2.0.0 and 1.6.2, you'll have to delete the _SUCCESS or metadata files, or dynamic partition discovery will choke.
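
    A hedged sketch of the two-pass approach from (4), writing with SaveMode.Overwrite into individual partition subtrees as described in the side note. The paths, the "key" column, the bucket count and the use of hash()/pmod() (Spark 2.0+) are assumptions for illustration:

    ```scala
    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.{col, hash, lit, pmod}

    val numBuckets = 16 // assumed; pick so that one bucket's join fits comfortably in memory

    // Derive a bucket column from the join key; pmod keeps the result non-negative.
    def bucketed(table: String) =
      spark.table(table).withColumn("bucket", pmod(hash(col("key")), lit(numBuckets)))

    // Pass 1: persist both inputs partitioned by bucket so each bucket
    // can be read back independently.
    bucketed("big_left").write.mode(SaveMode.Overwrite)
      .partitionBy("bucket").parquet("/tmp/left_by_bucket")
    bucketed("big_right").write.mode(SaveMode.Overwrite)
      .partitionBy("bucket").parquet("/tmp/right_by_bucket")

    // Pass 2: join bucket by bucket, overwriting only that bucket's subtree
    // of the final table; this stays idempotent, unlike SaveMode.Append.
    for (b <- 0 until numBuckets) {
      val l = spark.read.parquet(s"/tmp/left_by_bucket/bucket=$b")
      val r = spark.read.parquet(s"/tmp/right_by_bucket/bucket=$b")
      l.join(r, Seq("key"))
        .write.mode(SaveMode.Overwrite)
        .parquet(s"/tmp/joined/bucket=$b")
    }
    ```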

    Hope this helps.
