I have two DataFrames A and B:
A has columns (id, info1, info2) with about 200 million rows
My default advice on how to optimize joins is:
Use a broadcast join if you can (From your question it seems your tables are large and a broadcast join is not an option). One option in Spark is to perform a broadcast join (aka map-side join in hadoop world). With broadcast join, you can very effectively join a large table (fact) with relatively small tables (dimensions) by avoiding sending all data of the large table over the network.
You can use the broadcast function to explicitly mark a dataset to be broadcast when it is used in a join operator. Spark can also broadcast automatically: the spark.sql.autoBroadcastJoinThreshold setting controls the maximum size of a table that will be broadcast to all worker nodes when performing a join.
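A minimal sketch of both variants (the paths, table names, and the join column id are placeholders, not from your question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join-sketch").getOrCreate()

// Hypothetical inputs: one large fact table, one small dimension table
val factLarge = spark.read.parquet("/path/to/large_table")
val dimSmall  = spark.read.parquet("/path/to/small_table")

// Explicit hint: mark the small side for broadcast, regardless of the threshold
val joined = factLarge.join(broadcast(dimSmall), Seq("id"))

// Automatic broadcast: tables below this size (in bytes) are broadcast on their own;
// set it to -1 to disable automatic broadcast joins entirely
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)
```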
Use the same partitioner. If two RDDs have the same partitioner, the join will not cause a shuffle. Note, however, that the lack of a shuffle does not mean no data will have to move between nodes: it is possible for two RDDs to have the same partitioner (be co-partitioned) yet have the corresponding partitions located on different nodes (not be co-located). This is still better than shuffling, but it's something to keep in mind. Co-location can improve performance, but it is hard to guarantee.
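A sketch of that pattern with the RDD API (the input paths, the key extraction, and the partition count of 400 are all assumptions for illustration):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("copartition-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical key-value RDDs, keyed by the join column (first CSV field here)
val rddA = sc.textFile("/path/to/a").map(line => (line.split(",")(0), line))
val rddB = sc.textFile("/path/to/b").map(line => (line.split(",")(0), line))

// Give both RDDs the same partitioner and persist them; the join can then reuse
// that partitioning instead of shuffling both sides again
val partitioner = new HashPartitioner(400)
val aByKey = rddA.partitionBy(partitioner).persist()
val bByKey = rddB.partitionBy(partitioner).persist()

val joined = aByKey.join(bByKey)
```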
If the data is huge and/or your cluster cannot grow large enough, so that even (2) above (the shared-partitioner approach) leads to OOM, use a two-pass approach. First, re-partition the data and persist it using partitioned tables (dataframe.write.partitionBy()). Then, join the sub-partitions serially in a loop, "appending" to the same final result table.
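A rough sketch of that two-pass pattern (the bucket count, paths, derived bucket column, and join key id are hypothetical choices, not something fixed by your setup):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

val spark = SparkSession.builder().appName("two-pass-join-sketch").getOrCreate()

// Hypothetical inputs; both too large to join in one go
val dfA = spark.read.parquet("/path/to/a")
val dfB = spark.read.parquet("/path/to/b")

// Pass 1: derive a coarse bucket from the join key and persist both sides
// as partitioned tables, so each bucket can be loaded on its own later
val buckets = 16
dfA.withColumn("bucket", pmod(hash(col("id")), lit(buckets)))
  .write.partitionBy("bucket").parquet("/path/to/a_parts")
dfB.withColumn("bucket", pmod(hash(col("id")), lit(buckets)))
  .write.partitionBy("bucket").parquet("/path/to/b_parts")

// Pass 2: join matching buckets one at a time, appending to the final result table;
// each iteration only has to fit one bucket of each side in the cluster
for (b <- 0 until buckets) {
  val aPart = spark.read.parquet(s"/path/to/a_parts/bucket=$b")
  val bPart = spark.read.parquet(s"/path/to/b_parts/bucket=$b")
  aPart.join(bPart, Seq("id"))
    .write.mode("append").parquet("/path/to/joined")
}
```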