Spark best approach Look-up Dataframe to improve performance

Submitted by 不打扰是莪最后的温柔 on 2019-12-02 09:04:48

Question


Dataframe A (millions of records) has, among others, the columns create_date and modified_date.

Dataframe B (500 records) has the columns start_date and end_date.

Current approach:

SELECT a.*, b.* FROM a JOIN b ON a.create_date BETWEEN b.start_date AND b.end_date

The above job takes half an hour or more to run.

How can I improve the performance?


Answer 1:


The DataFrame API currently has no way to push a join like that down to the data source; Spark will fully read both tables before performing the join.

https://issues.apache.org/jira/browse/SPARK-16614

You can use the RDD API instead to take advantage of the joinWithCassandraTable function:

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable




Answer 2:


As others have suggested, one approach is to broadcast the smaller dataframe. Spark can also do this automatically, controlled by the following parameter:

spark.sql.autoBroadcastJoinThreshold

If a table's estimated size is below this threshold (in bytes), Spark automatically broadcasts it to all executors and performs a broadcast join instead of a shuffle join. You can read more about this in the Spark SQL performance tuning documentation.



来源:https://stackoverflow.com/questions/39171732/spark-best-approach-look-up-dataframe-to-improve-performance
