Cross Join for calculation in Spark SQL

Question

I have a temporary view with only 1 record/value and I want to use that value to calculate the age of the customers present in another big table (with 100M rows). I used a CROSS JOIN clause, which is resulting in a performance issue.

Is there a better-performing way to implement this requirement? Would a broadcast hint be suitable in this scenario? What is the recommended approach for such cases?

Reference table: (contains only 1 value)

create temporary view ref
as
select to_date(refdt, 'dd-MM-yyyy') as refdt --returns only 1 value
from tableA
where logtype = 'A';

Cust table (10 M rows):

custid | birthdt
A1234  | 20-03-1980
B3456  | 09-05-1985
C2356  | 15-12-1990

Query (calculate age w.r.t birthdt):

select 
a.custid, 
a.birthdt, 
cast((datediff(b.refdt, a.birthdt)/365.25) as int) as age
from cust a
cross join ref b;

My question: is there a better approach to implement this requirement?

Thanks

Answer 1:

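Simply use withColumn with a literal date! Parse the string with to_date and an explicit format, because casting "10-05-2020" straight to date does not work (the cast expects yyyy-MM-dd). Both lit and to_date come from org.apache.spark.sql.functions in Scala or pyspark.sql.functions in PySpark.

df.withColumn("new_col", to_date(lit("10-05-2020"), "dd-MM-yyyy"))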
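If the reference date has to come from the ref view rather than being typed in by hand, one option is to fetch the single value on the driver and pass it in as a literal. A minimal Scala sketch, assuming the view returns exactly one row; the sample rows, the local SparkSession and the names refDate and withAge are illustrative and not from the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, datediff, lit, to_date}

val spark = SparkSession.builder()
  .appName("age-literal-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for the question's cust table and ref view.
val cust = Seq(("A1234", "20-03-1980"), ("B3456", "09-05-1985"), ("C2356", "15-12-1990"))
  .toDF("custid", "birthdt")
  .withColumn("birthdt", to_date(col("birthdt"), "dd-MM-yyyy"))
val ref = Seq("10-05-2020")
  .toDF("refdt")
  .withColumn("refdt", to_date(col("refdt"), "dd-MM-yyyy"))

// Bring the single reference date to the driver (assumes exactly one row).
val refDate: java.sql.Date = ref.head().getDate(0)

// Use the value as a literal column, so no join is involved.
val withAge = cust.withColumn(
  "age",
  (datediff(lit(refDate), col("birthdt")) / 365.25).cast("int")
)
withAge.show(false)
```

This costs one small extra job to read the reference value, after which the age calculation is a plain projection over cust.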

Answer 2:

The view returns a single constant value, so you can put that value directly into the query and drop the cross join.

select 
a.custid, 
a.birthdt, 
cast((datediff(to_date('10-05-2020', 'dd-MM-yyyy'), a.birthdt)/365.25) as int) as age
from cust a;

scala> spark.sql("select * from cust").show(false)
+------+----------+
|custid|birthdt   |
+------+----------+
|A1234 |1980-03-20|
|B3456 |1985-05-09|
|C2356 |1990-12-15|
+------+----------+

scala> spark.sql("select a.custid, a.birthdt, cast((datediff(to_date('10-05-2020', 'dd-MM-yyyy'), a.birthdt)/365.25) as int) as age from cust a").show(false)
+------+----------+---+
|custid|birthdt   |age|
+------+----------+---+
|A1234 |1980-03-20|40 |
|B3456 |1985-05-09|35 |
|C2356 |1990-12-15|29 |
+------+----------+---+
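If the date must stay dynamic, an uncorrelated scalar subquery is another way to avoid the explicit join. This is a different technique from the hard-coded value above, shown here only as a sketch: the sample data and the local SparkSession are assumptions, and the view names cust and ref mirror the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

val spark = SparkSession.builder()
  .appName("scalar-subquery-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data registered under the names used in the question.
Seq(("A1234", "20-03-1980"), ("B3456", "09-05-1985"), ("C2356", "15-12-1990"))
  .toDF("custid", "birthdt")
  .withColumn("birthdt", to_date(col("birthdt"), "dd-MM-yyyy"))
  .createOrReplaceTempView("cust")
Seq("10-05-2020")
  .toDF("refdt")
  .withColumn("refdt", to_date(col("refdt"), "dd-MM-yyyy"))
  .createOrReplaceTempView("ref")

// The subquery is evaluated once and its single value is used for every row of cust.
val ages = spark.sql("""
  select
    a.custid,
    a.birthdt,
    cast(datediff((select refdt from ref), a.birthdt) / 365.25 as int) as age
  from cust a
""")
ages.show(false)
```

Note that Spark raises an error at runtime if the subquery returns more than one row, so this only fits when ref is guaranteed to hold a single value.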

Answer 3:

It is hard to work out exactly what you are asking, but if you cannot use Scala or PySpark DataFrames (with .cache and so on), then I think you should create a single-row table instead of using a temporary view. My impression is that you are using Spark %sql in a notebook on, say, Databricks.

That is only my suspicion, though.

That said, a broadcast join hint may well mean the optimizer only sends out the one row. See https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-hint-framework.html#specifying-query-hints
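For illustration, a broadcast hint on the question's query could look like the sketch below. The sample data and the local SparkSession are assumptions; whether the hint actually changes the plan depends on the Spark version and on what the optimizer already knows about the size of ref, so it is worth checking with explain().

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

val spark = SparkSession.builder()
  .appName("broadcast-hint-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data registered under the names used in the question.
Seq(("A1234", "20-03-1980"), ("B3456", "09-05-1985"), ("C2356", "15-12-1990"))
  .toDF("custid", "birthdt")
  .withColumn("birthdt", to_date(col("birthdt"), "dd-MM-yyyy"))
  .createOrReplaceTempView("cust")
Seq("10-05-2020")
  .toDF("refdt")
  .withColumn("refdt", to_date(col("refdt"), "dd-MM-yyyy"))
  .createOrReplaceTempView("ref")

// The hint asks Spark to broadcast the one-row side (alias b) to every executor.
val ages = spark.sql("""
  select /*+ BROADCAST(b) */
    a.custid,
    a.birthdt,
    cast(datediff(b.refdt, a.birthdt) / 365.25 as int) as age
  from cust a
  cross join ref b
""")

// Print the physical plan to see which join strategy was chosen.
ages.explain()
ages.show(false)
```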

Source: https://stackoverflow.com/questions/63233707/cross-join-for-calculation-in-spark-sql

Tags

apache-spark

apache-spark-sql