Question
I have a temporary view with only 1 record/value, and I want to use that value to calculate the age of the customers in another big table (with 100 M rows). I used a CROSS JOIN clause, which results in a performance issue.
Is there a better approach to implement this requirement that will perform better? Would a broadcast hint be suitable in this scenario? What is the recommended approach for such scenarios?
Reference table (contains only 1 value):
create temporary view ref
as
select to_date(refdt, 'dd-MM-yyyy') as refdt --returns only 1 value
from tableA
where logtype = 'A';
Cust table (10 M rows):
custid | birthdt
A1234 | 20-03-1980
B3456 | 09-05-1985
C2356 | 15-12-1990
Query (calculate age w.r.t. birthdt):
select
a.custid,
a.birthdt,
cast((datediff(b.refdt, a.birthdt)/365.25) as int) as age
from cust a
cross join ref b;
My question is: is there a better approach to implement this requirement?
Thanks
Answer 1:
Simply use withColumn!
df.withColumn("new_col", to_date(lit("10-05-2020"), "dd-MM-yyyy"))
Answer 2:
Inside the view you are using a constant value. You can simply put the same value in the query below, without the cross join.
select
a.custid,
a.birthdt,
cast((datediff(to_date('10-05-2020', 'dd-MM-yyyy'), a.birthdt)/365.25) as int) as age
from cust a;
scala> spark.sql("select * from cust").show(false)
+------+----------+
|custid|birthdt   |
+------+----------+
|A1234 |1980-03-20|
|B3456 |1985-05-09|
|C2356 |1990-12-15|
+------+----------+
scala> spark.sql("select a.custid, a.birthdt, cast((datediff(to_date('10-05-2020', 'dd-MM-yyyy'), a.birthdt)/365.25) as int) as age from cust a").show(false)
+------+----------+---+
|custid|birthdt   |age|
+------+----------+---+
|A1234 |1980-03-20|40 |
|B3456 |1985-05-09|35 |
|C2356 |1990-12-15|29 |
+------+----------+---+
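If the reference date cannot be hard-coded because it changes between runs, an uncorrelated scalar subquery is one way to keep the value dynamic while still avoiding an explicit join. This is only a sketch, assuming the ref view from the question exists and returns exactly one row; Spark raises a runtime error if a scalar subquery returns more than one row.

```sql
-- Sketch: read the single reference date through a scalar subquery.
-- Assumes the temporary view ref (column refdt) returns exactly one row.
select
  a.custid,
  a.birthdt,
  cast((datediff((select refdt from ref), a.birthdt) / 365.25) as int) as age
from cust a;
```

The subquery does not depend on any column of cust, so it does not need to be joined row by row against the large table.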
Answer 3:
It is hard to work out your exact point, but if you cannot use Scala or pyspark and dataframes with .cache etc., then I think that instead of using a temporary view, you should just create a single-row table. My impression is that you are using Spark %sql in a notebook on, say, Databricks. This is my suspicion, as it were.
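The single-row table suggestion could look roughly like this in Spark SQL. The table name ref_tbl and the parquet format are assumptions made for illustration; the select is the same one the question uses for its temporary view.

```sql
-- Sketch: materialize the one-row reference value as a table.
-- ref_tbl is a hypothetical name; "using parquet" is an assumed storage format.
create table ref_tbl using parquet as
select to_date(refdt, 'dd-MM-yyyy') as refdt
from tableA
where logtype = 'A';

-- The age query then reads the already-computed single row.
select
  a.custid,
  a.birthdt,
  cast((datediff(b.refdt, a.birthdt) / 365.25) as int) as age
from cust a
cross join ref_tbl b;
```

Materializing the row means the filter on tableA runs once, when the table is created, instead of being part of every query that uses the view.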
That said, a broadcastjoin hint may well mean the optimizer only sends out 1 row. See https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-hint-framework.html#specifying-query-hints
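As a sketch of the hint approach, the query from the question can carry a BROADCAST hint that names the alias of the small view. Whether this changes the physical plan depends on what Spark already chooses, so comparing the output of EXPLAIN with and without the hint is a reasonable check.

```sql
-- Sketch: ask Spark to broadcast the one-row view b to every executor.
-- Assumes the temporary view ref (column refdt) from the question.
select /*+ BROADCAST(b) */
  a.custid,
  a.birthdt,
  cast((datediff(b.refdt, a.birthdt) / 365.25) as int) as age
from cust a
cross join ref b;
```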
Source: https://stackoverflow.com/questions/63233707/cross-join-for-calculation-in-spark-sql