Question
I want to write Spark code in Scala that filters out certain rows.
I already have a Spark SQL query and want to convert it into Spark Scala (DataFrame API) code.
In the query I perform an inner self-join on the same DataFrame and apply several filter criteria, for example that the difference between two date fields must be within the range of 1 to 9.
The Spark SQL query below is self-explanatory, so I am not explaining it further.
spark.sql("""
  select * from df1
  where Container not in (
    select a.Container
    from df1 a inner join df1 b
      on a.ContainerEquipmentNumber = b.ContainerEquipmentNumber
    where a.EquipmentType <> b.EquipmentType
      and a.transport_mode = 'Ocean'
      and b.transport_mode = 'Ocean'
      and DATEDIFF(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(a.ETD,'yyyy-MM-dd'),'yyyy-MM-dd')),
                   TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(b.ETD,'yyyy-MM-dd'),'yyyy-MM-dd')))
          between 1 and 9)
  order by ContainerEquipmentNumber, ETD desc
""")
My Spark code (not working):
val DF11 = DF0
val DF22 = DF0
DF11.join(DF22, DF11("ContainerEquipmentNumber") =!= DF22("ContainerEquipmentNumber")
&& DF11("EquipmentType")===DF22("EquipmentType")==="Ocean"
&& DATEDIFF(DF11("ETD"), DF22("ETD")),
"inner")
But the above code is not working at all.
Could someone please help me implement Spark Scala code with the same functionality as my Spark SQL query?
Thanks in advance.
|ConsigneeName|Consignee |pre_location_city |pre_location_country|pre_location_region|pre_location_locode|origin_location_city|origin_location_country|origin_location_sitename|origin_location_region|origin_location_locode|destination_location_city|destination_location_country|destination_location_sitename|destination_location_region|destination_location_locode|post_location_city|post_location_country |post_location_region|post_location_locode|main_transport_mode|pre_transport_mode|post_transport_mode|ContainerEquipmentNumber|EquipmentType|PONumber |MODSSONumber|Carrier|CarrierName|ETA |ETD |Source|Servicetype|ContainerVolume|freight_weight|Shipment_Number|weight_unit|CBLNumber |Shipper|HBLNumber|TEU |Tradelane|Booking_number|Year|Month|Day|
+-------------+----------+------------------+--------------------+-------------------+-------------------+--------------------+-----------------------+------------------------+----------------------+----------------------+-------------------------+----------------------------+-----------------------------+---------------------------+---------------------------+------------------+------------------------+--------------------+--------------------+-------------------+------------------+-------------------+------------------------+-------------+-----------+------------+-------+-----------+----------+----------+------+-----------+---------------+--------------+---------------+-----------+-----------------+-------+---------+----+---------+--------------+----+-----+---+
|ITC |GBSYNGGBI |SENEFFE |Belgium |EUROPE & AME |BESEF |ANTWERP |Belgium |null |EUROPE & AME |null |CARTAGENA |Colombia |null |LATIN AMERICA |COCTG |CARTAGENA |Columbia |null |COCTG |Ocean |Truck |Truck |TCLU5174641 |20DRY |G0085381229|ZRH0047428 |DHLU |null |2019-05-14|2019-04-30|GBI |CFSCFS |3.96 |2115.352 |ZRH0046385 |kg |DHLU/ANRA12657 |null |null |null|null |null |2020|6 |19 |
|ITC |GBSYNGGBI |SCHOENEBECK (ELBE)|Germany |null |null |HAMBURG |Germany |null |EUROPE & AME |null |CARTAGENA |Columbia |null |LATIN AMERICA |COCTG |CARTAGENA |Columbia |null |COCTG |Ocean |Truck |Truck |FCIU2693429 |20DRY |G0085405241|ZRH0058227 |HLCU |null |2019-12-03|2019-11-17|GBI |CYCY |13.92 |10095.04 |ZRH0054021 |kg |HLCU/RTM191082779|null |null |null|null |null |2020|6 |19 |
|ITC |GBSYNGGBI |OINOFYTA |Greece |EUROPE & AME |GROFY |PIRAEUS |Greece |null |EUROPE & AME |null |ALTAMIRA |Mexico (East/Gulf Coast) |null |null |null |MATAMOROS |Mexico (East/Gulf Coast)|null |MXMAM |Ocean |Truck |Truck |UACU4054126 |20DRY |G0085388341|ZRH0049718 |HLCU |null |2019-07-01|2019-05-22|GBI |CYCY |27.36 |11209.6 |ZRH0046408 |kg |HLCU/RTM190441160|null |null |null|null |null |2020|6 |19 |
|ITC |CHSYCLEGAL|JINAN |China |ASIA PACIFIC |CNJNN |QINGDAO |China |null |ASIA PACIFIC |null |MELBOURNE |Australia |null |ASIA PACIFIC |AUMEL |TOTTENHAM |Australia |ASIA PACIFIC |AUTOT |Ocean |Truck |Truck |CMAU3159388 |20DRY |G6500024081|TST1073545 |ANNU |null |2019-02-23|2019-02-06|DEX |CYCY |20 |20826 |TST0579524 |kg |ANNU/WDSM006090 |null |null |null|null |null |2020|6 |19 |
|ITC |CHSYCLEGAL|Jinan |China |null |null |QINGDAO |China |null |ASIA PACIFIC |null |MELBOURNE |Australia |null |ASIA PACIFIC |AUMEL |TOTTENHAM |Australia |ASIA PACIFIC |AUTOT |Ocean |Truck |Truck |UETU2722010 |20DRY |G6500029924|TST1135194 |HLCU |null |2019-12-03|2019-11-17|DEX |CYCY |25 |20826 |TST0606019 |kg |HLCU/TA1191101846|null |null |null|null |null |2020|6 |19 |
+-------------+----------+------------------+--------------------+-------------------+-------------------+--------------------+-----------------------+------------------------+----------------------+----------------------+-------------------------+----------------------------+-----------------------------+---------------------------+---------------------------+------------------+------------------------+--------------------+--------------------+-------------------+------------------+-------------------+------------------------+-------------+-----------+------------+-------+-----------+----------+----------+------+-----------+---------------+--------------+---------------+-----------+-----------------+-------+---------+----+---------+--------------+----+-----+---+
only showing top 5 rows
Answer 1:
import org.apache.spark.sql.functions.{datediff, to_date}
import spark.implicits._

val df0 = Seq(
("Container1", "Etype1", "2020-01-01", "Ocean"),
("Container1", "Etype1", "2020-01-01", "Ocean11"),
("Container3", "Etype1", "2020-01-01", "Ocean12"),
("Container4", "Etype1", "2020-01-01", "Ocean")
).toDF("Container", "EType", "ETD", "transport_mode")
val df1 = Seq(
("Container1", "Etype5", "2020-01-01", "Ocean"),
("Container1", "Etype1", "2020-01-01", "Ocean11"),
("Container1", "Etype1", "2020-02-01", "Ocean12"),
("Container1", "Etype6", "2020-01-01", "Ocean")
).toDF("Container", "EType", "ETD", "transport_mode")
.filter('transport_mode.equalTo("Ocean"))
val df2 = Seq(
("Container1", "Etype1", "2020-01-05", "Ocean"),
("Container1", "Etype1", "2020-01-01", "Ocean11"),
("Container1", "Etype1", "2020-01-01", "Ocean12"),
("Container1", "Etype1", "2020-01-08", "Ocean")
).toDF("Container", "EType", "ETD", "transport_mode")
.filter('transport_mode.equalTo("Ocean"))
val listContainer = df1.join(df2,
(df2.col("Container") === df1.col("Container") &&
df2.col("EType") =!= df1.col("EType") &&
datediff(to_date(df2.col("ETD"), "yyyy-MM-dd"), to_date(df1.col("ETD"), "yyyy-MM-dd")).between(1, 9))
, "inner")
.select(df1.col("Container")).dropDuplicates().as[String].collect().toList
val resultDF = df0.filter(!'Container.isin(listContainer: _*)).orderBy('Container.asc, 'ETD.desc)
Result:
resultDF.show(false)
// +----------+------+----------+--------------+
// |Container |EType |ETD |transport_mode|
// +----------+------+----------+--------------+
// |Container3|Etype1|2020-01-01|Ocean12 |
// |Container4|Etype1|2020-01-01|Ocean |
// +----------+------+----------+--------------+
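As a side note, the answer above pulls the matching container list back to the driver with `collect()` and then filters with `isin`. The same anti-filter can stay fully distributed by using a `left_anti` join instead. A minimal sketch, assuming the `df0`, `df1`, and `df2` DataFrames defined in the answer and an active `SparkSession` named `spark`:

```scala
import org.apache.spark.sql.functions.{datediff, to_date}
import spark.implicits._

// Containers matched by the self-join criteria: same Container,
// different EType, and an ETD gap of 1 to 9 days
val matchedContainers = df1.join(df2,
    df2("Container") === df1("Container") &&
    df2("EType") =!= df1("EType") &&
    datediff(to_date(df2("ETD"), "yyyy-MM-dd"),
             to_date(df1("ETD"), "yyyy-MM-dd")).between(1, 9),
    "inner")
  .select(df1("Container"))
  .distinct()

// left_anti keeps only df0 rows whose Container has NO match above,
// avoiding the collect()/isin round-trip through the driver
val resultDF = df0.join(matchedContainers, Seq("Container"), "left_anti")
  .orderBy('Container.asc, 'ETD.desc)
```

This variant scales better when the set of matched containers is large, since it never materializes the list on the driver.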
Source: https://stackoverflow.com/questions/62521079/self-join-in-spark-and-apply-multiple-filter-criteria-in-spark-scala