Question
I want to write Spark code in Scala that filters out certain rows.
I already have a Spark SQL query and want to convert it into Spark Scala (DataFrame API) code.
In the query I perform an inner self-join on the same DataFrame and apply several filter criteria, for example that the difference between two date fields must be within the range of 1 to 9.
The Spark SQL query below is self-explanatory, so I am not explaining it further.
spark.sql("""
  select * from df1
  where Container not in (
    select a.Container
    from df1 a inner join df1 b
      on a.ContainerEquipmentNumber = b.ContainerEquipmentNumber
    where a.EquipmentType <> b.EquipmentType
      and a.transport_mode = 'Ocean'
      and b.transport_mode = 'Ocean'
      and DATEDIFF(TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(a.ETD,'yyyy-MM-dd'),'yyyy-MM-dd')),
                   TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(b.ETD,'yyyy-MM-dd'),'yyyy-MM-dd')))
          between 1 and 9)
  order by ContainerEquipmentNumber, ETD desc
""")
My Spark code (not working):
val DF11 = DF0
val DF22 = DF0
DF11.join(DF22, DF11("ContainerEquipmentNumber") =!= DF22("ContainerEquipmentNumber")
&& DF11("EquipmentType")===DF22("EquipmentType")==="Ocean"
&& DATEDIFF(DF11("ETD"), DF22("ETD")),
"inner")
But the above code is not working at all.
Could someone please help me implement Spark Scala code with the same functionality as my Spark SQL query?
Thanks in advance.
|ConsigneeName|Consignee |pre_location_city |pre_location_country|pre_location_region|pre_location_locode|origin_location_city|origin_location_country|origin_location_sitename|origin_location_region|origin_location_locode|destination_location_city|destination_location_country|destination_location_sitename|destination_location_region|destination_location_locode|post_location_city|post_location_country |post_location_region|post_location_locode|main_transport_mode|pre_transport_mode|post_transport_mode|ContainerEquipmentNumber|EquipmentType|PONumber |MODSSONumber|Carrier|CarrierName|ETA |ETD |Source|Servicetype|ContainerVolume|freight_weight|Shipment_Number|weight_unit|CBLNumber |Shipper|HBLNumber|TEU |Tradelane|Booking_number|Year|Month|Day|
+-------------+----------+------------------+--------------------+-------------------+-------------------+--------------------+-----------------------+------------------------+----------------------+----------------------+-------------------------+----------------------------+-----------------------------+---------------------------+---------------------------+------------------+------------------------+--------------------+--------------------+-------------------+------------------+-------------------+------------------------+-------------+-----------+------------+-------+-----------+----------+----------+------+-----------+---------------+--------------+---------------+-----------+-----------------+-------+---------+----+---------+--------------+----+-----+---+
|ITC |GBSYNGGBI |SENEFFE |Belgium |EUROPE & AME |BESEF |ANTWERP |Belgium |null |EUROPE & AME |null |CARTAGENA |Colombia |null |LATIN AMERICA |COCTG |CARTAGENA |Columbia |null |COCTG |Ocean |Truck |Truck |TCLU5174641 |20DRY |G0085381229|ZRH0047428 |DHLU |null |2019-05-14|2019-04-30|GBI |CFSCFS |3.96 |2115.352 |ZRH0046385 |kg |DHLU/ANRA12657 |null |null |null|null |null |2020|6 |19 |
|ITC |GBSYNGGBI |SCHOENEBECK (ELBE)|Germany |null |null |HAMBURG |Germany |null |EUROPE & AME |null |CARTAGENA |Columbia |null |LATIN AMERICA |COCTG |CARTAGENA |Columbia |null |COCTG |Ocean |Truck |Truck |FCIU2693429 |20DRY |G0085405241|ZRH0058227 |HLCU |null |2019-12-03|2019-11-17|GBI |CYCY |13.92 |10095.04 |ZRH0054021 |kg |HLCU/RTM191082779|null |null |null|null |null |2020|6 |19 |
|ITC |GBSYNGGBI |OINOFYTA |Greece |EUROPE & AME |GROFY |PIRAEUS |Greece |null |EUROPE & AME |null |ALTAMIRA |Mexico (East/Gulf Coast) |null |null |null |MATAMOROS |Mexico (East/Gulf Coast)|null |MXMAM |Ocean |Truck |Truck |UACU4054126 |20DRY |G0085388341|ZRH0049718 |HLCU |null |2019-07-01|2019-05-22|GBI |CYCY |27.36 |11209.6 |ZRH0046408 |kg |HLCU/RTM190441160|null |null |null|null |null |2020|6 |19 |
|ITC |CHSYCLEGAL|JINAN |China |ASIA PACIFIC |CNJNN |QINGDAO |China |null |ASIA PACIFIC |null |MELBOURNE |Australia |null |ASIA PACIFIC |AUMEL |TOTTENHAM |Australia |ASIA PACIFIC |AUTOT |Ocean |Truck |Truck |CMAU3159388 |20DRY |G6500024081|TST1073545 |ANNU |null |2019-02-23|2019-02-06|DEX |CYCY |20 |20826 |TST0579524 |kg |ANNU/WDSM006090 |null |null |null|null |null |2020|6 |19 |
|ITC |CHSYCLEGAL|Jinan |China |null |null |QINGDAO |China |null |ASIA PACIFIC |null |MELBOURNE |Australia |null |ASIA PACIFIC |AUMEL |TOTTENHAM |Australia |ASIA PACIFIC |AUTOT |Ocean |Truck |Truck |UETU2722010 |20DRY |G6500029924|TST1135194 |HLCU |null |2019-12-03|2019-11-17|DEX |CYCY |25 |20826 |TST0606019 |kg |HLCU/TA1191101846|null |null |null|null |null |2020|6 |19 |
+-------------+----------+------------------+--------------------+-------------------+-------------------+--------------------+-----------------------+------------------------+----------------------+----------------------+-------------------------+----------------------------+-----------------------------+---------------------------+---------------------------+------------------+------------------------+--------------------+--------------------+-------------------+------------------+-------------------+------------------------+-------------+-----------+------------+-------+-----------+----------+----------+------+-----------+---------------+--------------+---------------+-----------+-----------------+-------+---------+----+---------+--------------+----+-----+---+
only showing top 5 rows
Answer 1:
import org.apache.spark.sql.functions.{datediff, to_date}
import spark.implicits._

val df0 = Seq(
("Container1", "Etype1", "2020-01-01", "Ocean"),
("Container1", "Etype1", "2020-01-01", "Ocean11"),
("Container3", "Etype1", "2020-01-01", "Ocean12"),
("Container4", "Etype1", "2020-01-01", "Ocean")
).toDF("Container", "EType", "ETD", "transport_mode")
val df1 = Seq(
("Container1", "Etype5", "2020-01-01", "Ocean"),
("Container1", "Etype1", "2020-01-01", "Ocean11"),
("Container1", "Etype1", "2020-02-01", "Ocean12"),
("Container1", "Etype6", "2020-01-01", "Ocean")
).toDF("Container", "EType", "ETD", "transport_mode")
.filter('transport_mode.equalTo("Ocean"))
val df2 = Seq(
("Container1", "Etype1", "2020-01-05", "Ocean"),
("Container1", "Etype1", "2020-01-01", "Ocean11"),
("Container1", "Etype1", "2020-01-01", "Ocean12"),
("Container1", "Etype1", "2020-01-08", "Ocean")
).toDF("Container", "EType", "ETD", "transport_mode")
.filter('transport_mode.equalTo("Ocean"))
val listContainer = df1.join(df2,
(df2.col("Container") === df1.col("Container") &&
df2.col("EType") =!= df1.col("EType") &&
datediff(to_date(df2.col("ETD"), "yyyy-MM-dd"), to_date(df1.col("ETD"), "yyyy-MM-dd")).between(1, 9))
, "inner")
.select(df1.col("Container")).dropDuplicates().as[String].collect().toList
val resultDF = df0.filter(!'Container.isin(listContainer: _*)).orderBy('Container.asc, 'ETD.desc)
Result:
resultDF.show(false)
// +----------+------+----------+--------------+
// |Container |EType |ETD |transport_mode|
// +----------+------+----------+--------------+
// |Container3|Etype1|2020-01-01|Ocean12 |
// |Container4|Etype1|2020-01-01|Ocean |
// +----------+------+----------+--------------+
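As a side note, the answer above pulls the matching container list back to the driver with `collect()` and then filters with `isin`. The same anti-filter can stay fully distributed by using a `left_anti` join instead. A minimal sketch, assuming the `df0`, `df1`, and `df2` DataFrames defined in the answer and an active `SparkSession` named `spark`:

```scala
import org.apache.spark.sql.functions.{datediff, to_date}
import spark.implicits._

// Containers matched by the self-join criteria: same Container,
// different EType, and an ETD gap of 1 to 9 days
val matchedContainers = df1.join(df2,
    df2("Container") === df1("Container") &&
    df2("EType") =!= df1("EType") &&
    datediff(to_date(df2("ETD"), "yyyy-MM-dd"),
             to_date(df1("ETD"), "yyyy-MM-dd")).between(1, 9),
    "inner")
  .select(df1("Container"))
  .distinct()

// left_anti keeps only df0 rows whose Container has NO match above,
// avoiding the collect()/isin round-trip through the driver
val resultDF = df0.join(matchedContainers, Seq("Container"), "left_anti")
  .orderBy('Container.asc, 'ETD.desc)
```

This variant scales better when the set of matched containers is large, since it never materializes the list on the driver.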
Source: https://stackoverflow.com/questions/62521079/self-join-in-spark-and-apply-multiple-filter-criteria-in-spark-scala