Question
I have a use case where I need to subtract (minus) one DataFrame from another, so I used the DataFrame except() method.
This works fine locally on a smaller set of data.
But when I run it over an AWS S3 bucket, the except() method does not perform the minus as expected. Is there anything that needs to be taken care of in a distributed environment?
Has anyone faced a similar issue?
Here is my sample code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Sample rows: (KEY, ROW_START_DATE, ROW_END_DATE, CODE, Indicator)
val values = List(
  List("One",   "2017-07-01T23:59:59.000", "2017-11-04T23:59:58.000", "A", "Yes"),
  List("Two",   "2017-07-01T23:59:59.000", "2017-11-04T23:59:58.000", "X", "No"),
  List("Three", "2017-07-09T23:59:59.000", "2017-12-05T23:59:58.000", "M", "Yes"),
  List("Four",  "2017-11-01T23:59:59.000", "2017-12-09T23:59:58.000", "A", "No"),
  List("Five",  "2017-07-09T23:59:59.000", "2017-12-05T23:59:58.000", "",  "No"),
  List("One",   "2017-07-01T23:59:59.000", "2017-11-04T23:59:58.000", "",  "No")
).map(row => (row(0), row(1), row(2), row(3), row(4)))

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

val df = values.toDF("KEY", "ROW_START_DATE", "ROW_END_DATE", "CODE", "Indicator")

// Rows whose validity window covers 2017-10-31 and whose CODE is one of M/A/R/G
val filterCond = col("ROW_START_DATE") <= "2017-10-31T23:59:59.999" &&
  col("ROW_END_DATE") >= "2017-10-31T23:59:59.999" &&
  col("CODE").isin("M", "A", "R", "G")

val Filtered = df.filter(filterCond)
val Excluded = df.except(Filtered)   // "df minus Filtered"
Expected output (what I see when running locally):
df.show(false)
Filtered.show(false)
Excluded.show(false)
+-----+-----------------------+-----------------------+----+---------+
|KEY |ROW_START_DATE |ROW_END_DATE |CODE|Indicator|
+-----+-----------------------+-----------------------+----+---------+
|One |2017-07-01T23:59:59.000|2017-11-04T23:59:58.000|A |Yes |
|Two |2017-07-01T23:59:59.000|2017-11-04T23:59:58.000|X |No |
|Three|2017-07-09T23:59:59.000|2017-12-05T23:59:58.000|M |Yes |
|Four |2017-11-01T23:59:59.000|2017-12-09T23:59:58.000|A |No |
|Five |2017-07-09T23:59:59.000|2017-12-05T23:59:58.000| |No |
|One |2017-07-01T23:59:59.000|2017-11-04T23:59:58.000| |No |
+-----+-----------------------+-----------------------+----+---------+
+-----+-----------------------+-----------------------+----+---------+
|KEY |ROW_START_DATE |ROW_END_DATE |CODE|Indicator|
+-----+-----------------------+-----------------------+----+---------+
|One |2017-07-01T23:59:59.000|2017-11-04T23:59:58.000|A |Yes |
|Three|2017-07-09T23:59:59.000|2017-12-05T23:59:58.000|M |Yes |
+-----+-----------------------+-----------------------+----+---------+
+----+-----------------------+-----------------------+----+---------+
|KEY |ROW_START_DATE |ROW_END_DATE |CODE|Indicator|
+----+-----------------------+-----------------------+----+---------+
|Four|2017-11-01T23:59:59.000|2017-12-09T23:59:58.000|A |No |
|Two |2017-07-01T23:59:59.000|2017-11-04T23:59:58.000|X |No |
|Five|2017-07-09T23:59:59.000|2017-12-05T23:59:58.000| |No |
|One |2017-07-01T23:59:59.000|2017-11-04T23:59:58.000| |No |
+----+-----------------------+-----------------------+----+---------+
But I get something like the following when running over the S3 bucket:
Filtered.show(false)
+-----+-----------------------+-----------------------+----+---------+
|KEY |ROW_START_DATE |ROW_END_DATE |CODE|Indicator|
+-----+-----------------------+-----------------------+----+---------+
|One |2017-07-01T23:59:59.000|2017-11-04T23:59:58.000|A |Yes |
|Three|2017-07-09T23:59:59.000|2017-12-05T23:59:58.000|M |Yes |
+-----+-----------------------+-----------------------+----+---------+
Excluded.show(false)
+----+-----------------------+-----------------------+----+---------+
|KEY |ROW_START_DATE |ROW_END_DATE |CODE|Indicator|
+----+-----------------------+-----------------------+----+---------+
|One |2017-07-01T23:59:59.000|2017-11-04T23:59:58.000|A |Yes |---> wrong
|Four|2017-11-01T23:59:59.000|2017-12-09T23:59:58.000|A |No |
|Two |2017-07-01T23:59:59.000|2017-11-04T23:59:58.000|X |No |
|Five|2017-07-09T23:59:59.000|2017-12-05T23:59:58.000| |No |
|One |2017-07-01T23:59:59.000|2017-11-04T23:59:58.000| |No |
+----+-----------------------+-----------------------+----+---------+
Is there any other way to perform a minus of two Spark DataFrames?
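For reference, one alternative I have experimented with (a minimal sketch on the same sample data; it is not an exact drop-in for except(), because it keeps duplicate rows from the left side and uses plain rather than null-safe equality) is to express the minus as a left anti join on every column:

// Alternative "minus": keep the rows of df that have no match in Filtered.
// Joining on all column names avoids listing them by hand.
val ExcludedViaJoin = df.join(Filtered, df.columns.toSeq, "left_anti")
ExcludedViaJoin.show(false)

On this sample it should produce the same four rows as Excluded (row order may differ).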
Answer 1:
S3 isn't quite a filesystem, and that can surface in Spark.
- Try to verify that the data written to S3 is the same as what you get when using a file:// destination, as there is a risk that things get lost on the way (a sketch of this check follows the list).
- Then try putting a Thread.sleep(10000) between writing to S3 and reading it back; that will show whether directory listing inconsistency is surfacing.
- If you are on EMR, try their consistency option (EMRFS consistent view).
- And try with the s3a:// connector.
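A rough sketch of the first two checks (the bucket name and paths are placeholders, Parquet is just one choice of format, and file:// assumes a local or single-node run):

// Write the same DataFrame to a local destination and to S3 via s3a://,
// wait a bit to let any listing inconsistency settle, then read both back and diff.
val localPath = "file:///tmp/minus-check/df"       // placeholder path
val s3Path    = "s3a://my-bucket/minus-check/df"   // placeholder bucket

df.write.mode("overwrite").parquet(localPath)
df.write.mode("overwrite").parquet(s3Path)

Thread.sleep(10000)  // crude wait to expose eventual-consistency effects

val localDf = spark.read.parquet(localPath)
val s3Df    = spark.read.parquet(s3Path)

// Any rows printed here mean the S3 copy differs from the local copy.
localDf.except(s3Df).show(false)
s3Df.except(localDf).show(false)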
If it doesn't work with s3a://, file a SPARK JIRA on issues.apache.org and put "s3a" in the text too, including this code snippet (which implicitly licenses it to the ASF). I can then replicate it in a test and see whether I can reproduce it, and if so, whether it goes away when I turn S3Guard on in Hadoop 3.1+.
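For completeness, a hedged sketch of what turning on S3Guard could look like from the Spark side (assuming Hadoop 3.1+ with hadoop-aws on the classpath; the region and table-creation flag below are illustrative values, not settings taken from this answer):

// Illustrative only: route the S3A metadata store through DynamoDB (S3Guard).
val sparkWithS3Guard = SparkSession.builder()
  .config("spark.hadoop.fs.s3a.metadatastore.impl",
    "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore")
  .config("spark.hadoop.fs.s3a.s3guard.ddb.region", "us-east-1")   // placeholder region
  .config("spark.hadoop.fs.s3a.s3guard.ddb.table.create", "true")
  .getOrCreate()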
Source: https://stackoverflow.com/questions/49188415/spark-dataframe-except-method-issue