Question
I have a DataFrame with two columns:
+--------+-----+
|col1    | col2|
+--------+-----+
|22      | 12.2|
|1       |  2.1|
|5       | 52.1|
|2       | 62.9|
|77      | 33.3|
+--------+-----+
I would like to create a new DataFrame that keeps only the rows where
"value of col1" > "value of col2".
Note that col1 has type long and col2 has type double.
The result should look like this:
+--------+----+
|col1    |col2|
+--------+----+
|22      |12.2|
|77      |33.3|
+--------+----+
Answer 1:
Another option is the DataFrame's where function, which accepts the condition as a SQL expression string.
For example (shown in Scala; in PySpark, drop the val keyword):
val output = df.where("col1>col2")
will give you the expected result:
+----+----+
|col1|col2|
+----+----+
| 22|12.2|
| 77|33.3|
+----+----+
Answer 2:
I think the simplest way is to use filter:
df_filtered=df.filter(df.col1>df.col2)
df_filtered.show()
+--------+----+
|col1    |col2|
+--------+----+
|22      |12.2|
|77      |33.3|
+--------+----+
Answer 3:
You can also use sqlContext and express the condition in SQL.
First register the DataFrame as a temporary view, for example:
df.createOrReplaceTempView("tbl1")
Then run a SQL query such as:
sqlContext.sql("select * from tbl1 where col1 > col2")
Source: https://stackoverflow.com/questions/52395986/remove-rows-from-dataframe-based-on-condition-in-pyspark