Remove rows from dataframe based on condition in pyspark

Submitted by 元气小坏坏 on 2019-12-23 19:54:52

Question


I have one dataframe with two columns:

+--------+-----+
|    col1| col2|
+--------+-----+
|      22| 12.2|
|       1|  2.1|
|       5| 52.1|
|       2| 62.9|
|      77| 33.3|
+--------+-----+

I would like to create a new dataframe that keeps only the rows where

"value of col1" > "value of col2"

As a note, col1 is of long type and col2 is of double type.
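
For context, a dataframe with these types could be built with an explicit schema, as in this minimal sketch (assuming a SparkSession named spark; the app name is arbitrary):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

spark = SparkSession.builder.appName("filter-example").getOrCreate()

# Explicit schema: col1 as long, col2 as double, matching the note above
schema = StructType([
    StructField("col1", LongType(), True),
    StructField("col2", DoubleType(), True),
])

df = spark.createDataFrame(
    [(22, 12.2), (1, 2.1), (5, 52.1), (2, 62.9), (77, 33.3)],
    schema,
)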

The result should look like this:

+--------+----+
|    col1|col2|
+--------+----+
|      22|12.2|
|      77|33.3|
+--------+----+

Answer 1:


Another possible way is to use the where function of the DataFrame.

For example, this:

output = df.where("col1 > col2")

will give you the expected result:

+----+----+
|col1|col2|
+----+----+
|  22|12.2|
|  77|33.3|
+----+----+
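
In PySpark, where is an alias for filter, so the same condition can also be written with column expressions instead of a SQL string. A sketch, assuming the df from the question:

from pyspark.sql import functions as F

# Equivalent to df.where("col1 > col2"), expressed with Column objects
output = df.where(F.col("col1") > F.col("col2"))
output.show()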



Answer 2:


I think the best way would be to simply use "filter".

df_filtered = df.filter(df.col1 > df.col2)
df_filtered.show()

+--------+----+
|    col1|col2|
+--------+----+
|      22|12.2|
|      77|33.3|
+--------+----+
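
Since the question is phrased as removing rows, the same condition can be inverted with the ~ operator to drop the matching rows instead, as in this sketch with the same df:

# Keep only the rows where the condition does NOT hold,
# i.e. remove the rows where col1 > col2
df_removed = df.filter(~(df.col1 > df.col2))
df_removed.show()

Note that filter keeps only rows where the predicate evaluates to true, so rows where either column is null are dropped in both directions.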



Answer 3:


You can use sqlContext to simplify the task.

First register the dataframe as a temporary view, for example df.createOrReplaceTempView("tbl1"), then run SQL such as sqlContext.sql("select * from tbl1 where col1 > col2").
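
Put together, the SQL route might look like the sketch below (spark.sql is the SparkSession equivalent of the older sqlContext.sql):

# Register the dataframe as a temporary view
df.createOrReplaceTempView("tbl1")

# Filter via plain SQL; the result is a new dataframe
result = spark.sql("select * from tbl1 where col1 > col2")
result.show()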



Source: https://stackoverflow.com/questions/52395986/remove-rows-from-dataframe-based-on-condition-in-pyspark
