Remove rows from dataframe based on condition in pyspark

Submitted by 元气小坏坏 on 2019-12-23 19:54:52

Question


I have one dataframe with two columns:

+--------+-----+
|    col1| col2|
+--------+-----+
|      22| 12.2|
|       1|  2.1|
|       5| 52.1|
|       2| 62.9|
|      77| 33.3|
+--------+-----+

I would like to create a new dataframe that keeps only the rows where

"value of col1" > "value of col2"

As a note, col1 is of long type and col2 is of double type.
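
For context, a dataframe with these types could be built with an explicit schema, as in this minimal sketch (assuming a SparkSession named spark; the app name is arbitrary):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

spark = SparkSession.builder.appName("filter-example").getOrCreate()

# Explicit schema: col1 as long, col2 as double, matching the note above
schema = StructType([
    StructField("col1", LongType(), True),
    StructField("col2", DoubleType(), True),
])

df = spark.createDataFrame(
    [(22, 12.2), (1, 2.1), (5, 52.1), (2, 62.9), (77, 33.3)],
    schema,
)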

The result should look like this:

+--------+----+
|    col1|col2|
+--------+----+
|      22|12.2|
|      77|33.3|
+--------+----+

Answer 1:


Another possible way is to use the where function of the DataFrame.

For example, this:

output = df.where("col1 > col2")

will give you the expected result:

+----+----+
|col1|col2|
+----+----+
|  22|12.2|
|  77|33.3|
+----+----+
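
In PySpark, where is an alias for filter, so the same condition can also be written with column expressions instead of a SQL string. A sketch, assuming the df from the question:

from pyspark.sql import functions as F

# Equivalent to df.where("col1 > col2"), expressed with Column objects
output = df.where(F.col("col1") > F.col("col2"))
output.show()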



Answer 2:


I think the best way would be to simply use "filter".

df_filtered = df.filter(df.col1 > df.col2)
df_filtered.show()

+--------+----+
|    col1|col2|
+--------+----+
|      22|12.2|
|      77|33.3|
+--------+----+
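
Since the question is phrased as removing rows, the same condition can be inverted with the ~ operator to drop the matching rows instead, as in this sketch with the same df:

# Keep only the rows where the condition does NOT hold,
# i.e. remove the rows where col1 > col2
df_removed = df.filter(~(df.col1 > df.col2))
df_removed.show()

Note that filter keeps only rows where the predicate evaluates to true, so rows where either column is null are dropped in both directions.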



Answer 3:


You can use sqlContext to simplify the task.

First register the dataframe as a temporary view, for example df.createOrReplaceTempView("tbl1"), then run SQL such as sqlContext.sql("select * from tbl1 where col1 > col2").
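
Put together, the SQL route might look like the sketch below (spark.sql is the SparkSession equivalent of the older sqlContext.sql):

# Register the dataframe as a temporary view
df.createOrReplaceTempView("tbl1")

# Filter via plain SQL; the result is a new dataframe
result = spark.sql("select * from tbl1 where col1 > col2")
result.show()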



Source: https://stackoverflow.com/questions/52395986/remove-rows-from-dataframe-based-on-condition-in-pyspark
