How to get distinct rows in dataframe using pyspark?

Submitted by 前提是你 on 2019-12-19 05:23:30

Question


I understand this is a very simple question that has most likely been answered somewhere, but as a beginner I still don't get it, so I am looking for your enlightenment. Thank you in advance:

I have an interim DataFrame:

+----------------------------+---+
|host                        |day|
+----------------------------+---+
|in24.inetnebr.com           |1  |
|uplherc.upl.com             |1  |
|uplherc.upl.com             |1  |
|uplherc.upl.com             |1  |
|uplherc.upl.com             |1  |
|ix-esc-ca2-07.ix.netcom.com |1  |
|uplherc.upl.com             |1  |
+----------------------------+---+

What I need is to remove all the redundant items in the host column; in other words, I need the final distinct result to look like:

+----------------------------+---+
|host                        |day|
+----------------------------+---+
|in24.inetnebr.com           |1  |
|uplherc.upl.com             |1  |
|ix-esc-ca2-07.ix.netcom.com |1  |
+----------------------------+---+

Answer 1:


If df is the name of your DataFrame, there are two ways to get unique rows:

df2 = df.distinct()

or

df2 = df.drop_duplicates()
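
For reference, here is a minimal, self-contained sketch of both calls, assuming a local SparkSession and rebuilding the DataFrame from the question's sample data:

from pyspark.sql import SparkSession

# Assumed setup; in an existing job the session is usually already available.
spark = SparkSession.builder.appName("distinct-example").getOrCreate()

data = [
    ("in24.inetnebr.com", 1),
    ("uplherc.upl.com", 1),
    ("uplherc.upl.com", 1),
    ("ix-esc-ca2-07.ix.netcom.com", 1),
]
df = spark.createDataFrame(data, ["host", "day"])

# Both calls return a new DataFrame with fully duplicated rows removed;
# drop_duplicates() is an alias of dropDuplicates().
df.distinct().show(truncate=False)
df.drop_duplicates().show(truncate=False)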



Answer 2:


The plain distinct() is not so user friendly, because you can't restrict it to particular columns. In this simple case it is enough:

df = df.distinct()

but if you have other values in the day column, you won't get back only the distinct elements of host:

+--------------------+---+
|                host|day|
+--------------------+---+
|   in24.inetnebr.com|  1|
|     uplherc.upl.com|  1|
|     uplherc.upl.com|  2|
|     uplherc.upl.com|  1|
|     uplherc.upl.com|  1|
|ix-esc-ca2-07.ix....|  1|
|     uplherc.upl.com|  1|
+--------------------+---+

After distinct() you will get back the following:

df.distinct().show()

+--------------------+---+
|                host|day|
+--------------------+---+
|   in24.inetnebr.com|  1|
|     uplherc.upl.com|  2|
|     uplherc.upl.com|  1|
|ix-esc-ca2-07.ix....|  1|
+--------------------+---+

so instead you should use this:

df = df.dropDuplicates(['host'])

It will keep the first value of day for each host.
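
As a quick illustration (a sketch reusing the spark session assumed above): without an explicit ordering, "first" just means whichever row Spark happens to encounter first, so it is not guaranteed to be deterministic across partitions.

# uplherc.upl.com now appears with both day 1 and day 2.
data = [
    ("in24.inetnebr.com", 1),
    ("uplherc.upl.com", 2),
    ("uplherc.upl.com", 1),
    ("ix-esc-ca2-07.ix.netcom.com", 1),
]
df = spark.createDataFrame(data, ["host", "day"])

# One row per host; which day survives depends on row order.
df.dropDuplicates(["host"]).show(truncate=False)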

If you are familiar with SQL, this approach will also work for you:

df.createOrReplaceTempView("temp_table")
new_df = spark.sql("select first(host), first(day) from temp_table GROUP BY host")

+--------------------+-----------------+
|  first(host, false)|first(day, false)|
+--------------------+-----------------+
|   in24.inetnebr.com|                1|
|ix-esc-ca2-07.ix....|                1|
|     uplherc.upl.com|                1|
+--------------------+-----------------+
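
The first(host, false) headers come from the unaliased aggregate expressions; a minor variation of the same query aliases them back to clean column names (assuming the temp_table view registered above):

# Alias the aggregates so the result keeps the original column names.
new_df = spark.sql(
    "SELECT first(host) AS host, first(day) AS day "
    "FROM temp_table GROUP BY host"
)
new_df.show(truncate=False)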


Source: https://stackoverflow.com/questions/38649793/how-to-get-distinct-rows-in-dataframe-using-pyspark
