How to get distinct rows in dataframe using pyspark?

Anonymous (unverified), submitted 2019-12-03 08:46:08

Question:

I understand this is a very simple question that has most likely been answered somewhere, but as a beginner I still don't get it and am looking for your enlightenment. Thank you in advance:

I have an interim DataFrame:

+----------------------------+---+
|host                        |day|
+----------------------------+---+
|in24.inetnebr.com           |1  |
|uplherc.upl.com             |1  |
|uplherc.upl.com             |1  |
|uplherc.upl.com             |1  |
|uplherc.upl.com             |1  |
|ix-esc-ca2-07.ix.netcom.com |1  |
|uplherc.upl.com             |1  |
+----------------------------+---+

What I need is to remove all the redundant items in the host column; in other words, I need to get the final distinct result like:

+----------------------------+---+
|host                        |day|
+----------------------------+---+
|in24.inetnebr.com           |1  |
|uplherc.upl.com             |1  |
|ix-esc-ca2-07.ix.netcom.com |1  |
+----------------------------+---+

Answer 1:

If df is the name of your DataFrame, there are two ways to get unique rows:

df2 = df.distinct() 

or

df2 = df.drop_duplicates() 

