How to get distinct rows in dataframe using pyspark?

Anonymous (unverified), submitted 2019-12-03 08:46:08

Question:

I understand this is a very simple question that has most likely been answered somewhere, but as a beginner I still don't get it and am looking for your enlightenment. Thank you in advance:

I have an interim DataFrame:

+----------------------------+---+
|host                        |day|
+----------------------------+---+
|in24.inetnebr.com           |1  |
|uplherc.upl.com             |1  |
|uplherc.upl.com             |1  |
|uplherc.upl.com             |1  |
|uplherc.upl.com             |1  |
|ix-esc-ca2-07.ix.netcom.com |1  |
|uplherc.upl.com             |1  |
+----------------------------+---+

What I need is to remove all the redundant items in the host column; in other words, I need to get the final distinct result like:

+----------------------------+---+
|host                        |day|
+----------------------------+---+
|in24.inetnebr.com           |1  |
|uplherc.upl.com             |1  |
|ix-esc-ca2-07.ix.netcom.com |1  |
+----------------------------+---+

Answer 1:

If df is the name of your DataFrame, there are two ways to get unique rows:

df2 = df.distinct() 

or

df2 = df.drop_duplicates() 

