remove duplicates from a dataframe in pyspark

Happy的楠姐 2020-12-06 00:27

I'm messing around with DataFrames in PySpark 1.4 locally and am having issues getting the dropDuplicates method to work. It keeps returning the error "AttributeError…"

2 Answers
  • 2020-12-06 00:49

    It is not an import problem. You are simply calling .dropDuplicates() on the wrong object. While sqlContext.createDataFrame(rdd1, ...) returns a pyspark.sql.dataframe.DataFrame, after you apply .collect() you have a plain Python list, and lists don't provide a dropDuplicates method. What you want is something like this:

     df1 = (sqlContext
         .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
         .dropDuplicates())

     df1.collect()
    
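    The distinction above can be illustrated without a Spark cluster: calling a DataFrame-style method on a plain Python list raises the same kind of AttributeError the question hits, while deduplicating before collecting works. A minimal pure-Python sketch (the row data is made up for illustration):

    ```python
    # Rows as tuples, standing in for the rows of a Spark DataFrame.
    rows = [(1, 'a'), (2, 'b'), (1, 'a')]

    # A plain Python list has no dropDuplicates method -- this is the
    # same AttributeError the question hits after calling .collect().
    try:
        rows.dropDuplicates()
    except AttributeError as e:
        print(e)  # 'list' object has no attribute 'dropDuplicates'

    # Deduplicate first (dict.fromkeys preserves first-seen order),
    # then "collect" the result -- mirroring df.dropDuplicates().collect().
    deduped = list(dict.fromkeys(rows))
    print(deduped)  # [(1, 'a'), (2, 'b')]
    ```

    The order of operations is the whole fix: dedupe on the DataFrame, then collect, never the other way around.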
  • 2020-12-06 00:50

    If you have a DataFrame and want to remove all duplicates with respect to a specific column (called 'colName'):

    Count before the dedupe:

    df.count()
    

    Do the dedupe (here the key column is cast to string type first):

    from pyspark.sql.functions import col
    df = df.withColumn('colName',col('colName').cast('string'))
    
    df.drop_duplicates(subset=['colName']).count()
    

    You can use a sorted groupBy to check that the duplicates have been removed:

    df.groupBy('colName').count().toPandas().set_index("count").sort_index(ascending=False)
    
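    The subset-based drop_duplicates keeps one row for each distinct value of the key column. Outside Spark, the same "first row wins per key" behavior can be sketched in plain Python (the column names here are hypothetical, mirroring the 'colName' from the answer above):

    ```python
    # Rows as dicts, standing in for DataFrame rows; 'colName' is the
    # hypothetical key column being de-duplicated on.
    rows = [
        {'colName': 'x', 'value': 1},
        {'colName': 'y', 'value': 2},
        {'colName': 'x', 'value': 3},  # duplicate key: dropped
    ]

    # Keep the first row per key, analogous to
    # df.drop_duplicates(subset=['colName']).
    seen = {}
    for row in rows:
        seen.setdefault(row['colName'], row)
    deduped = list(seen.values())

    print(len(rows), len(deduped))  # 3 2
    ```

    Note that, like this sketch, subset-based deduplication in Spark keeps an arbitrary surviving row per key, so the before/after counts (not the surviving values) are the reliable thing to check.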