I\'m messing around with dataframes in pyspark 1.4 locally and am having issues getting the drop duplicates method to work. Keeps returning the error "AttributeEr
It is not an import problem. You simply call .dropDuplicates() on a wrong object. While class of sqlContext.createDataFrame(rdd1, ...) is pyspark.sql.dataframe.DataFrame, after you apply .collect() it is a plain Python list, and lists don't provide dropDuplicates method. What you want is something like this:
(df1 = sqlContext
.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
.dropDuplicates())
df1.collect()
if you have a data frame and want to remove all duplicates -- with reference to duplicates in a specific column (called 'colName'):
count before dedupe:
df.count()
do the de-dupe (convert the column you are de-duping to string type):
from pyspark.sql.functions import col
df = df.withColumn('colName',col('colName').cast('string'))
df.drop_duplicates(subset=['colName']).count()
can use a sorted groupby to check to see that duplicates have been removed:
df.groupBy('colName').count().toPandas().set_index("count").sort_index(ascending=False)