Question
I have a 10 GB CSV file on a Hadoop cluster that contains duplicate columns. I want to analyse it in SparkR, so I use the spark-csv package to parse it into a DataFrame:
df <- read.df(
sqlContext,
FILE_PATH,
source = "com.databricks.spark.csv",
header = "true",
mode = "DROPMALFORMED"
)
But since df has duplicate Email columns, selecting that column errors out:
select(df, 'Email')
15/11/19 15:41:58 ERROR RBackendHandler: select on 1422 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: Reference 'Email' is ambiguous, could be: Email#350, Email#361.;
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:278)
...
I want to keep the first occurrence of the Email column and drop the later one. How can I do that?
Answer 1:
The best way would be to change the column name upstream ;)
However, it seems that is not possible, so there are a couple of options:
If the case of the two columns differs ("email" vs. "Email"), you can turn on case sensitivity:
sql(sqlContext, "set spark.sql.caseSensitive=true")
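With case sensitivity on, the two headers become distinct references, so a select by the exact spelling should no longer be ambiguous. A minimal sketch, assuming the duplicate really does differ only in case:

```r
# assumes the second header is spelled "email" (lowercase) - adjust to your file
sql(sqlContext, "set spark.sql.caseSensitive=true")
emails <- select(df, 'Email')   # resolves to the capitalised column only
```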
If the column names are exactly the same, you will need to manually specify the schema and skip the first row to avoid the headers:
customSchema <- structType(
  structField("year", "integer"),
  structField("make", "string"),
  structField("model", "string"),
  structField("comment", "string"),
  structField("blank", "string"))
df <- read.df(sqlContext, "cars.csv",
  source = "com.databricks.spark.csv",
  header = "true",
  schema = customSchema)
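Applied to the question, the same trick lets you give the second Email column a throwaway name so the first one can be selected unambiguously. A sketch, assuming the file has exactly the columns shown here (the non-Email field names are placeholders for whatever the real file contains):

```r
# placeholder schema: rename the second "Email" occurrence at parse time
customSchema <- structType(
  structField("Email", "string"),       # first occurrence - the one to keep
  structField("Email_dup", "string"),   # second occurrence, renamed
  structField("other", "string"))       # stand-in for the remaining columns

df <- read.df(sqlContext, FILE_PATH,
  source = "com.databricks.spark.csv",
  header = "true",
  schema = customSchema)

emails <- select(df, 'Email')           # no longer ambiguous
```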
Answer 2:
Try renaming the column. Then you can select it by position rather than by name in the select call.
colnames(df)[column number of interest] <- 'deleteme'
Alternatively, you could drop the column directly:
newdf <- df[,-x]
where x is the number of the column you don't want.
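For the question's case, you could locate the duplicate positions first and keep everything except the second occurrence. A sketch; negative column indexing may not be supported on Spark DataFrames in every SparkR version, so positive selection is used here:

```r
dup  <- which(colnames(df) == "Email")            # positions of the duplicated header
keep <- setdiff(seq_along(colnames(df)), dup[-1]) # every column except later Email copies
newdf <- df[, keep]
```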
Update:
If the above doesn't work, you can set header to false and then use the first row to rename the columns:
df <- read.df(
sqlContext,
FILE_PATH,
source = "com.databricks.spark.csv",
header = "false",
mode = "DROPMALFORMED"
)
#get first row to use as column names
mycolnames <- df[1,]
#edit the dup column *in situ*
mycolnames[x] <- 'IamNotADup'
colnames(df) <- mycolnames
# drop the first row:
df <- df[-1,]
Answer 3:
You can also create a new DataFrame with renamed columns using toDF.
Here's the same thing for PySpark: Selecting or removing duplicate columns from spark dataframe
Source: https://stackoverflow.com/questions/33816481/duplicate-columns-in-spark-dataframe