Question
I am working on a problem where I need to load a large number of CSVs and do some aggregations on them with SparkR.
- I need to infer the schema whenever I can (so detect integers etc).
- I need to assume that I can't hard-code the schema (unknown number of columns in each file or can't infer schema from column name alone).
- I can't infer the schema from a CSV file with a duplicated header value - it simply won't let you.
I load them like so:
df1 <- read.df(sqlContext, file, "com.databricks.spark.csv", header = "true", delimiter = ",")
It loads OK, but when I try to run any sort of job (even a simple count()) it fails:
java.lang.IllegalArgumentException: The header contains a duplicate entry: # etc
I tried renaming the headers in the schema with:
new <- make.unique(names(df1), sep = "_")
names(df1) <- new
schema(df1) # new column names present in schema
But when I run count() again, I get the same duplicate-header error as before, which suggests the job still reads the original column names rather than my renamed schema.
I feel like there is a really easy way, apologies in advance if there is. Any suggestions?
Answer 1:
The spark-csv package doesn't currently seem to have a way to skip lines by index, and if you don't use header = "true", your header with dupes becomes the first data row, which will mess with your schema inference. If you happen to know what character your header with dupes starts with, and know that no other line will start with that character, you can set it as the comment character and that line will get skipped. E.g.:
df <- read.df(sqlContext, "cars.csv", "com.databricks.spark.csv", header = "false", comment = "y", delimiter = ",", nullValue = "NA", mode = "DROPMALFORMED", inferSchema = "true")
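Because the offending header line is skipped as a comment, inferSchema = "true" can still detect integers and so on, since the string-typed header row never enters the data. When no safe comment character exists, one possible fallback (a sketch of mine, not from the original thread, written against the Spark 1.x / spark-csv API used above and assuming the file is also readable from the driver's local filesystem) is to read with header = "false", filter out the embedded header row, and apply deduplicated names yourself:

library(SparkR)

# Read without header detection. Skip inferSchema here: with the header
# row mixed into the data, every column would be inferred as string anyway.
# spark-csv names the columns C0, C1, ... in this mode (newer Spark
# versions use _c0, _c1, ... instead).
df <- read.df(sqlContext, "cars.csv", "com.databricks.spark.csv",
              header = "false", delimiter = ",", nullValue = "NA")

# Read the raw header line locally and deduplicate it (naive split;
# assumes no quoted, comma-containing header fields).
rawHeader <- strsplit(readLines("cars.csv", n = 1), ",")[[1]]
uniqueHeader <- make.unique(rawHeader, sep = "_")

# Drop the embedded header row: keep rows whose first field differs from
# the first header name (assumes no data row repeats that value).
df <- filter(df, df$C0 != rawHeader[1])

# Apply the deduplicated names; cast columns afterwards as needed, e.g.
# df$some_column <- cast(df$some_column, "double")  # hypothetical column
names(df) <- uniqueHeader

The filter predicate is the fragile part of this sketch; matching on the full header row rather than just its first field would be more robust, at the cost of a longer condition.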
Source: https://stackoverflow.com/questions/35844301/working-with-duplicated-columns-in-sparkr