Working with duplicated columns in SparkR


Question


I am working on a problem where I need to load a large number of CSVs and do some aggregations on them with SparkR.

  • I need to infer the schema whenever I can (so detect integers etc.; see the sketch after this list).
  • I need to assume that I can't hard-code the schema (unknown number of columns in each file or can't infer schema from column name alone).
  • I can't infer the schema from a CSV file that has a duplicated header value; spark-csv simply won't allow it.
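For the inference requirement, spark-csv exposes an inferSchema option that can be switched on in the load call (a minimal sketch; it only helps on files whose headers are clean, since a duplicated header fails before inference even runs):

df1 <- read.df(sqlContext, file, "com.databricks.spark.csv",
               header = "true", delimiter = ",", inferSchema = "true")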

I load them like so:

df1 <- read.df(sqlContext, file, "com.databricks.spark.csv", header = "true", delimiter = ",")

It loads OK, but when I try to run any sort of job (even a simple count()) it fails:

  java.lang.IllegalArgumentException: The header contains a duplicate entry: # etc

I tried renaming the headers in the schema with:

new <- make.unique(names(df1), sep = "_")
names(df1) <- new
schema(df1) # new column names present in schema

But when I try count() again, I get the same duplicate-header error as before. The rename apparently only aliases the columns in the DataFrame's schema; the underlying spark-csv source still parses the original header when the job actually executes.

I feel like there must be a really easy way to do this; apologies in advance if there is. Any suggestions?


Answer 1:


The spark-csv package doesn't currently have a way to skip lines by index, and if you don't use header="true", the header with duplicates becomes the first data row, which will throw off schema inference. If you happen to know what character the duplicated header starts with, and that no other line starts with it, you can set that as the comment character and the header line will be skipped, e.g. for a cars.csv whose header starts with "y":

df <- read.df(sqlContext, "cars.csv", "com.databricks.spark.csv",
              header = "false", comment = "y", delimiter = ",",
              nullValue = "NA", mode = "DROPMALFORMED", inferSchema = "true")
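Because the header line is now skipped entirely, spark-csv falls back to its default column names (C0, C1, ...). A minimal sketch of restoring unique names afterwards, reusing the make.unique idea from the question and assuming the file is also readable from the driver's local filesystem:

# Read the raw header line locally, de-duplicate it, and apply the
# resulting unique names to the headerless DataFrame loaded above.
header <- strsplit(readLines("cars.csv", n = 1), ",")[[1]]
names(df) <- make.unique(header, sep = "_")
count(df)  # no longer trips over the duplicate header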


Source: https://stackoverflow.com/questions/35844301/working-with-duplicated-columns-in-sparkr
