Spark CSV - No applicable constructor/method found for actual parameters

Posted by 假装没事ソ on 2019-12-02 04:11:44

TL;DR Define the schema explicitly since the input dataset does not have values to infer types from (for java.sql.Date fields).

For your case, using the untyped Dataset API could be a solution (perhaps a workaround, and honestly I'd recommend it anyway, to avoid the unnecessary deserialization from the internal row format):

cdr.filter(!$"timestamp".isNull).filter(length($"access") > 0).count

(It's Scala; I'm leaving the translation to Java as an exercise for the reader.)

The issue is that you use the inferSchema option while most fields in the input CDR_SAMPLE.csv file have no values, which makes Spark fall back to String for those fields (String is the default type when there are no values to infer a more specific type from).

In particular, the fields declared as java.sql.Date in the CDR class, i.e. dateParam1 through dateParam5, end up of type String.

import org.opencell.spark.model.CDR
import org.apache.spark.sql.Encoders
implicit val cdrEnc = Encoders.bean(classOf[CDR])
val cdrs = spark.read.
  option("inferSchema", "true").
  option("delimiter", ";").
  option("header", true).
  csv("/Users/jacek/dev/sandbox/test-bigdata/CDR_SAMPLE.csv")
scala> cdrs.printSchema
root
 |-- timestamp: timestamp (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- access: string (nullable = true)
 |-- param1: string (nullable = true)
 |-- param2: string (nullable = true)
 |-- param3: string (nullable = true)
 |-- param4: string (nullable = true)
 |-- param5: string (nullable = true)
 |-- param6: string (nullable = true)
 |-- param7: string (nullable = true)
 |-- param8: string (nullable = true)
 |-- param9: string (nullable = true)
 |-- dateParam1: string (nullable = true)
 |-- dateParam2: string (nullable = true)
 |-- dateParam3: string (nullable = true)
 |-- dateParam4: string (nullable = true)
 |-- dateParam5: string (nullable = true)
 |-- decimalParam1: string (nullable = true)
 |-- decimalParam2: string (nullable = true)
 |-- decimalParam3: string (nullable = true)
 |-- decimalParam4: string (nullable = true)
 |-- decimalParam5: string (nullable = true)
 |-- extraParam: string (nullable = true)

Note that the fields of interest, i.e. dateParam1 to dateParam5, are all strings.

 |-- dateParam1: string (nullable = true)
 |-- dateParam2: string (nullable = true)
 |-- dateParam3: string (nullable = true)
 |-- dateParam4: string (nullable = true)
 |-- dateParam5: string (nullable = true)

The issue surfaces when you "pretend" the fields have a different type by using the encoder defined for the CDR class, which declares:

private Date dateParam1;
private Date dateParam2;
private Date dateParam3; 
private Date dateParam4; 
private Date dateParam5; 

That's the root cause of the issue: there is a difference between what Spark could infer from the file and what the CDR class declares. Without the conversion the code would have worked, but since you insisted...

cdrs.as[CDR]. // <-- HERE is the issue = types don't match
  filter(cdr => cdr.timestamp != null).
  show // <-- trigger conversion

It does not really matter which field you access in the filter operator. The issue is that the conversion takes place at all, and it is the generated code (whole-stage Java code generation) that then fails with the "No applicable constructor/method found" error.
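If you want to see the mismatch directly, compare the schema the bean encoder expects with the schema Spark inferred (a minimal sketch, run in the same spark-shell session as above):

val expected = Encoders.bean(classOf[CDR]).schema   // what the CDR encoder expects
val inferred = cdrs.schema                          // what inferSchema produced

expected.filter(_.name.startsWith("dateParam")).foreach(println)
// StructField(dateParam1,DateType,true) and so on -- DateType on the encoder side
inferred.filter(_.name.startsWith("dateParam")).foreach(println)
// StructField(dateParam1,StringType,true) and so on -- StringType in the inferred schema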

I doubt Spark can do much about it, since you requested inferSchema on a dataset with no values to use for type inference. The best bet is to define the schema explicitly and set it with the schema(...) operator.
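For completeness, here is one way to do that (a minimal sketch for the spark-shell session above; the column names mirror the inferred schema, and the types for the decimal fields are an assumption that has to match what the CDR class declares):

import org.apache.spark.sql.types._

val schema = StructType(
  Seq(
    StructField("timestamp", TimestampType),
    StructField("quantity", IntegerType),
    StructField("access", StringType)) ++
  (1 to 9).map(i => StructField(s"param$i", StringType)) ++
  (1 to 5).map(i => StructField(s"dateParam$i", DateType)) ++       // matches java.sql.Date in CDR
  (1 to 5).map(i => StructField(s"decimalParam$i", DoubleType)) ++  // assumption; use e.g. a DecimalType for BigDecimal fields
  Seq(StructField("extraParam", StringType)))

val cdrs = spark.read.
  schema(schema).               // <-- explicit schema, no inferSchema needed
  option("delimiter", ";").
  option("header", true).
  csv("/Users/jacek/dev/sandbox/test-bigdata/CDR_SAMPLE.csv")

cdrs.as[CDR].show               // the conversion now lines up with the declared field types

With the schema set explicitly there is nothing left to infer, and the as[CDR] conversion no longer hits the type mismatch.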
