Regarding org.apache.spark.sql.AnalysisException error when creating a jar file using Scala

Submitted by 笑着哭i on 2021-02-17 05:33:34

Question


I have the following simple Scala class, which I will later modify to fit some machine learning models.

I need to create a jar file from this, as I am going to run these models on amazon-emr. I am a beginner in this process, so I first tested whether I could successfully import the following csv file and write it to another file by creating a jar file with the Scala class mentioned below.

The csv file looks like this, and it includes a Date column as one of the variables.

+-------------------+-------------+-------+---------+-----+
|               Date|      x1     |    y  |      x2 | x3  |
+-------------------+-------------+-------+---------+-----+
|0010-01-01 00:00:00|0.099636562E8|6405.29|    57.06|21.55|
|0010-03-31 00:00:00|0.016645123E8|5885.41|    53.54|21.89|
|0010-03-30 00:00:00|0.044308936E8|6260.95|57.080002|20.93|
|0010-03-27 00:00:00|0.124928214E8|6698.46|65.540001|23.44|
|0010-03-26 00:00:00|0.570222885E7|6768.49|     61.0|24.65|
|0010-03-25 00:00:00|0.086162414E8|6502.16|63.950001|25.24|
+-------------------+-------------+-------+---------+-----+

Data set link : https://drive.google.com/open?id=18E6nf4_lK46kl_zwYJ1CIuBOTPMriGgE

I created a jar file from this using IntelliJ IDEA, and it was built successfully.

import org.apache.spark.sql.SparkSession

object jar1 {
  def main(args: Array[String]): Unit = {

    val sc: SparkSession = SparkSession.builder()
      .appName("SparkByExample")
      .getOrCreate()

    // Read the csv passed as the first argument; inferSchema types the
    // Date column as timestamp and the numeric columns as double
    val data = sc.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(args(0))

    // Write to the path passed as the second argument
    data.write.format("text").save(args(1))
  }
}

After that I uploaded this jar file, along with the csv file mentioned above, to amazon-s3 and tried to run it on an amazon-emr cluster.

But it failed, and I got the following error message:

ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.sql.AnalysisException: Text data source does not support timestamp data type.;

I am sure this error has something to do with the Date variable in the data set, but I don't know how to fix it.

Can anyone help me figure this out?

Updated:

I tried opening the same csv file mentioned earlier, but without the date column. In that case I get this error:

ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.sql.AnalysisException: Text data source does not support double data type.;

Thank you


Answer 1:


As I noticed later, you are writing to a text file. Spark's .format("text") does not support any type other than a single String column. So to achieve your goal, you first need to convert all the columns to String and then store them:

    // Flatten each Row into one string, stripping the surrounding brackets
    df.rdd.map(_.toString().replace("[", "").replace("]", "")).saveAsTextFile("textfilename")
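If you prefer to stay at the DataFrame level rather than dropping to the RDD API, one alternative (a sketch, assuming `df` is the DataFrame read from your csv) is to collapse every row into a single string column with `concat_ws`, which is the one shape the text source accepts:

```scala
import org.apache.spark.sql.functions.{col, concat_ws}

// concat_ws casts each column to string and joins them with the separator,
// producing the single string column that format("text") requires
val asText = df.select(
  concat_ws(",", df.columns.map(col): _*).as("value")
)
asText.write.format("text").save("textfilename")
```

This avoids the bracket-stripping string manipulation and keeps the separator under your control.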

If you can consider other file-based options for storing the data, you keep the benefit of typed columns, for example CSV or JSON. Here is a working code example for csv, based on your csv file.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("Simple Application")
  .config("spark.master", "local")
  .getOrCreate()
import spark.implicits._

// inferSchema types the Date column as timestamp and the
// numeric columns as double
val df = spark.read
  .format("csv")
  .option("delimiter", ",")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("dateFormat", "yyyy-MM-dd")
  .load("datat.csv")

df.printSchema()
df.show()

// The csv data source supports all the inferred types, so no
// conversion is needed before writing
df.write
  .format("csv")
  .option("header", "true")
  .option("delimiter", "\t")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .option("escape", "\\")
  .save("another")

There is no need for a custom encoder/decoder.
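If preserving the exact column types matters, Parquet is another option worth mentioning: unlike csv, it stores the schema alongside the data, so the timestamp and double columns survive a write/read round trip without re-inferring anything. A minimal sketch, reusing the `df` from the example above (the output path is illustrative):

```scala
// Parquet keeps the schema with the data; timestamp and double
// columns come back typed exactly as they were written
df.write.format("parquet").save("another_parquet")

val back = spark.read.format("parquet").load("another_parquet")
back.printSchema()
```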



Source: https://stackoverflow.com/questions/61147303/regarding-org-apache-spark-sql-analysisexception-error-when-creating-a-jar-file
