CSV data is not loading properly as Parquet using Spark

Posted by 孤街浪徒 on 2020-08-25 03:42:27


I have a table in Hive

CREATE TABLE tab_data (
  rec_id INT,
  rec_name STRING,
  rec_value DECIMAL(3,1),
  rec_created TIMESTAMP
) STORED AS PARQUET;

and I want to populate this table with data in .csv files like these

10|customer1|10.0|2016-09-07  08:38:00.0
20|customer2|24.0|2016-09-08  10:45:00.0
30|customer3|35.0|2016-09-10  03:26:00.0
40|customer1|46.0|2016-09-11  08:38:00.0
50|customer2|55.0|2016-09-12  10:45:00.0
60|customer3|62.0|2016-09-13  03:26:00.0
70|customer1|72.0|2016-09-14  08:38:00.0
80|customer2|23.0|2016-09-15  10:45:00.0
90|customer3|30.0|2016-09-16  03:26:00.0

using Spark and Scala with code as below

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.types.{DataTypes, IntegerType, StringType, StructField, StructType, TimestampType}

object MainApp {

  val spark = SparkSession
    .builder()
    .appName("MainApp")
    .getOrCreate()

  val sc = spark.sparkContext

  val inputPath = "hdfs://host.hdfs:8020/..../tab_data.csv"
  val outputPath = "hdfs://host.hdfs:8020/...../warehouse/test.db/tab_data"

  def main(args: Array[String]): Unit = {

    try {

      val DecimalType = DataTypes.createDecimalType(3, 1)

      // Schema matching the Hive table
      val schema = StructType(List(StructField("rec_id", IntegerType, true), StructField("rec_name", StringType, true),
        StructField("rec_value", DecimalType), StructField("rec_created", TimestampType, true)))

      // Reading the data from HDFS
      val data = spark
        .read
        .schema(schema)
        .option("delimiter", "|")
        .csv(inputPath)

      data.show(truncate = false)
      data.printSchema()

      // Writing the data as Parquet
      data
        .write
        .mode(SaveMode.Overwrite)
        .parquet(outputPath)

    } finally {
      sc.stop()
      spark.stop()
    }
  }
}

The problem is that I am getting this output

+------+--------+---------+-----------+
|rec_id|rec_name|rec_value|rec_created|
+------+--------+---------+-----------+
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
|null  |null    |null     |null       |
+------+--------+---------+-----------+

root
 |-- rec_id: integer (nullable = true)
 |-- rec_name: string (nullable = true)
 |-- rec_value: decimal(3,1) (nullable = true)
 |-- rec_created: timestamp (nullable = true)

The schema is fine, but the data is not loading properly into the table:

SELECT * FROM tab_data;

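+------------------+--------------------+---------------------+-----------------------+
| tab_data.rec_id  | tab_data.rec_name  | tab_data.rec_value  | tab_data.rec_created  |
+------------------+--------------------+---------------------+-----------------------+
| NULL             | NULL               | NULL                | NULL                  |
| NULL             | NULL               | NULL                | NULL                  |
| NULL             | NULL               | NULL                | NULL                  |
| NULL             | NULL               | NULL                | NULL                  |
| NULL             | NULL               | NULL                | NULL                  |
| NULL             | NULL               | NULL                | NULL                  |
| NULL             | NULL               | NULL                | NULL                  |
| NULL             | NULL               | NULL                | NULL                  |
| NULL             | NULL               | NULL                | NULL                  |
+------------------+--------------------+---------------------+-----------------------+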

What am I doing wrong?

I'm new to Spark, and any help would be appreciated.


To deal with compatibility issues between Spark, Hive and Parquet, set up your SparkSession as follows:

  val spark = SparkSession
    .builder()
    .config("spark.sql.shuffle.partitions", "200") // 200 is Spark's default; change it to a number of partitions that suits your data
    .config("spark.sql.parquet.writeLegacyFormat", true) // Avoids data type issues between Spark and Hive.
                                                         // The convention Spark uses to write Parquet data is configurable
                                                         // through the property spark.sql.parquet.writeLegacyFormat.
                                                         // The default value is false. If set to true,
                                                         // Spark uses the same convention as Hive for writing Parquet data.
    .getOrCreate()

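Afterwards, read the .csv data as follows:

      val data = spark.read
        .schema(schema)
        .option("delimiter", "|") // fields in the sample file are separated by |
        .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.S") // to read timestamp fields
        .option("inferSchema", false) // false is already the default
        .csv(inputPath)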

Then write the data as Parquet with no compression (by default the data is compressed) as follows:

      data.write.mode(SaveMode.Overwrite)
        .option("compression", "none") // Assuming no data compression
        .parquet(outputPath)

Note: The reason Hive cannot query the data is probably that Spark compresses Parquet output with snappy by default, while your CREATE TABLE statement stores the data as Parquet without compression.


You are getting null values in all columns because one of the columns is read as a String that cannot be converted to the Timestamp type.
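
One way to confirm that a conversion failure is behind the all-null rows is to re-read the file in FAILFAST mode, which makes Spark's CSV reader throw an exception on a record it cannot parse instead of returning nulls. The sketch below is only a diagnostic aid: the names FailFastCheck and readStrict are hypothetical, and the | delimiter is assumed from the sample data in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

object FailFastCheck {
  // Reads a delimited file with an explicit schema in FAILFAST mode.
  // Reading is lazy, so the exception is only raised once an action
  // (show, collect, count, ...) actually parses a malformed record.
  def readStrict(spark: SparkSession, path: String, schema: StructType): DataFrame =
    spark.read
      .schema(schema)
      .option("delimiter", "|")  // assumption: pipe-separated fields, as in the sample file
      .option("mode", "FAILFAST") // the default mode is PERMISSIVE
      .csv(path)
}
```

Calling FailFastCheck.readStrict(spark, inputPath, schema).show(false) should then fail with an error that points at the malformed record rather than printing a table of nulls.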

To convert the string to a timestamp, specify the timestamp format with option("timestampFormat","yyyy-MM-dd HH:mm:ss.S") while loading the CSV data.

Check the code below.


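scala> import org.apache.spark.sql.types._

scala> val schema = StructType(List(
   StructField("rec_id", IntegerType, true),
   StructField("rec_name", StringType, true),
   StructField("rec_value", DataTypes.createDecimalType(3, 1), true),
   StructField("rec_created", TimestampType, true)))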

Loading CSV Data

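scala> val df = spark.read.schema(schema).option("delimiter", "|").
     | option("timestampFormat","yyyy-MM-dd HH:mm:ss.S").
     | csv(inputPath) // inputPath as defined in the question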

scala> df.show(false)
+------+---------+---------+-------------------+
|rec_id|rec_name |rec_value|rec_created        |
+------+---------+---------+-------------------+
|10    |customer1|10.0     |2016-09-07 08:38:00|
|20    |customer2|24.0     |2016-09-08 10:45:00|
|30    |customer3|35.0     |2016-09-10 03:26:00|
|40    |customer1|46.0     |2016-09-11 08:38:00|
|50    |customer2|55.0     |2016-09-12 10:45:00|
|60    |customer3|62.0     |2016-09-13 03:26:00|
|70    |customer1|72.0     |2016-09-14 08:38:00|
|80    |customer2|23.0     |2016-09-15 10:45:00|
|90    |customer3|30.0     |2016-09-16 03:26:00|
+------+---------+---------+-------------------+
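
The meaning of the pattern letters in yyyy-MM-dd HH:mm:ss.S can be sanity-checked outside Spark. The sketch below uses java.time from the JDK, which is not necessarily the parser your Spark version uses internally (Spark 2.x and 3.x differ here), so treat it as an illustration of what the pattern describes, not as a reproduction of Spark's behaviour. TimestampPatternCheck is a hypothetical name, and the sample value assumes a single space between the date and the time.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object TimestampPatternCheck {
  // Same pattern string as the one passed to the timestampFormat option:
  // 4-digit year, 2-digit month and day, 24-hour time, one fractional-second digit.
  val pattern = "yyyy-MM-dd HH:mm:ss.S"
  val formatter: DateTimeFormatter = DateTimeFormatter.ofPattern(pattern)

  // Throws java.time.format.DateTimeParseException if raw does not match the pattern.
  def parse(raw: String): LocalDateTime = LocalDateTime.parse(raw, formatter)

  def main(args: Array[String]): Unit =
    println(parse("2016-09-07 08:38:00.0"))
}
```

If a value from your file does not parse here, it is worth comparing it character by character against the pattern (separators, number of fractional digits) before loading it with Spark.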


Since the table is a managed table, you don't need to set all those parameters; you can use the insertInto function to insert the data into the table.
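
A minimal sketch of the insertInto approach follows. InsertIntoSketch and appendTo are hypothetical names, and the choice of SaveMode.Append is an assumption (use Overwrite if the existing rows should be replaced). Note that insertInto matches columns by position, not by name, so the DataFrame columns must be in the same order as in the table definition.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object InsertIntoSketch {
  // Appends the rows of df to an already existing table.
  // The table's own storage format is used, so no output path or
  // format needs to be specified on the writer.
  def appendTo(df: DataFrame, tableName: String): Unit =
    df.write
      .mode(SaveMode.Append)
      .insertInto(tableName)
}
```

For the table in the question this would be called as InsertIntoSketch.appendTo(df, "test.tab_data"), assuming the database is named test (a guess based on the test.db directory in the output path) and that the SparkSession was built with enableHiveSupport() so it can see the Hive metastore.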


