How to load the csv file into the Spark DataFrame with Array[Int]

Submitted by 点点圈 on 2020-07-05 10:27:04

Question


Every row in my csv file is structured like this:

u001, 2013-11, 0, 1, 2, ... , 99

in which u001 and 2013-11 are the UID and the date, and the numbers from 0 to 99 are the data values. I want to load this csv file into a Spark DataFrame with this structure:

+-------+-------------+-----------------+
|    uid|         date|       dataVector|
+-------+-------------+-----------------+
|   u001|      2013-11|  [0,1,...,98,99]|
|   u002|      2013-11| [1,2,...,99,100]|
+-------+-------------+-----------------+

root
 |-- uid: string (nullable = true)
 |-- date: string (nullable = true)
 |-- dataVector: array (nullable = true)
 |    |-- element: integer (containsNull = true)

in which dataVector is an Array[Int] whose length is the same for every UID and date. I have tried several ways to get this, including:

  1. Using a schema

    import org.apache.spark.sql.types._

    val attributes = Array("uid", "date", "dataVector")
    val schema = StructType(
      StructField(attributes(0), StringType, true) ::
      StructField(attributes(1), StringType, true) ::
      StructField(attributes(2), ArrayType(IntegerType), true) ::
      Nil)
    

But this way didn't work well: Spark's CSV data source cannot parse a column directly into an ArrayType. Besides, since my later dataset has more than 100 data columns, it would also be inconvenient to write out a schema that lists every dataVector column by hand.

  1. Directly loading the csv file without a schema, and using the approach from "concatenate multiple columns into single columns" to combine the data columns together, but the resulting schema looks like

     root
      |-- uid: string (nullable = true)
      |-- date: string (nullable = true)
      |-- dataVector: struct (nullable = true)
      |    |-- _c3: string (containsNull = true)
      |    |-- _c4: string (containsNull = true)
      .
      .
      .
      |    |-- _c101: string (containsNull = true)
    

This is still different from what I need, and I couldn't find a way to convert this struct into the array I want (a sketch of this attempt is shown below). So my question is: how can I load the csv file into the structure I need?
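
For reference, a minimal sketch of that second attempt, assuming the file is read without a header (so the columns come back as _c0, _c1, ...) and the data columns were packed with struct(); the path and the raw/combined names are placeholders:

import org.apache.spark.sql.functions._

val raw = spark.read.csv("path/to/data.csv")  // placeholder path

// Rename the first two columns and pack the remaining string columns into a struct
val combined = raw
  .withColumnRenamed("_c0", "uid")
  .withColumnRenamed("_c1", "date")
  .withColumn("dataVector", struct(raw.columns.drop(2).map(col): _*))
  .drop(raw.columns.drop(2): _*)

combined.printSchema()
// dataVector comes out as a struct of string fields, not the desired array<int>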


Answer 1:


Load it without any additions

val df = spark.read.csv(path)

and select:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column

// Combine data into array
val dataVector: Column = array(
  df.columns.drop(2).map(col): _*  // Skip first 2 columns
).cast("array<int>")  // Cast to the required type
val cols: Array[Column] = df.columns.take(2).map(col) :+ dataVector

df.select(cols: _*).toDF("uid", "date", "dataVector")
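
As a quick sanity check, a sketch reusing df and cols from above (the row contents in the comments are just illustrative):

val result = df.select(cols: _*).toDF("uid", "date", "dataVector")

result.printSchema()   // dataVector should now be array<int> instead of ~100 string columns
result.show(2, false)  // e.g. |u001|2013-11|[0, 1, ..., 99]|

// Each row exposes the column as a Scala Seq[Int]
val firstVector: Seq[Int] = result.first().getAs[Seq[Int]]("dataVector")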


Source: https://stackoverflow.com/questions/47824961/how-to-load-the-csv-file-into-the-spark-dataframe-with-arrayint
