How to load the csv file into the Spark DataFrame with Array[Int]

Submitted by 点点圈 on 2020-07-05 10:27:04

Question


Every row in my csv file is structured like this:

u001, 2013-11, 0, 1, 2, ... , 99

in which u001 and 2013-11 are the UID and the date, and the numbers from 0 to 99 are the data values. I want to load this csv file into a Spark DataFrame with this structure:

+-------+-------------+-----------------+
|    uid|         date|       dataVector|
+-------+-------------+-----------------+
|   u001|      2013-11|  [0,1,...,98,99]|
|   u002|      2013-11| [1,2,...,99,100]|
+-------+-------------+-----------------+

root
 |-- uid: string (nullable = true)
 |-- date: string (nullable = true)
 |-- dataVector: array (nullable = true)
 |    |-- element: integer (containsNull = true)

in which dataVector is an Array[Int] whose length is the same for every UID and date. I have tried several ways to get this, including:

  1. Using a schema

    import org.apache.spark.sql.types._

    val attributes = Array("uid", "date", "dataVector")
    val schema = StructType(
      StructField(attributes(0), StringType, true) ::
      StructField(attributes(1), StringType, true) ::
      StructField(attributes(2), ArrayType(IntegerType), true) ::
      Nil)
    

But this way didn't work well: Spark's CSV data source cannot parse a column directly into an ArrayType. Besides, since my later dataset has more than 100 data columns, it would also be inconvenient to write out a schema that lists every dataVector column by hand.

  1. Directly loading the csv file without a schema, and using the approach from "concatenate multiple columns into single columns" to combine the data columns together, but the resulting schema looks like

     root
      |-- uid: string (nullable = true)
      |-- date: string (nullable = true)
      |-- dataVector: struct (nullable = true)
      |    |-- _c3: string (containsNull = true)
      |    |-- _c4: string (containsNull = true)
      .
      .
      .
      |    |-- _c101: string (containsNull = true)
    

This is still different from what I need, and I couldn't find a way to convert this struct into the array I want (a sketch of this attempt is shown below). So my question is: how can I load the csv file into the structure I need?
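
For reference, a minimal sketch of that second attempt, assuming the file is read without a header (so the columns come back as _c0, _c1, ...) and the data columns were packed with struct(); the path and the raw/combined names are placeholders:

import org.apache.spark.sql.functions._

val raw = spark.read.csv("path/to/data.csv")  // placeholder path

// Rename the first two columns and pack the remaining string columns into a struct
val combined = raw
  .withColumnRenamed("_c0", "uid")
  .withColumnRenamed("_c1", "date")
  .withColumn("dataVector", struct(raw.columns.drop(2).map(col): _*))
  .drop(raw.columns.drop(2): _*)

combined.printSchema()
// dataVector comes out as a struct of string fields, not the desired array<int>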


Answer 1:


Load it without any additions

val df = spark.read.csv(path)

and select:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column

// Combine data into array
val dataVector: Column = array(
  df.columns.drop(2).map(col): _*  // Skip first 2 columns
).cast("array<int>")  // Cast to the required type
val cols: Array[Column] = df.columns.take(2).map(col) :+ dataVector

df.select(cols: _*).toDF("uid", "date", "dataVector")
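
As a quick sanity check, a sketch reusing df and cols from above (the row contents in the comments are just illustrative):

val result = df.select(cols: _*).toDF("uid", "date", "dataVector")

result.printSchema()   // dataVector should now be array<int> instead of ~100 string columns
result.show(2, false)  // e.g. |u001|2013-11|[0, 1, ..., 99]|

// Each row exposes the column as a Scala Seq[Int]
val firstVector: Seq[Int] = result.first().getAs[Seq[Int]]("dataVector")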


Source: https://stackoverflow.com/questions/47824961/how-to-load-the-csv-file-into-the-spark-dataframe-with-arrayint
