Load CSV data into a DataFrame and convert it to an array using Apache Spark (Java)

I have a CSV file with the following data:

1,2,5  
2,4  
2,3

I want to load it into a DataFrame whose column has the schema array of strings.

The output should look like this:

[1, 2, 5]  
[2, 4]  
[2, 3]

This has already been answered for Scala here: Spark: Convert column of string to an array

I want to do the same thing in Java. Please help.

Below is sample code in Java. Read the file with the spark.read().text(String path) method, which loads each line into a single string column named value, and then call the split function on that column to get an array of strings.

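import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.split;

public class SparkSample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSample")
                .master("local[*]")
                .getOrCreate();
        // Read the file as plain text: one row per line, in a string column named "value"
        Dataset<Row> ds = spark.read().text("c://tmp//sample.csv").toDF("value");
        ds.show(false);
        // Split each line on commas to get an array<string> column named "new_value"
        Dataset<Row> ds1 = ds.select(split(ds.col("value"), ",")).toDF("new_value");
        ds1.show(false);
        ds1.printSchema();
    }
}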
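
As a rough illustration of what the split step does to each row, here is a plain-Java sketch with no Spark dependency. The class name SplitSketch and the helper splitLines are hypothetical, introduced only for this example, and it assumes each line is a simple comma-separated string like the sample data in the question. It is not how Spark implements split internally, only a way to see the expected per-row result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; not part of the Spark API.
class SplitSketch {

    // Splits each comma-separated line into a list of string fields.
    // The limit of -1 keeps trailing empty fields instead of dropping them.
    static List<List<String>> splitLines(List<String> lines) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : lines) {
            rows.add(Arrays.asList(line.split(",", -1)));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Same three rows as the sample CSV in the question
        List<String> lines = Arrays.asList("1,2,5", "2,4", "2,3");
        for (List<String> row : splitLines(lines)) {
            System.out.println(row);
        }
    }
}
```

For the three sample lines, each row comes out as a list of strings of varying length, which is the shape the question asks for.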

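You can also use the VectorAssembler class to combine several columns into a single feature vector, which is particularly useful with ML pipelines. Note that it produces an ML vector column from numeric input columns rather than an array of strings (the example below is in Scala):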

val assembler = new VectorAssembler()
  .setInputCols(Array("city", "status", "vendor"))
  .setOutputCol("features")

https://spark.apache.org/docs/2.2.0/ml-features.html#vectorassembler

Source: https://stackoverflow.com/questions/47687194/load-csv-data-in-to-dataframe-and-convert-to-array-using-apache-spark-java

Tags: java, csv, apache-spark, dataframe, apache-spark-dataset
