Load CSV data into a DataFrame and convert it to an array using Apache Spark (Java)

谁都会走 submitted on 2019-12-02 03:00:12

Question


I have a CSV file with the following data:

1,2,5  
2,4  
2,3 

I want to load it into a DataFrame whose single column has the type array of strings.

The output should look like this:

[1, 2, 5]  
[2, 4]  
[2, 3] 

This has already been answered for Scala here: Spark: Convert column of string to an array

How can I do the same in Java?


Answer 1:


Below is sample code in Java. Read the file with the spark.read().text(String path) method, which loads each line into a single string column named value, and then call the split function on that column.

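import static org.apache.spark.sql.functions.split;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSample")
                .master("local[*]")
                .getOrCreate();
        // Read the file as plain text: one row per line, in a string column named "value"
        Dataset<Row> ds = spark.read().text("c://tmp//sample.csv").toDF("value");
        ds.show(false);
        // Split each line on commas to get an array<string> column
        Dataset<Row> ds1 = ds.select(split(ds.col("value"), ",")).toDF("new_value");
        ds1.show(false);
        ds1.printSchema();
        spark.stop();
    }
}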
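
To see what the split step does to each line without running Spark, here is a minimal plain-Java sketch. The class and method names (LineSplitSketch, splitLine, splitAll) are hypothetical and not part of the answer above. It assumes that the trailing spaces in the sample lines should be dropped, so it trims each line and splits on a comma with optional surrounding whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LineSplitSketch {

    // Hypothetical helper: trim the line, then split on commas,
    // ignoring any whitespace around each comma.
    public static String[] splitLine(String line) {
        return line.trim().split("\\s*,\\s*");
    }

    // Apply splitLine to every input line, preserving order.
    public static List<String[]> splitAll(List<String> lines) {
        List<String[]> result = new ArrayList<>();
        for (String line : lines) {
            result.add(splitLine(line));
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample lines from the question, including their trailing spaces
        List<String> lines = Arrays.asList("1,2,5  ", "2,4  ", "2,3 ");
        for (String[] parts : splitAll(lines)) {
            System.out.println(Arrays.toString(parts));
        }
    }
}
```

The pattern argument of Spark's split function is also a Java regular expression, so the same "\\s*,\\s*" pattern should be usable there. Combining it with the trim function from org.apache.spark.sql.functions would be one way to handle stray whitespace at the ends of a line.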



Answer 2:


You can use the VectorAssembler class to combine several columns into a single feature vector, which is particularly useful with ML pipelines (the snippet below is in Scala):

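import org.apache.spark.ml.feature.VectorAssembler

// Combine the listed input columns into one vector column named "features"
val assembler = new VectorAssembler()
  .setInputCols(Array("city", "status", "vendor"))
  .setOutputCol("features")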

https://spark.apache.org/docs/2.2.0/ml-features.html#vectorassembler



Source: https://stackoverflow.com/questions/47687194/load-csv-data-in-to-dataframe-and-convert-to-array-using-apache-spark-java
