Dataset-API analog of JavaSparkContext.wholeTextFiles

Submitted by 孤街浪徒 on 2019-12-08 03:58:06

Question


We can call JavaSparkContext.wholeTextFiles and get a JavaPairRDD<String, String>, where the first String is the file name and the second String is the whole file's contents. Is there a similar method in the Dataset API, or is my only option to load the files into a JavaPairRDD and then convert to a Dataset (which works, but I'm looking for a non-RDD solution)?
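For reference, the RDD-based conversion the question describes might look like the following sketch (the path, schema, and column names are illustrative, assuming a local SparkSession):

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class WholeTextFilesToDataset {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("wholeTextFiles-to-Dataset")
                .master("local[*]")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // One (fileName, fileContent) pair per file in the directory
        JavaPairRDD<String, String> files = jsc.wholeTextFiles("c:\\temp");

        // Convert each pair to a Row and attach an explicit schema
        StructType schema = new StructType()
                .add("fileName", DataTypes.StringType)
                .add("content", DataTypes.StringType);
        Dataset<Row> ds = spark.createDataFrame(
                files.map(t -> RowFactory.create(t._1, t._2)), schema);

        ds.show(false);
        spark.stop();
    }
}
```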


Answer 1:


If you want to use the Dataset API, you can use spark.read().text("path/to/files/"). Please check here for the API details. Note that the text() method returns a DataFrame in which "Each line in the text files is a new row in the resulting DataFrame", so text() gives you the file contents line by line. To get the file name, you need the input_file_name() function.

import static org.apache.spark.sql.functions.input_file_name;

Dataset<Row> ds = spark.read().text("c:\\temp")
        .withColumnRenamed("value", "content")
        .withColumn("fileName", input_file_name());
ds.show(false);

If you want to concatenate the rows belonging to the same file, so that each row holds that file's whole content, group by the fileName column and aggregate with the concat_ws and collect_list functions. (Note that joining with an empty separator discards the original line breaks; pass "\n" to concat_ws if you want to keep them.)

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat_ws;
import static org.apache.spark.sql.functions.collect_list;

ds = ds.groupBy(col("fileName"))
        .agg(concat_ws("", collect_list(ds.col("content"))).as("content"));
ds.show(false);
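As a side note, if your Spark version is 2.2 or later, the text reader also accepts a wholetext option that loads each file as a single row, which avoids the groupBy step entirely. A minimal sketch, assuming the same directory as above:

```java
import static org.apache.spark.sql.functions.input_file_name;

// Requires Spark 2.2+: with wholetext, each file becomes exactly one row,
// with line breaks preserved in the value column
Dataset<Row> whole = spark.read()
        .option("wholetext", true)
        .text("c:\\temp")
        .withColumnRenamed("value", "content")
        .withColumn("fileName", input_file_name());
whole.show(false);
```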


来源:https://stackoverflow.com/questions/44651742/dataset-api-analog-of-javasparkcontext-wholetextfiles
