Dataset-API analog of JavaSparkContext.wholeTextFiles

Submitted by 孤街浪徒 on 2019-12-08 03:58:06

Question


We can call JavaSparkContext.wholeTextFiles and get a JavaPairRDD<String, String>, where the first String is the file name and the second String is the whole file's contents. Is there a similar method in the Dataset API, or is my only option to load the files into a JavaPairRDD and then convert to a Dataset (which works, but I'm looking for a non-RDD solution)?
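For reference, the RDD-based conversion the question describes might look like the following sketch (the path, schema, and column names are illustrative, assuming a local SparkSession):

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class WholeTextFilesToDataset {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("wholeTextFiles-to-Dataset")
                .master("local[*]")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // One (fileName, fileContent) pair per file in the directory
        JavaPairRDD<String, String> files = jsc.wholeTextFiles("c:\\temp");

        // Convert each pair to a Row and attach an explicit schema
        StructType schema = new StructType()
                .add("fileName", DataTypes.StringType)
                .add("content", DataTypes.StringType);
        Dataset<Row> ds = spark.createDataFrame(
                files.map(t -> RowFactory.create(t._1, t._2)), schema);

        ds.show(false);
        spark.stop();
    }
}
```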


Answer 1:


If you want to use the Dataset API, you can use spark.read().text("path/to/files/"). Please check here for the API details. Note that the text() method returns a DataFrame in which "Each line in the text files is a new row in the resulting DataFrame", so text() gives you the file contents line by line. To get the file name, you need the input_file_name() function.

import static org.apache.spark.sql.functions.input_file_name;

Dataset<Row> ds = spark.read().text("c:\\temp")
        .withColumnRenamed("value", "content")
        .withColumn("fileName", input_file_name());
ds.show(false);

If you want to concatenate the rows belonging to the same file, so that each row holds that file's whole content, group by the fileName column and aggregate with the concat_ws and collect_list functions. (Note that joining with an empty separator discards the original line breaks; pass "\n" to concat_ws if you want to keep them.)

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat_ws;
import static org.apache.spark.sql.functions.collect_list;

ds = ds.groupBy(col("fileName"))
        .agg(concat_ws("", collect_list(ds.col("content"))).as("content"));
ds.show(false);
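As a side note, if your Spark version is 2.2 or later, the text reader also accepts a wholetext option that loads each file as a single row, which avoids the groupBy step entirely. A minimal sketch, assuming the same directory as above:

```java
import static org.apache.spark.sql.functions.input_file_name;

// Requires Spark 2.2+: with wholetext, each file becomes exactly one row,
// with line breaks preserved in the value column
Dataset<Row> whole = spark.read()
        .option("wholetext", true)
        .text("c:\\temp")
        .withColumnRenamed("value", "content")
        .withColumn("fileName", input_file_name());
whole.show(false);
```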


来源:https://stackoverflow.com/questions/44651742/dataset-api-analog-of-javasparkcontext-wholetextfiles
