Spark Structured Streaming - Read file from Nested Directories

Submitted by 陌路散爱 on 2019-12-07 18:38:42

Question


I have a client which places CSV files in nested directories as shown below, and I need to read these files in real time. I am trying to do this with Spark Structured Streaming.

Data:
/user/data/1.csv
/user/data/2.csv
/user/data/3.csv
/user/data/sub1/1_1.csv
/user/data/sub1/1_2.csv
/user/data/sub1/sub2/2_1.csv
/user/data/sub1/sub2/2_2.csv

Code:

val csvDF = spark
  .readStream
  .option("sep", ",")
  .schema(userSchema)      // Schema of the csv files
  .csv("/user/data/")

Are there any configurations to be added to allow Spark to read from nested directories in Structured Streaming?


Answer 1:


As far as I know, Spark has no such option, but it does support globs in the path.

val csvDF = spark
  .readStream
  .option("sep", ",")
  .schema(userSchema)      // Schema of the csv files
  .csv("/user/data/*/*")

It may help to design a glob path that covers all the levels you need and use it in a single stream.

Hope it helps!
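One caveat with the pattern above: a glob like `*/*` matches files exactly one directory deep, so the question's three-level tree would need a pattern per nesting level. A minimal sketch using Python's standard `glob` module (plain glob semantics, not Spark itself; the temp directory layout mirrors the one from the question):

```python
import glob
import os
import tempfile

# Recreate the question's directory layout in a temp dir (illustration only).
root = tempfile.mkdtemp()
files = [
    "1.csv", "2.csv", "3.csv",
    "sub1/1_1.csv", "sub1/1_2.csv",
    "sub1/sub2/2_1.csv", "sub1/sub2/2_2.csv",
]
for rel in files:
    path = os.path.join(root, rel)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()

# Each depth pattern matches exactly one nesting level.
top    = sorted(glob.glob(os.path.join(root, "*.csv")))             # 1.csv, 2.csv, 3.csv
level1 = sorted(glob.glob(os.path.join(root, "*", "*.csv")))        # sub1/1_1.csv, sub1/1_2.csv
level2 = sorted(glob.glob(os.path.join(root, "*", "*", "*.csv")))   # sub1/sub2/2_1.csv, sub1/sub2/2_2.csv

print(len(top), len(level1), len(level2))  # → 3 2 2
```

Hadoop-style path globs (which Spark uses) additionally support `{a,b}` alternation, so a single path such as `/user/data/{*.csv,*/*.csv,*/*/*.csv}` could cover all three levels in one stream.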




Answer 2:


I am able to stream the files in sub-directories using a glob path.

Posting here for the sake of others.

inputPath = "/spark_structured_input/*?*"   # glob matching any non-empty name at this level
inputDF = spark.readStream.option("header", "true").schema(userSchema).csv(inputPath)
query = inputDF.writeStream.format("console").start()
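For reference, `*?*` is simply a glob that matches any name of at least one character, so it matches both files and sub-directory names at that level; presumably the stream then picks up the files inside the matched directories. A quick check of the glob semantics with Python's `fnmatch` (this illustrates the pattern only, not Spark's directory-listing behavior):

```python
from fnmatch import fnmatchcase

# Names as they might appear directly under /spark_structured_input/.
names = ["1.csv", "sub1", ".hidden", ""]

# "*?*" requires at least one character: "*" can match empty, "?" cannot.
matches = [n for n in names if fnmatchcase(n, "*?*")]
print(matches)  # → ['1.csv', 'sub1', '.hidden']
```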


Source: https://stackoverflow.com/questions/51605098/spark-structured-streaming-read-file-from-nested-directories
