Spark Streaming: Join DStream batches into a single output folder


Question


I am using Spark Streaming to fetch tweets from Twitter by creating a StreamingContext:
val ssc = new StreamingContext("local[3]", "TwitterFeed", Minutes(1))

and creating a Twitter stream:
val tweetStream = TwitterUtils.createStream(ssc, Some(new OAuthAuthorization(Util.config)), filters)

then saving it as text files:
tweetStream.repartition(1).saveAsTextFiles("/tmp/spark_testing/")

The problem is that each batch is saved to its own folder named after the batch time (saveAsTextFiles generates one "prefix-TIME_IN_MS" directory per batch), but I need the data from all batches in the same folder.

Is there any workaround for it?

Thanks


Answer 1:


We can do this using Spark SQL's new DataFrame saving API, which allows appending to existing output. By default, saveAsTextFile won't save to a directory that already contains data (see https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes ). https://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations covers how to set up a Spark SQL context for use with Spark Streaming.

Assuming you copy the SQLContextSingleton helper from that guide, the resulting code would look something like:

// requires: import org.apache.spark.sql.SaveMode
data.foreachRDD { rdd =>
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  // Convert your data to a DataFrame; how depends on the structure of your data
  val df = ???
  // SaveMode.Append lets every batch write into the same output path
  df.save("org.apache.spark.sql.json", SaveMode.Append, Map("path" -> path.toString)) // `path` is your output directory
}

(Note the above example uses JSON to save the result, but you can use other output formats too.)
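For concreteness, here is a minimal end-to-end sketch of the same pattern, assuming tweetStream is the DStream[twitter4j.Status] from the question, a Spark 1.x API, and /tmp/spark_testing/tweets as the target directory; the Tweet case class and its fields are illustrative placeholders only:

import org.apache.spark.SparkContext
import org.apache.spark.sql.{SQLContext, SaveMode}
import twitter4j.Status

// Lazily instantiated singleton SQLContext (pattern from the streaming guide)
object SQLContextSingleton {
  @transient private var instance: SQLContext = _
  def getInstance(sparkContext: SparkContext): SQLContext = synchronized {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}

// Hypothetical record type; pick whatever Status fields you need
case class Tweet(user: String, text: String)

tweetStream.foreachRDD { rdd =>
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  import sqlContext.implicits._

  // Map each twitter4j.Status to a case class so toDF() can infer the schema
  val df = rdd.map(s => Tweet(s.getUser.getScreenName, s.getText)).toDF()

  // SaveMode.Append adds each batch's part files to the same directory
  // instead of failing because the path already exists
  df.save("org.apache.spark.sql.json", SaveMode.Append,
    Map("path" -> "/tmp/spark_testing/tweets"))
}

On Spark 1.4+ the same write can be expressed as df.write.mode(SaveMode.Append).json("/tmp/spark_testing/tweets"); either way, every batch appends new part files to one directory instead of creating a per-batch folder.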



Source: https://stackoverflow.com/questions/30237877/spark-streaming-join-dstream-batches-into-single-output-folder
