Reading files from Apache Spark textFileStream


Question


I'm trying to read/monitor txt files from a Hadoop file system directory. But I've noticed that all the txt files inside this directory are themselves directories, as shown in the example below:

/crawlerOutput/b6b95b75148cdac44cd55d93fe2bbaa76aa5cccecf3d723c5e47d361b28663be-1427922269.txt/_SUCCESS   
/crawlerOutput/b6b95b75148cdac44cd55d93fe2bbaa76aa5cccecf3d723c5e47d361b28663be-1427922269.txt/part-00000
/crawlerOutput/b6b95b75148cdac44cd55d93fe2bbaa76aa5cccecf3d723c5e47d361b28663be-1427922269.txt/part-00001

I want to read all the data inside the part files. I'm trying to use the following snippet:

val testData = ssc.textFileStream("/crawlerOutput/*/*")

But, unfortunately, it says that /crawlerOutput/*/* doesn't exist. Doesn't textFileStream accept wildcards? What should I do to solve this problem?


Answer 1:


textFileStream() is just a wrapper for fileStream() and does not support subdirectories (see https://spark.apache.org/docs/1.3.0/streaming-programming-guide.html).

You would need to list the specific directories to monitor. If you need to detect new directories, a StreamingListener could be used to watch for them; you could then stop the streaming context and restart it with the new values.
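
For example, here is a minimal sketch of monitoring a fixed list of directories by unioning one textFileStream per directory. Only the first directory path comes from the question; the app name, batch interval, and the print action are illustrative assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("CrawlerOutputMonitor")  // hypothetical app name
val ssc = new StreamingContext(conf, Seconds(30))              // illustrative batch interval

// textFileStream() does not recurse, so monitor each known subdirectory
// separately and union the resulting streams into a single DStream.
val dirs = Seq(
  "/crawlerOutput/b6b95b75148cdac44cd55d93fe2bbaa76aa5cccecf3d723c5e47d361b28663be-1427922269.txt"
  // ...one entry per directory to monitor
)
val testData = dirs.map(ssc.textFileStream).reduce(_ union _)

testData.print()
ssc.start()
ssc.awaitTermination()

To pick up a directory created later, you would stop this context and start a new one with the updated list of directories.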

Just thinking out loud: if you intend to process each subdirectory only once and merely need to detect these new directories, you could potentially key off another location that contains job info or a file token. Once such a token appears, it could be consumed in the streaming context, and the appropriate textFile() call could ingest the new path.
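
A hedged sketch of that token idea, reusing ssc from the snippet above. The /crawlerTokens location, the one-path-per-line token format, and the count action are all hypothetical:

// Watch a token drop location; each token line is assumed to hold the
// HDFS path of a crawl directory that has finished writing.
val tokens = ssc.textFileStream("/crawlerTokens")  // hypothetical token location

tokens.foreachRDD { rdd =>
  // foreachRDD runs on the driver, so collect() brings the token lines
  // somewhere textFile() can legally be called.
  rdd.collect().foreach { path =>
    val crawlData = rdd.sparkContext.textFile(path)  // one-off batch read of the new directory
    println(s"$path: ${crawlData.count()} lines")    // placeholder downstream processing
  }
}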



Source: https://stackoverflow.com/questions/29401809/reading-files-from-apache-spark-textfilestream
