Question
I have an S3 bucket with nearly 100k gzipped JSON files. These files are called [timestamp].json instead of the more sensible [timestamp].json.gz. I have other processes that use them, so renaming is not an option and copying them is even less ideal. I am using spark.read.json([pattern]) to read these files. If I rename the files to end in .gz this works fine, but while the extension is just .json they cannot be read. Is there any way I can tell Spark that these files are gzipped?
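For concreteness, a minimal sketch of the read described above; the bucket and prefix (s3a://my-bucket/events/) are placeholders, not the asker's real paths:

// Files under this prefix are gzip-compressed but named [timestamp].json,
// so extension-based codec detection does not kick in and the read fails.
val df = spark.read.json("s3a://my-bucket/events/*.json")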
Answer 1:
SparkSession can read a compressed JSON file directly, like this:
// Spark picks the gzip codec from the ".gz" extension and decompresses on read.
val json = spark.read.json("/user/the_file_path/the_json_file.log.gz")
json.printSchema()
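The snippet above works because Spark (via Hadoop's compression codec factory) picks the gzip codec from the .gz suffix. For files that lack the suffix, as in the question, one commonly used workaround (not part of this answer) is to register a custom Hadoop codec that claims the .json extension and decompresses it as gzip. A minimal sketch, assuming the class is compiled into the application jar so it is on the driver and executor classpath; the package, class name, bucket, and prefix are placeholders:

package com.example

import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.sql.SparkSession

// Gzip codec that claims the ".json" extension, so Hadoop treats
// [timestamp].json files as gzip-compressed when reading them.
class GzippedJsonCodec extends GzipCodec {
  override def getDefaultExtension(): String = ".json"
}

object ReadGzippedJson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-gzipped-json")
      // Register the custom codec with Hadoop's codec factory.
      .config("spark.hadoop.io.compression.codecs", "com.example.GzippedJsonCodec")
      .getOrCreate()

    // No renaming or copying needed: .json files are now decompressed as gzip.
    val df = spark.read.json("s3a://my-bucket/events/*.json")
    df.printSchema()
  }
}

Note that with this codec registered, every .json file read through Hadoop in that application is treated as gzipped, so it should only be enabled for jobs where that assumption holds.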
Source: https://stackoverflow.com/questions/52253022/can-i-tell-spark-read-json-that-my-files-are-gzipped