Spark: writing DataFrame as compressed JSON
Apache Spark's DataFrameReader.json() can handle gzipped JSON Lines files automatically, but there doesn't seem to be a way to get DataFrameWriter.json() to write compressed JSON Lines files. The extra network I/O is very expensive in the cloud. Is there a way around this problem?

Answer (giorgioca):

The following solutions use pyspark, but I assume the code in Scala would be similar.

The first option is to set the following when you initialise your SparkConf:

    conf = SparkConf()
    conf.set("spark.hadoop.mapred.output.compress", "true")
    conf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
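To make this concrete, here is a minimal end-to-end sketch of that first option. It assumes the truncated codec string in the original answer refers to Hadoop's standard gzip codec class (org.apache.hadoop.io.compress.GzipCodec); the output path and example DataFrame are made up for illustration:

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # Enable Hadoop output compression and pick the gzip codec before the
    # session is created, so the settings apply to all subsequent writes.
    conf = SparkConf()
    conf.set("spark.hadoop.mapred.output.compress", "true")
    conf.set("spark.hadoop.mapred.output.compression.codec",
             "org.apache.hadoop.io.compress.GzipCodec")

    spark = SparkSession.builder.config(conf=conf).getOrCreate()

    # Hypothetical example data; any DataFrame is written the same way.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

    # Each part file under the output directory should now come out
    # gzip-compressed (e.g. part-00000.gz).
    df.write.json("/tmp/compressed_output")

Because the codec is set on the SparkConf itself, every subsequent write in the session is compressed, which is what cuts the bytes shipped over the network.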