I am very new to PySpark. I tried parsing a JSON file using the following code:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.json("path/to/input.json")
Spark >= 2.2:
You can use the multiLine argument for the JSON reader:
spark.read.json(path_to_input, multiLine=True)
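For example, given a pretty-printed file, this is all that is needed; multiLine tells the reader to treat each file as a single JSON document instead of expecting one document per line. A minimal sketch (path_to_input is a placeholder, as above):

# assume path_to_input points at a pretty-printed file such as:
# {
#     "name": "Alice",
#     "age": 30
# }
df = spark.read.json(path_to_input, multiLine=True)
df.show()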
Spark < 2.2:
There is an almost universal, but rather expensive, solution that can be used to read multiline JSON files: load the data with SparkContext.wholeTextFiles and then parse it with DataFrameReader.json. As long as there are no other problems with your data, it should do the trick:
spark.read.json(sc.wholeTextFiles(path_to_input).values())
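To unpack that one-liner: SparkContext.wholeTextFiles returns an RDD of (path, content) pairs, values() keeps just the file contents, and DataFrameReader.json also accepts an RDD of strings, parsing each element as one JSON document. It is expensive because every file becomes a single record, so a large file cannot be split across tasks. A slightly expanded sketch of the same thing (path_to_input is a placeholder):

# each input file becomes one (path, content) pair
raw = sc.wholeTextFiles(path_to_input)

# keep only the contents: one whole-file JSON string per element
json_strings = raw.values()

# parse each string as a complete JSON document
df = spark.read.json(json_strings)
df.printSchema()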
I experienced a similar issue. When Spark reads a JSON file, it expects each line to be a separate JSON object, so it will fail if you try to load a pretty-printed JSON file. My workaround was to minify the JSON file before Spark read it.
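A minimal sketch of that preprocessing step, assuming a local pretty-printed file small enough to load into memory (the file names here are placeholders); it rewrites the data as one minified JSON object per line, which is the layout Spark's line-based reader expects:

import json

# parse the pretty-printed file with the standard library
with open("input.json") as f:
    data = json.load(f)

# write one minified JSON object per line (JSON Lines)
with open("input.jsonl", "w") as f:
    records = data if isinstance(data, list) else [data]
    for record in records:
        f.write(json.dumps(record) + "\n")

df = spark.read.json("input.jsonl")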