JSON file parsing in Pyspark

Backend · Unresolved · 2 answers · 369 views

Asked by 温柔的废话 on 2020-12-30 15:31

I am very new to Pyspark. I tried parsing the JSON file using the following code:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.json(...)

2 Answers
  • 2020-12-30 16:00

    Spark >= 2.2:

    You can use the multiLine argument of the JSON reader:

    spark.read.json(path_to_input, multiLine=True)
    

    Spark < 2.2

    There is an almost universal, but rather expensive, solution that can be used to read multiline JSON files:

    • Read data using SparkContext.wholeTextFiles.
    • Drop keys (file names).
    • Pass the result to the DataFrameReader.json.

    As long as there are no other problems with your data, it should do the trick:

    spark.read.json(sc.wholeTextFiles(path_to_input).values())
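
    The three steps above can be mimicked with plain Python and the standard json module, to show why reading whole files sidesteps the one-object-per-line limitation (the file names and contents below are made up for illustration; no Spark session is needed):

    ```python
    import json

    # wholeTextFiles yields one (filename, full_file_content) pair per file;
    # here each content string is a pretty-printed (multiline) JSON document.
    files = [
        ("file1.json", '{\n  "name": "alice",\n  "age": 30\n}'),
        ("file2.json", '{\n  "name": "bob",\n  "age": 25\n}'),
    ]

    # Step 2: drop the keys (file names), keeping only the full contents.
    contents = [content for _, content in files]

    # Step 3: parse each whole document at once -- internal line breaks
    # no longer matter, which is exactly what DataFrameReader.json gets
    # to work with after the wholeTextFiles trick.
    records = [json.loads(c) for c in contents]

    print(records)
    ```

    The expensive part in real Spark is that wholeTextFiles materializes each entire file as a single record, so very large files cannot be split across tasks.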
    
  • 2020-12-30 16:00

    I experienced a similar issue. When Spark reads a JSON file, it expects each line to be a separate JSON object, so it will fail if you try to load a pretty-printed JSON file. My workaround was to minify the JSON file that Spark was reading.
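
    A minimal sketch of that minification step, using only the standard json module (the sample data is invented for illustration): load the pretty-printed JSON, then rewrite it with one compact object per line, which is the line-delimited layout spark.read.json expects by default.

    ```python
    import json

    # A pretty-printed JSON array, spread across several lines --
    # the shape that trips up Spark's default line-by-line reader.
    pretty = """[
      {"name": "alice", "age": 30},
      {"name": "bob", "age": 25}
    ]"""

    # Parse once, then emit one compact JSON object per line
    # ("JSON Lines" format), which Spark can read directly.
    records = json.loads(pretty)
    minified = "\n".join(json.dumps(r) for r in records)

    print(minified)
    ```

    Writing the minified string to a file and pointing spark.read.json at it then works without any extra options.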
