How to load directory of JSON files into Apache Spark in Python

时光取名叫无心 · 2021-01-02 13:46

I'm relatively new to Apache Spark, and I want to create a single RDD in Python from lists of dictionaries that are saved in multiple JSON files (each is gzipped and contains a list of dictionaries).

4 Answers
  •  日久生厌
    2021-01-02 14:27

    You can use sqlContext.jsonFile() to get a SchemaRDD (which is an RDD[Row] plus a schema) that can then be used with Spark SQL. Or see Loading JSON dataset into Spark, then use filter, map, etc. for a non-SQL processing pipeline. I think you may have to unzip the files, and note that Spark can only handle files where each line is a single JSON document (i.e., multi-line JSON objects are not supported).
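    A minimal sketch of the conversion step, assuming each source file is a single gzipped JSON array of dicts (the paths and function name are illustrative). Since Spark's JSON reader expects one JSON document per line, each file is rewritten in JSON Lines form first; the Spark calls themselves are shown as comments:

    ```python
    import gzip
    import json

    def to_json_lines(src_path, dst_path):
        # Read a gzipped file containing one JSON list of dicts...
        with gzip.open(src_path, "rt") as src:
            records = json.load(src)
        # ...and rewrite it with one JSON document per line, the layout
        # Spark's JSON reader expects.
        with gzip.open(dst_path, "wt") as dst:
            for record in records:
                dst.write(json.dumps(record) + "\n")

    # Once every file in the directory is converted, the whole directory
    # can be loaded in one call (Spark reads .gz text files transparently):
    #   rows = sqlContext.jsonFile("data/converted/*.json.gz")  # Spark 1.x
    #   rows = spark.read.json("data/converted/")               # Spark 2+
    ```

    The conversion keeps the files gzipped, since Spark decompresses .gz input on the fly; only the one-document-per-line constraint has to be satisfied by hand.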
