How to load directory of JSON files into Apache Spark in Python

Asked by 时光取名叫无心 on 2021-01-02 13:46 · 4 answers · 1549 views · unresolved

I'm relatively new to Apache Spark, and I want to create a single RDD in Python from lists of dictionaries that are saved in multiple JSON files (each is gzipped and contains a list of dictionaries).

4 Answers
  • 2021-01-02 14:27

    You can use sqlContext.jsonFile() to get a SchemaRDD (which is an RDD[Row] plus a schema) that can then be used with Spark SQL. Or see Loading JSON dataset into Spark, then use filter, map, etc. for a non-SQL processing pipeline. I think you may have to unzip the files, and note that Spark can only work with files where each line is a single JSON document (i.e., no multiline objects possible).
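    To see why the one-document-per-line constraint matters, here is a small pure-Python sketch (no Spark needed, sample data made up) of how line-by-line parsing behaves on the two layouts:

    ```python
    import json

    # Line-delimited JSON: one complete document per line parses cleanly.
    line_delimited = '{"id": 1}\n{"id": 2}\n{"id": 3}'
    records = [json.loads(line) for line in line_delimited.splitlines()]
    print([r["id"] for r in records])  # [1, 2, 3]

    # The same data as one pretty-printed top-level list cannot be parsed
    # line by line: the first line is just "[", which is not valid JSON.
    multiline = '[\n  {"id": 1},\n  {"id": 2}\n]'
    try:
        [json.loads(line) for line in multiline.splitlines()]
    except json.JSONDecodeError:
        print("line-by-line parsing fails on multiline JSON")
    ```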

  • 2021-01-02 14:31

    You can load a directory of files into a single RDD using textFile, which also supports wildcards. That won't give you the file names, but you don't seem to need them.

    You can also mix Spark SQL with basic transformations like map and filter: a SchemaRDD is itself an RDD (in Python as well as Scala).
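    Conceptually, `sc.textFile("dir/*.json")` expands the wildcard and reads every matching file's lines into one flat collection. A pure-Python sketch of that behavior, using made-up file names in a temporary directory:

    ```python
    import glob
    import json
    import os
    import tempfile

    # Create two tiny single-line JSON files to stand in for a directory
    # of data files (names and contents are hypothetical).
    tmpdir = tempfile.mkdtemp()
    for name, payload in [("a.json", {"id": 1}), ("b.json", {"id": 2})]:
        with open(os.path.join(tmpdir, name), "w") as f:
            f.write(json.dumps(payload) + "\n")

    # Expand the wildcard, then gather all lines from all matches into
    # one flat list -- roughly what textFile does across a cluster.
    lines = []
    for path in sorted(glob.glob(os.path.join(tmpdir, "*.json"))):
        with open(path) as f:
            lines.extend(f.read().splitlines())

    records = [json.loads(line) for line in lines]
    print([r["id"] for r in records])  # [1, 2]
    ```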

  • 2021-01-02 14:43

    Following what tgpfeiffer mentioned in their answer and comment, here's what I did.

    First, as they mentioned, the JSON files had to be formatted so they had one dictionary per line rather than a single list of dictionaries. Then, it was as simple as:

    import json

    # Read every line of every file in the directory, then parse each
    # line into a dictionary.
    my_RDD_strings = sc.textFile(path_to_dir_with_JSON_files)
    my_RDD_dictionaries = my_RDD_strings.map(json.loads)
    

    If there's a better or more efficient way to do this, please let me know, but this seems to work.

  • 2021-01-02 14:45

    To load a list of JSON objects from a file as an RDD:

    import json

    # wholeTextFiles yields (filename, content) pairs; parsing the content
    # gives the top-level list, which flatMap splits into individual records.
    def flat_map_json(x):
        return json.loads(x[1])

    rdd = sc.wholeTextFiles('example.json').flatMap(flat_map_json)
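    The flatMap step can be checked without a cluster by calling the function on a hand-built (filename, content) pair, which is the shape wholeTextFiles produces (the sample data here is made up):

    ```python
    import json

    # Same helper as above: parse the file body (element 1 of the pair)
    # into its top-level list of records.
    def flat_map_json(x):
        return json.loads(x[1])

    pair = ("example.json", '[{"id": 1}, {"id": 2}]')
    records = flat_map_json(pair)
    print(records)  # [{'id': 1}, {'id': 2}]
    ```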
    