How to load a directory of JSON files into Apache Spark in Python

时光取名叫无心 2021-01-02 13:46

I'm relatively new to Apache Spark, and I want to create a single RDD in Python from lists of dictionaries that are saved in multiple JSON files (each is gzipped and contai

4 Answers
  •  梦毁少年i
    2021-01-02 14:43

    Following what tgpfeiffer mentioned in their answer and comment, here's what I did.

    First, as they mentioned, the JSON files had to be formatted so they had one dictionary per line rather than a single list of dictionaries. Then, it was as simple as:

    import json

    my_RDD_strings = sc.textFile(path_to_dir_with_JSON_files)  # sc: existing SparkContext; one string per line
    my_RDD_dictionaries = my_RDD_strings.map(json.loads)  # parse each line into a dict
    

    If there's a better or more efficient way to do this, please let me know, but this seems to work.
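
    For the reformatting step mentioned above (one dictionary per line instead of a single list), here is a minimal sketch using only the Python standard library. It assumes each source file is a gzipped JSON document whose top level is a list of dictionaries and that the list fits in memory. The function name to_json_lines and both path arguments are hypothetical, not part of the original answer.

```python
import gzip
import json


def to_json_lines(src_path, dst_path):
    """Rewrite a gzipped JSON file holding one list of dicts as gzipped JSON Lines.

    Hypothetical helper: assumes the top-level JSON value is a list.
    Returns the number of records written.
    """
    with gzip.open(src_path, "rt", encoding="utf-8") as src:
        records = json.load(src)  # loads the whole list into memory

    with gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for record in records:
            # json.dumps escapes newlines inside strings and adds no indentation
            # by default, so every record occupies exactly one line.
            dst.write(json.dumps(record) + "\n")

    return len(records)
```

    Running this once per input file produces files in which every line can be parsed with json.loads on its own, which is the shape the sc.textFile(...).map(json.loads) approach relies on. The output is written gzipped here on the understanding that sc.textFile decompresses .gz files transparently; writing plain text instead would work the same way.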
