SPARK read.json throwing java.io.IOException: Too many bytes before newline


Yep, you have more than Integer.MAX_VALUE bytes in your line. You need to split it up.
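If you want to confirm that before reshaping the file, a quick scan like this one (a minimal sketch; the file name and chunk size are placeholders) reports the longest run of bytes between newlines without ever buffering a whole line in memory:

def longest_line_bytes(path, chunk_size=1 << 20):
    # Stream the file in fixed-size chunks so a multi-gigabyte "line"
    # never has to fit in memory at once.
    longest = current = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            pieces = chunk.split(b"\n")
            # every piece except the last is terminated by a newline
            for piece in pieces[:-1]:
                longest = max(longest, current + len(piece))
                current = 0
            # the last piece may continue into the next chunk
            current += len(pieces[-1])
    return max(longest, current)

# Anything above 2147483647 (Integer.MAX_VALUE) will trip Spark's line reader.
print(longest_line_bytes("my_huge_file.json"))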

Keep in mind that Spark expects each line to be a valid JSON document, not the file as a whole. Below is the relevant passage from the Spark SQL Programming Guide:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

So if your JSON document is in the form...

[
  { [record] },
  { [record] }
]

You'll want to change it to

{ [record] }
{ [record] }
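Alternatively, if you are on Spark 2.2 or later, you may be able to read the array-style file directly with the multiLine option instead of rewriting it. That mode skips the line-based reader, but the whole file is then handled as a single unit per task, so it can be slow or memory-hungry for very large files (a sketch; the path is a placeholder):

# Spark 2.2+ only: parse the array-style file as one multi-line JSON document
# instead of one JSON object per line.
df = spark.read.option("multiLine", "true").json("path/to/array_style.json")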

I stumbled upon this while reading a huge JSON file in PySpark and getting the same error. So, if anyone else is wondering how to save a JSON file in a format that PySpark can read properly, here is a quick example using pandas:

import pandas as pd
from collections import defaultdict

# create some dict you want to dump
list_of_things_to_dump = [1, 2, 3, 4, 5]
dump_dict = defaultdict(list)
for number in list_of_things_to_dump:
    dump_dict["my_number"].append(number)

# save data like this using pandas; it will work off the bat with PySpark
output_df = pd.DataFrame.from_dict(dump_dict)
with open('my_fancy_json.json', 'w') as f:
    f.write(output_df.to_json(orient='records', lines=True))

After that, loading the JSON in PySpark is as easy as:

df = spark.read.json("hdfs:///user/best_user/my_fancy_json.json", schema=schema)
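Note that schema in that call is assumed to be a StructType you have already defined; you can also drop the argument entirely and let Spark infer the schema at read time. A hypothetical schema matching the example above would look like:

from pyspark.sql.types import StructType, StructField, LongType

# Hypothetical schema for the "my_number" records written above.
schema = StructType([StructField("my_number", LongType(), True)])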