Reading multiple JSON records into a Pandas dataframe

粉色の甜心 · asked 2020-12-04 17:41

I'd like to know if there is a memory-efficient way of reading a multi-record JSON file (each line is a JSON dict) into a pandas DataFrame. Below is a 2-line example with wo…

4 Answers
  • 2020-12-04 18:02

    ++++++++Update++++++++++++++

    As of v0.19, Pandas supports this natively (see https://github.com/pandas-dev/pandas/pull/13351). Just run:

    df=pd.read_json('test.json', lines=True)
    
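    Since the question is about memory efficiency, it's worth noting that newer pandas versions (0.21+) also accept chunksize together with lines=True, returning an iterator of DataFrames instead of loading the whole file at once. A minimal sketch (the chunk size of 10000 and the process() function are placeholders, not from the original answer):

    reader = pd.read_json('test.json', lines=True, chunksize=10000)
    for chunk in reader:
        process(chunk)  # placeholder: handle each partial DataFrame here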

    ++++++++Old Answer++++++++++

    The existing answers are good, but for a little variety, here is another way to accomplish your goal that requires a simple pre-processing step outside of python so that pd.read_json() can consume the data.

    • Install jq https://stedolan.github.io/jq/.
    • Create a valid json file with cat test.json | jq -c --slurp . > valid_test.json
    • Create dataframe with df=pd.read_json('valid_test.json')

    In an IPython notebook, you can run the shell command directly from a cell with

    !cat test.json | jq -c --slurp . > valid_test.json
    df=pd.read_json('valid_test.json')
    
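    If you prefer to stay inside a plain Python script, the same pre-processing step can be run with subprocess. A sketch, assuming jq is on your PATH and reusing the file names from above:

    import subprocess
    import pandas as pd

    # run jq to slurp the newline-delimited records into one valid JSON array
    with open('valid_test.json', 'w') as out:
        subprocess.run(['jq', '-c', '--slurp', '.', 'test.json'], stdout=out, check=True)

    df = pd.read_json('valid_test.json')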
  • 2020-12-04 18:12

    Note: line-delimited JSON is now supported in read_json (since 0.19.0):

    In [31]: pd.read_json('{"a":1,"b":2}\n{"a":3,"b":4}', lines=True)
    Out[31]:
       a  b
    0  1  2
    1  3  4
    

    or with a file/filepath rather than a json string:

    pd.read_json(json_file, lines=True)
    

    Which is faster will depend on the size of your DataFrames, but another option is to use str.join to smash your multi-line "JSON" (note: it's not valid JSON) into valid JSON and use read_json:

    In [11]: '[%s]' % ','.join(test.splitlines())
    Out[11]: '[{"a":1,"b":2},{"a":3,"b":4}]'
    
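    Putting that together, a self-contained sketch (the test string here is a hypothetical two-record input; StringIO is used because newer pandas versions deprecate passing a literal JSON string to read_json):

    from io import StringIO
    import pandas as pd

    test = '{"a":1,"b":2}\n{"a":3,"b":4}'  # hypothetical newline-delimited records

    # wrap the lines in [...] with commas so the whole thing is one valid JSON array
    df = pd.read_json(StringIO('[%s]' % ','.join(test.splitlines())))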

    For this tiny example this is slower; at around 100 records the two are similar, and there are significant gains if it's larger...

    In [21]: %timeit pd.read_json('[%s]' % ','.join(test.splitlines()))
    1000 loops, best of 3: 977 µs per loop
    
    In [22]: %timeit l=[ json.loads(l) for l in test.splitlines()]; df = pd.DataFrame(l)
    1000 loops, best of 3: 282 µs per loop
    
    In [23]: test_100 = '\n'.join([test] * 100)
    
    In [24]: %timeit pd.read_json('[%s]' % ','.join(test_100.splitlines()))
    1000 loops, best of 3: 1.25 ms per loop
    
    In [25]: %timeit l = [json.loads(l) for l in test_100.splitlines()]; df = pd.DataFrame(l)
    1000 loops, best of 3: 1.25 ms per loop
    
    In [26]: test_1000 = '\n'.join([test] * 1000)
    
    In [27]: %timeit l = [json.loads(l) for l in test_1000.splitlines()]; df = pd.DataFrame(l)
    100 loops, best of 3: 9.78 ms per loop
    
    In [28]: %timeit pd.read_json('[%s]' % ','.join(test_1000.splitlines()))
    100 loops, best of 3: 3.36 ms per loop
    

    Note: of that time, the join itself is surprisingly fast.

  • 2020-12-04 18:23

    As of Pandas 0.19, read_json has native support for line-delimited JSON:

    pd.read_json(jsonfile, lines=True)
    
  • 2020-12-04 18:26

    If you are trying to save memory, then reading the file a line at a time will be much more memory efficient:

    import json           # or: import simplejson as json (see below)
    import pandas as pd
    with open('test.json') as f:
        data = pd.DataFrame(json.loads(line) for line in f)
    

    Also, if you import simplejson as json, the compiled C extensions included with simplejson are much faster than the pure-Python json module.
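    For example, a drop-in swap might look like this (a sketch; it assumes simplejson is installed and falls back to the standard library otherwise):

    try:
        import simplejson as json  # C-accelerated, API-compatible with the stdlib json module
    except ImportError:
        import json                # fall back to the standard library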
