Reading multiple JSON records into a Pandas dataframe

粉色の甜心 2020-12-04 17:41

I'd like to know if there is a memory-efficient way of reading a multi-record JSON file (each line is a JSON dict) into a pandas DataFrame. Below is a two-line example.

4 Answers
  •  眼角桃花
    2020-12-04 18:12

    Note: line-delimited JSON is now supported in read_json (since pandas 0.19.0):

    In [31]: pd.read_json('{"a":1,"b":2}\n{"a":3,"b":4}', lines=True)
    Out[31]:
       a  b
    0  1  2
    1  3  4
    

    or with a file/filepath rather than a JSON string:

    pd.read_json(json_file, lines=True)
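    Since the question is about memory efficiency: read_json also accepts a chunksize parameter when lines=True, so large files can be consumed in pieces rather than loaded whole. A minimal sketch (using an in-memory StringIO in place of a real file path):

```python
import io
import pandas as pd

# Two JSON records, one per line (JSON Lines / ndjson format).
data = '{"a":1,"b":2}\n{"a":3,"b":4}'

# lines=True + chunksize returns an iterator of small DataFrames
# instead of materializing the whole file at once.
chunks = pd.read_json(io.StringIO(data), lines=True, chunksize=1)
df = pd.concat(chunks, ignore_index=True)
```

    For a file on disk, pass the path instead of the StringIO object; each chunk can also be processed and discarded inside the loop to keep peak memory low.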
    

    Which is faster will depend on the size of your DataFrame, but another option is to use str.join to smash your multi-line "JSON" (note: it's not valid JSON) into valid JSON, then use read_json:

    In [11]: '[%s]' % ','.join(test.splitlines())
    Out[11]: '[{"a":1,"b":2},{"a":3,"b":4}]'
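    Put together, the join trick as a self-contained sketch (wrapping the raw string in StringIO, which newer pandas versions prefer over passing a literal JSON string):

```python
import io
import pandas as pd

# Newline-separated records -- not valid JSON on their own.
test = '{"a":1,"b":2}\n{"a":3,"b":4}'

# Comma-join the lines and wrap in [...] to form one valid JSON array.
valid = '[%s]' % ','.join(test.splitlines())
df = pd.read_json(io.StringIO(valid))
```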
    

    For this tiny example this is slower; at around 100 lines the two are similar, but there are significant gains if it's larger...

    In [21]: %timeit pd.read_json('[%s]' % ','.join(test.splitlines()))
    1000 loops, best of 3: 977 µs per loop
    
    In [22]: %timeit l=[ json.loads(l) for l in test.splitlines()]; df = pd.DataFrame(l)
    1000 loops, best of 3: 282 µs per loop
    
    In [23]: test_100 = '\n'.join([test] * 100)
    
    In [24]: %timeit pd.read_json('[%s]' % ','.join(test_100.splitlines()))
    1000 loops, best of 3: 1.25 ms per loop
    
    In [25]: %timeit l = [json.loads(l) for l in test_100.splitlines()]; df = pd.DataFrame(l)
    1000 loops, best of 3: 1.25 ms per loop
    
    In [26]: test_1000 = '\n'.join([test] * 1000)
    
    In [27]: %timeit l = [json.loads(l) for l in test_1000.splitlines()]; df = pd.DataFrame(l)
    100 loops, best of 3: 9.78 ms per loop
    
    In [28]: %timeit pd.read_json('[%s]' % ','.join(test_1000.splitlines()))
    100 loops, best of 3: 3.36 ms per loop
    

    Note: given those times, the join itself is surprisingly fast.
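    For reference, the line-by-line json.loads variant timed above, written out self-contained:

```python
import json
import pandas as pd

test = '{"a":1,"b":2}\n{"a":3,"b":4}'

# Parse each line independently, then build the DataFrame
# from the resulting list of dicts.
records = [json.loads(line) for line in test.splitlines()]
df = pd.DataFrame(records)
```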
