Optimization of json.load() to reduce in-memory usage and time in Python

Submitted by 余生颓废 on 2020-07-22 05:57:37

Question


I have 10K folders, each containing 200 records spread across 200 JSON files. I am trying to compile all records into one data frame and finally into a CSV (other format suggestions are welcome).

Here is my working solution, which takes around 8.3 hours just for the dataframe-building step (not counting the conversion to CSV).

%%time
import json
from pathlib import Path
import pandas as pd

finalDf = pd.DataFrame()
rootdir = '/path/foldername'
all_files = Path(rootdir).rglob('*.json')
for filename in all_files:
    with open(filename, 'r') as f:
        data = json.load(f)
        df = pd.json_normalize(data).drop(columns=['A']).rename(columns={'B': 'Date'})
        finalDf = finalDf.append(df, ignore_index=True)

Any suggestions to optimize this and bring the time down?


Answer 1:


If the goal is just to write the CSV, you can use multiprocessing to parallelize the read/deserialize/serialize steps and control the file writes with a lock. With a CSV you don't have to hold the whole thing in memory; just append each DataFrame as it's generated. If you are using hard drives instead of an SSD, you may also get a boost if the CSV is on a different drive (not just a different partition).

import multiprocessing as mp
import json
import pandas as pd
from pathlib import Path
import os

def update_csv(args):
    lock, infile, outfile = args
    with open(infile) as f:
        data = json.load(f)
    df = pd.json_normalize(data).drop(columns=['A']).rename(columns={'B': 'Date'})
    with lock:
        with open(outfile, mode="a", newline="") as f:
            # write the header only while the file is still empty
            df.to_csv(f, header=f.tell() == 0, index=False)

if __name__ == "__main__":
    rootdir = '/path/foldername'
    outfile = 'myoutput.csv'
    if os.path.exists(outfile):
        os.remove(outfile)
    all_files = [str(p) for p in Path(rootdir).rglob('*.json')]
    mgr = mp.Manager()
    lock = mgr.Lock()
    # pool sizing is a bit of a guess....
    with mp.Pool(mp.cpu_count() - 1) as pool:
        result = pool.map(update_csv, [(lock, fn, outfile) for fn in all_files],
                          chunksize=1)

Personally, I prefer to use a file-system lock file for this type of thing, but that's platform dependent and you may have problems on some file system types (such as a mounted remote file system). multiprocessing.Manager uses a background server process for synchronization, so I'm not sure how efficient its Lock is, but it's good enough here: the locking will only account for a minor fraction of the total cost.
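For reference, here is a minimal sketch of that idea, taking an advisory flock directly on the output CSV instead of a separate lock file. It uses the standard-library fcntl module, so it is POSIX-only (no Windows) and may misbehave on remote file systems, as noted above; the column names 'A' and 'B' are just the placeholders from the question.

import fcntl
import json
import os
import pandas as pd

def update_csv_flock(args):
    # drop-in replacement for update_csv that locks the CSV file itself,
    # so no Manager process or shared lock object is needed
    infile, outfile = args
    with open(infile) as f:
        data = json.load(f)
    df = pd.json_normalize(data).drop(columns=['A']).rename(columns={'B': 'Date'})
    with open(outfile, mode="a", newline="") as f:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)   # blocks until the lock is ours
        try:
            f.seek(0, os.SEEK_END)               # re-check the file size under the lock
            df.to_csv(f, header=f.tell() == 0, index=False)
            f.flush()
        finally:
            fcntl.flock(f.fileno(), fcntl.LOCK_UN)

With this variant, pool.map only needs (infile, outfile) pairs instead of (lock, infile, outfile) tuples.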




Answer 2:


One important issue comes from the dataframe appending, which runs in O(n^2) overall: for each new JSON file processed, finalDf is copied in its entirety!

Here is a modified version running in O(n) time:

%%time
rootdir = '/path/foldername'
all_files = Path(rootdir).rglob('*.json')
allDf = []
for filename in all_files:
    with open(filename, 'r') as f:
        data = json.load(f)
        df = pd.json_normalize(data).drop(columns=['A']).rename(columns={'B': 'Date'})
        allDf.append(df)
# a single concat copies the data only once instead of once per file
finalDf = pd.concat(allDf, ignore_index=True)

If this is not enough, the JSON parsing and pandas post-processing could be executed in parallel using the multiprocessing module.
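A minimal sketch of that idea is shown below: each worker parses and normalizes one file, and the parent process concatenates the returned frames. The chunksize and pool size are guesses, and 'A'/'B' are again the placeholder column names from the question. Note that the DataFrames are pickled when they travel back from the workers, so the gain depends on how expensive json.load/json_normalize are relative to that overhead.

import json
import multiprocessing as mp
from pathlib import Path
import pandas as pd

def load_one(filename):
    # runs in a worker process: parse and normalize a single JSON file
    with open(filename) as f:
        data = json.load(f)
    return pd.json_normalize(data).drop(columns=['A']).rename(columns={'B': 'Date'})

if __name__ == "__main__":
    rootdir = '/path/foldername'
    all_files = [str(p) for p in Path(rootdir).rglob('*.json')]
    with mp.Pool(mp.cpu_count() - 1) as pool:
        allDf = pool.map(load_one, all_files, chunksize=16)
    finalDf = pd.concat(allDf, ignore_index=True)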



Source: https://stackoverflow.com/questions/62905750/optimization-of-json-load-to-reduce-in-memory-usage-and-time-in-python
