How to properly iterate through a big json file

慢半拍i 2020-12-16 02:12

Dear Stackoverflow community,

I have a 34 GB JSON file with a lot of data in it. I tried to import it into MongoDB using mongoimport --file file.json - but it fa

1 answer
  • 2020-12-16 02:26

    You'll want to use a streaming parser. These only pull small portions of your file into memory at a time.
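
    To make the "small portions at a time" idea concrete, here is a minimal sketch that counts the objects in a top-level JSON array while holding only one fixed-size chunk in memory. The function name countTopLevelObjects and the 8192-byte default are assumptions made for illustration; it does no validation and is not a substitute for a real streaming parser.

```php
<?php
// Hypothetical helper: count objects that sit directly inside a top-level
// JSON array, reading the file in fixed-size chunks. Does not validate JSON.
function countTopLevelObjects(string $path, int $chunkSize = 8192): int
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $depth = 0;        // current nesting level of {} and []
    $inString = false; // currently inside a JSON string literal?
    $escaped = false;  // previous character was a backslash inside a string?
    $count = 0;

    try {
        while (!feof($handle)) {
            $chunk = fread($handle, $chunkSize);
            if ($chunk === false) {
                throw new RuntimeException("Read error on $path");
            }
            $length = strlen($chunk);
            for ($i = 0; $i < $length; $i++) {
                $char = $chunk[$i];
                if ($inString) {
                    // Braces inside strings must not affect the depth.
                    if ($escaped) {
                        $escaped = false;
                    } elseif ($char === '\\') {
                        $escaped = true;
                    } elseif ($char === '"') {
                        $inString = false;
                    }
                } elseif ($char === '"') {
                    $inString = true;
                } elseif ($char === '{' || $char === '[') {
                    // Depth 1 means "directly inside the outer array".
                    if ($char === '{' && $depth === 1) {
                        $count++;
                    }
                    $depth++;
                } elseif ($char === '}' || $char === ']') {
                    $depth--;
                }
            }
        }
    } finally {
        fclose($handle);
    }

    return $count;
}
```

    The scanner state ($depth, $inString, $escaped) lives outside the read loop, so a token that straddles a chunk boundary is still handled correctly. Run against the sample data.json shown later in this answer, it should count 2 records.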

    They come in two flavors: SAX-like push parsers and pull parsers. The article "XML reader models: SAX versus XML pull parser" gives an overview of the difference.
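
    As a rough illustration of the two flavors, here is a toy sketch that uses a hard-coded token list instead of a real JSON tokenizer. The names pushParse and pullTokens and the token format are made up for this example and do not come from either library used in this answer.

```php
<?php
// Toy token stream standing in for a real JSON tokenizer (hypothetical data).
$tokens = [
    ['key', 'summonerId'],
    ['value', 24570940],
    ['key', 'region'],
    ['value', 'euw'],
];

// Push style: the parser owns the loop and calls back into your code.
function pushParse(array $tokens, callable $onToken): void
{
    foreach ($tokens as [$type, $payload]) {
        $onToken($type, $payload);
    }
}

// Pull style: your code owns the loop and asks for the next token when ready.
function pullTokens(array $tokens): Generator
{
    foreach ($tokens as $token) {
        yield $token;
    }
}

// Push: state has to live outside the callback (here, captured by reference).
$pushed = [];
$lastKey = null;
pushParse($tokens, function (string $type, $payload) use (&$pushed, &$lastKey) {
    if ($type === 'key') {
        $lastKey = $payload;
    } elseif ($lastKey === 'summonerId') {
        $pushed[] = $payload;
    }
});

// Pull: the caller can advance, inspect, and stop early.
$pulled = [];
$reader = pullTokens($tokens);
while ($reader->valid()) {
    [$type, $payload] = $reader->current();
    $reader->next();
    if ($type === 'key' && $payload === 'summonerId') {
        [, $value] = $reader->current(); // the value token follows its key
        $pulled[] = $value;
        break; // stop early; the remaining tokens are never requested
    }
}
```

    Both halves collect the same summonerId. The difference is who drives: with push you react to events and carry state between callbacks, with pull you step through the stream yourself and can stop whenever you have what you need.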


    Push Parser

    This is a quick example using salsify/json-streaming-parser.

    As it rolls through the file we keep track of the current summonerId and championId, plus whether we are inside a stats object. It's all event-based: a sequential parser gives you no random access, so you have to track that state yourself. Every time a totalSessionsPlayed value comes up, it echoes the summonerId, championId, and totalSessionsPlayed.


    data.json

    This is a pared-down JSON file for demonstration purposes.

    [
        {
            "_id": "53b29644aafd413977b23b7e",
            "summonerId": 24570940,
            "region": "euw",
            "stats": {
                "110": {
                    "totalSessionsPlayed": 3,
                    "totalSessionsLost": 2,
                    "totalSessionsWon": 1
                },
                "112": {
                    "totalSessionsPlayed": 45,
                    "totalSessionsLost": 2,
                    "totalSessionsWon": 1
                }
            }
        },
        {
            "_id": "asdfasdfasdf",
            "summonerId": 555555,
            "region": "euw",
            "stats": {
                "42": {
                    "totalSessionsPlayed": 65,
                    "totalSessionsLost": 2,
                    "totalSessionsWon": 1
                },
                "88": {
                    "totalSessionsPlayed": 99,
                    "totalSessionsLost": 2,
                    "totalSessionsWon": 1
                }
            }
        }
    ]
    

    Example:

    class ListMatchUps extends JsonStreamingParser\Listener\IdleListener
    {
    
        private $key;
        private $summonerId;
        private $championId;
        private $inStats;
    
        public function start_document()
        {
            $this->key        = null;
            $this->summonerId = null;
            $this->championId = null;
            $this->inStats    = false;
        }
    
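        public function start_object()
        {
            // $this->key holds the most recent key seen before this "{".
            if ($this->key === 'stats') {
                // Entering the "stats" object of the current record.
                $this->inStats = true;
            } elseif ($this->inStats) {
                // Objects directly inside "stats" are keyed by champion ID.
                $this->championId = $this->key;
            }
        }
    
        public function end_object()
        {
            // Unwind the state in reverse order of nesting.
            if ($this->championId !== null) {
                // Closing a champion's object.
                $this->championId = null;
            } elseif ($this->inStats) {
                // Closing the "stats" object.
                $this->inStats = false;
            } else {
                // Closing the record itself.
                $this->summonerId = null;
            }
        }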
    
        public function key($key)
        {
            $this->key = $key;
        }
    
        public function value($value)
        {
            switch ($this->key) {
                case 'summonerId':
                    $this->summonerId = $value;
                    break;
                case 'totalSessionsPlayed':
                    echo "{$this->summonerId},{$this->championId},$value\n";
                    break;
            }
        }
    }
    
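    $stream = fopen('data.json', 'r');
    $listener = new ListMatchUps();
    try {
        // Class names follow the answer as posted; they differ between
        // releases of the library, so check them against your installed version.
        $parser = new JsonStreamingParser_Parser($stream, $listener);
        $parser->parse();
    } finally {
        // Close the stream whether parsing succeeds or throws.
        fclose($stream);
    }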
    

    Output:

    24570940,110,3
    24570940,112,45
    555555,42,65
    555555,88,99
    

    Pull Parser

    This uses a parser I recently wrote, pcrov/jsonreader (requires PHP 7).

    Same data.json as above.

    Example:

    use pcrov\JsonReader\JsonReader;
    
    $reader = new JsonReader();
    $reader->open("data.json");
    
    while($reader->read("summonerId")) {
        $summonerId = $reader->value();
        $reader->next("stats");
        foreach($reader->value() as $championId => $stats) {
            echo "$summonerId, $championId, {$stats['totalSessionsPlayed']}\n";
        }
    }
    $reader->close();
    

    Output:

    24570940, 110, 3
    24570940, 112, 45
    555555, 42, 65
    555555, 88, 99
    