Python - How to stream large (11 gb) JSON file to be broken up [duplicate]

Submitted by て烟熏妆下的殇ゞ on 2019-12-08 05:13:35

Question


I have a very large JSON file (11 GB) that is too large to read into memory. I would like to break it up into smaller files so that I can analyze the data. I am currently using Python and Pandas for the analysis, and I am wondering whether there is some way to access the file in chunks, so that each chunk can be read into memory without crashing the program. Ideally, I would like to break the years' worth of data into smaller, manageable files that each span about a week. However, the amount of data per week is not constant, and it does not matter much whether the files cover a set interval.

Here is the data format:

{
"actor" : 
{
    "classification" : [ "suggested" ],
    "displayName" : "myself",
    "followersCount" : 0,
    "followingCount" : 0,
    "followingStocksCount" : 0,
    "id" : "person:stocktwits:183087",
    "image" : "http://avatars.stocktwits.com/production/183087/thumb-1350332393.png",
    "link" : "http://stocktwits.com/myselfbtc",
    "links" : 
    [

        {
            "href" : null,
            "rel" : "me"
        }
    ],
    "objectType" : "person",
    "preferredUsername" : "myselfbtc",
    "statusesCount" : 2,
    "summary" : null,
    "tradingStrategy" : 
    {
        "approach" : "Technical",
        "assetsFrequentlyTraded" : [ "Forex" ],
        "experience" : "Novice",
        "holdingPeriod" : "Day Trader"
    }
},
"body" : "$BCOIN and macd is going down ..... http://stks.co/iDEB",
"entities" : 
{
    "chart" : 
    {
        "fullImage" : 
        {
            "link" : "http://charts.stocktwits.com/production/original_10047145.png"
        },
        "image" : 
        {
            "link" : "http://charts.stocktwits.com/production/small_10047145.png"
        },
        "link" : "http://stks.co/iDEB",
        "objectType" : "image"
    },
    "sentiment" : 
    {
        "basic" : "Bearish"
    },
    "stocks" : 
    [

        {
            "displayName" : "Bitcoin",
            "exchange" : "PRIVATE",
            "industry" : null,
            "sector" : null,
            "stocktwits_id" : 9659,
            "symbol" : "BCOIN"
        }
    ],
    "video" : null
},
"gnip" : 
{
    "language" : 
    {
        "value" : "en"
    }
},
"id" : "tag:gnip.stocktwits.com:2012:note/10047145",
"inReplyTo" : 
{
    "id" : "tag:gnip.stocktwits.com:2012:note/10046953",
    "objectType" : "comment"
},
"link" : "http://stocktwits.com/myselfbtc/message/10047145",
"object" : 
{
    "id" : "note:stocktwits:10047145",
    "link" : "http://stocktwits.com/myselfbtc/message/10047145",
    "objectType" : "note",
    "postedTime" : "2012-10-17T19:13:50Z",
    "summary" : "$BCOIN and macd is going down ..... http://stks.co/iDEB",
    "updatedTime" : "2012-10-17T19:13:50Z"
},
"provider" : 
{
    "displayName" : "StockTwits",
    "link" : "http://stocktwits.com"
},
"verb" : "post"
}

Answer 1:


jq 1.5 has a streaming parser (documented at http://stedolan.github.io/jq/manual/#Streaming). In one sense it's easy to use: for example, if your 1 GB file is named 1G.json, then the following command will produce a stream of lines, including one line per "leaf" value:

jq -c --stream . 1G.json

(The output is shown below. Notice that each line is itself valid JSON.)

Using the streamed output may not be so easy, though; that depends on what you want to do :-)

The key to understanding the streamed output is that most lines have the form:

[ PATH, VALUE ]

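where "PATH" is an array representation of the path. (When using jq, this array can in fact be used as a path.) The remaining lines have the form [ PATH ] with no value: they mark the end of an array or object, and PATH points to the last item inside the container being closed. For the sample record in the question, the output is: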

[["actor","classification",0],"suggested"]
[["actor","classification",0]]
[["actor","displayName"],"myself"]
[["actor","followersCount"],0]
[["actor","followingCount"],0]
[["actor","followingStocksCount"],0]
[["actor","id"],"person:stocktwits:183087"]
[["actor","image"],"http://avatars.stocktwits.com/production/183087/thumb-1350332393.png"]
[["actor","link"],"http://stocktwits.com/myselfbtc"]
[["actor","links",0,"href"],null]
[["actor","links",0,"rel"],"me"]
[["actor","links",0,"rel"]]
[["actor","links",0]]
[["actor","objectType"],"person"]
[["actor","preferredUsername"],"myselfbtc"]
[["actor","statusesCount"],2]
[["actor","summary"],null]
[["actor","tradingStrategy","approach"],"Technical"]
[["actor","tradingStrategy","assetsFrequentlyTraded",0],"Forex"]
[["actor","tradingStrategy","assetsFrequentlyTraded",0]]
[["actor","tradingStrategy","experience"],"Novice"]
[["actor","tradingStrategy","holdingPeriod"],"Day Trader"]
[["actor","tradingStrategy","holdingPeriod"]]
[["actor","tradingStrategy"]]
[["body"],"$BCOIN and macd is going down ..... http://stks.co/iDEB"]
[["entities","chart","fullImage","link"],"http://charts.stocktwits.com/production/original_10047145.png"]
[["entities","chart","fullImage","link"]]
[["entities","chart","image","link"],"http://charts.stocktwits.com/production/small_10047145.png"]
[["entities","chart","image","link"]]
[["entities","chart","link"],"http://stks.co/iDEB"]
[["entities","chart","objectType"],"image"]
[["entities","chart","objectType"]]
[["entities","sentiment","basic"],"Bearish"]
[["entities","sentiment","basic"]]
[["entities","stocks",0,"displayName"],"Bitcoin"]
[["entities","stocks",0,"exchange"],"PRIVATE"]
[["entities","stocks",0,"industry"],null]
[["entities","stocks",0,"sector"],null]
[["entities","stocks",0,"stocktwits_id"],9659]
[["entities","stocks",0,"symbol"],"BCOIN"]
[["entities","stocks",0,"symbol"]]
[["entities","stocks",0]]
[["entities","video"],null]
[["entities","video"]]
[["gnip","language","value"],"en"]
[["gnip","language","value"]]
[["gnip","language"]]
[["id"],"tag:gnip.stocktwits.com:2012:note/10047145"]
[["inReplyTo","id"],"tag:gnip.stocktwits.com:2012:note/10046953"]
[["inReplyTo","objectType"],"comment"]
[["inReplyTo","objectType"]]
[["link"],"http://stocktwits.com/myselfbtc/message/10047145"]
[["object","id"],"note:stocktwits:10047145"]
[["object","link"],"http://stocktwits.com/myselfbtc/message/10047145"]
[["object","objectType"],"note"]
[["object","postedTime"],"2012-10-17T19:13:50Z"]
[["object","summary"],"$BCOIN and macd is going down ..... http://stks.co/iDEB"]
[["object","updatedTime"],"2012-10-17T19:13:50Z"]
[["object","updatedTime"]]
[["provider","displayName"],"StockTwits"]
[["provider","link"],"http://stocktwits.com"]
[["provider","link"]]
[["verb"],"post"]
[["verb"]]
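
One way to consume these lines from Python is sketched below. It assumes the jq output arrives as one event per line and that every record is a top-level object, so a closing event whose path has a single element marks the end of a record. The function name records_from_stream and the flattened path-to-value dictionary are illustrative choices, not part of jq.

```python
import json


def records_from_stream(lines):
    """Group `jq -c --stream` lines into one dict per top-level object.

    Each yielded dict maps a path tuple, e.g. ("object", "postedTime"),
    to the leaf value found at that path.
    """
    leaves = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if len(event) == 2:
            # [PATH, VALUE]: a leaf value.
            path, value = event
            leaves[tuple(path)] = value
        elif len(event) == 1 and len(event[0]) == 1:
            # [PATH] with a one-element path: the top-level object just closed
            # (assumption: records are top-level objects, as in the sample).
            yield leaves
            leaves = {}
        # Other [PATH] events close nested containers and are ignored here.


# A shortened excerpt of the output shown above.
sample = [
    '[["body"],"$BCOIN and macd is going down ..... http://stks.co/iDEB"]',
    '[["object","postedTime"],"2012-10-17T19:13:50Z"]',
    '[["object","updatedTime"],"2012-10-17T19:13:50Z"]',
    '[["object","updatedTime"]]',
    '[["verb"],"post"]',
    '[["verb"]]',
]

records = list(records_from_stream(sample))
posted = [r[("object", "postedTime")] for r in records]
```

Because only one record's leaves are held at a time, memory use stays bounded by the size of the largest single record rather than the size of the file.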



Answer 2:


I think you need something like a stream parser. ijson may work:

https://changelog.com/ijson-parse-streams-of-json-in-python/
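
ijson is a third-party package. As a rough illustration of the same streaming idea using only the standard library, the sketch below decodes one top-level object at a time with json.JSONDecoder.raw_decode and appends each record to a per-week JSON Lines file. This is not ijson's API. It assumes the input is a sequence of concatenated top-level objects like the sample in the question (not one enormous array) and that every record carries object.postedTime in the format shown. The names iter_json_objects, week_key, and split_by_week are hypothetical.

```python
import json
import os
from datetime import datetime


def iter_json_objects(fp, chunk_size=1 << 20):
    """Yield top-level JSON objects from a text file of concatenated objects."""
    decoder = json.JSONDecoder()
    buf = ""
    eof = False
    while not eof:
        chunk = fp.read(chunk_size)
        eof = not chunk
        buf += chunk
        pos = 0
        while True:
            # raw_decode does not skip leading whitespace, so do it here.
            while pos < len(buf) and buf[pos].isspace():
                pos += 1
            if pos == len(buf):
                break
            try:
                obj, pos = decoder.raw_decode(buf, pos)
            except json.JSONDecodeError:
                if eof:
                    raise  # leftover data that never became valid JSON
                break  # object is probably incomplete: read another chunk
            yield obj
        buf = buf[pos:]  # keep only the unparsed tail


def week_key(record):
    """Return an ISO year-week label such as '2012-W42' for one record."""
    posted = record["object"]["postedTime"]  # e.g. "2012-10-17T19:13:50Z"
    dt = datetime.strptime(posted, "%Y-%m-%dT%H:%M:%SZ")
    year, week, _ = dt.isocalendar()
    return f"{year}-W{week:02d}"


def split_by_week(fp, out_dir, chunk_size=1 << 20):
    """Append each record to <out_dir>/<week>.jsonl; return records per week."""
    counts = {}
    for record in iter_json_objects(fp, chunk_size):
        key = week_key(record)
        path = os.path.join(out_dir, f"{key}.jsonl")
        # Reopening per record is slow but keeps the sketch simple and
        # avoids holding one open file handle per week.
        with open(path, "a", encoding="utf-8") as out:
            out.write(json.dumps(record) + "\n")
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Called as split_by_week(open("big.json", encoding="utf-8"), "weeks"), with a hypothetical input path and an existing output directory, this should leave files such as 2012-W42.jsonl, each of which can then be loaded with pandas.read_json(path, lines=True).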



Source: https://stackoverflow.com/questions/31975345/python-how-to-stream-large-11-gb-json-file-to-be-broken-up
