Loading Large Twitter JSON Data (7GB+) into Python

Asked by 长发绾君心 on 2021-01-14 22:24

I've set up a public stream via AWS to collect tweets and now want to do some preliminary analysis. All my data is stored in an S3 bucket (in 5 MB files).

I downloaded the files and combined them into a single large JSON file (7GB+), but I can't load it into memory with json.load(). How should I read this data into Python for analysis?

2 Answers
  • 2021-01-14 22:37

    Instead of having the entire file as one JSON object, put one JSON object per line (the JSON Lines format) for large datasets!

    To fix the formatting, you should (a sketch automating these steps follows the list):

    1. Remove the [ at the start of the file
    2. Remove the ] at the end of the file
    3. Remove the comma at the end of each line
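
    If you'd rather not hand-edit a 7 GB file, the same cleanup can be done as a stream. This is a minimal sketch of those three steps; the input and output filenames are placeholders of my own, not from the question:

    # Convert a JSON array file (one object per line, per the steps above)
    # into JSON Lines by dropping the brackets and trailing commas.
    with open('tweets_array.json', 'r') as src, open('one_json_per_line.txt', 'w') as dst:
        for line in src:
            line = line.strip()
            if line in ('[', ']', ''):
                continue                        # steps 1 & 2: skip the brackets
            dst.write(line.rstrip(',') + '\n')  # step 3: strip the trailing comma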

    Then you can read the file like so:

    import json

    with open('one_json_per_line.txt', 'r') as infile:
        for line in infile:
            data_row = json.loads(line)  # one tweet per line


    I would suggest using different storage if possible; SQLite comes to mind.
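
    As a rough illustration of the SQLite suggestion (mine, not the answerer's; the schema and the id_str field are assumptions about standard tweet objects):

    import json
    import sqlite3

    # Load the JSON Lines file into a small on-disk database so later
    # analysis can query it instead of re-parsing 7 GB of text.
    conn = sqlite3.connect('tweets.db')
    conn.execute('CREATE TABLE IF NOT EXISTS tweets (id TEXT PRIMARY KEY, raw TEXT)')
    with open('one_json_per_line.txt', 'r') as infile:
        rows = ((json.loads(line)['id_str'], line.strip()) for line in infile)
        conn.executemany('INSERT OR IGNORE INTO tweets VALUES (?, ?)', rows)
    conn.commit()
    conn.close()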

  • 2021-01-14 22:50

    I'm a VERY new user, but I might be able to offer a partial solution. I believe your formatting is off: you can't just import the data as JSON unless it is valid JSON. You should be able to fix this by getting the tweets into a DataFrame (or several DataFrames) and then using the DataFrame.to_json method. You will need pandas if it isn't already installed.

    Pandas - http://pandas.pydata.org/pandas-docs/stable/10min.html

    DataFrame.to_json - http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html
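
    A hedged sketch of that approach (the stand-in data, filenames, and options are my assumptions, not from the answer):

    import pandas as pd

    # Stand-in data: in practice this DataFrame would hold your tweets.
    df = pd.DataFrame([{'id_str': '1', 'text': 'hello world'}])

    # DataFrame.to_json writes valid JSON; orient='records' with lines=True
    # emits one object per line, so a large file can be read back in chunks.
    df.to_json('tweets_fixed.json', orient='records', lines=True)
    round_trip = pd.read_json('tweets_fixed.json', lines=True)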
