A better way to load MongoDB data to a DataFrame using Pandas and PyMongo?

前端未结

关注

 4  1169

故里飘歌 2020-12-29 13:59

I have a 0.7 GB MongoDB database containing tweets that I\'m trying to load into a dataframe. However, I get an error.

MemoryError:

4条回答

死守一世寂寞 (楼主)

2020-12-29 14:19

an elegant way of doing it would be as follows:

import pandas as pd
def my_transform_logic(x):
    if x :
        do_something
        return result

def process(cursor):
    df = pd.DataFrame(list(cursor))
    df['result_col'] = df['col_to_be_processed'].apply(lambda value: my_transform_logic(value))

    #making list off dictionaries
    db.collection_name.insert_many(final_df.to_dict('records'))

    # or update
    db.collection_name.update_many(final_df.to_dict('records'),upsert=True)


#make a list of cursors.. you can read the parallel_scan api of pymongo

cursors = mongo_collection.parallel_scan(6)
for cursor in cursors:
    process(cursor)

I tried the above process on a mongoDB collection with 2.6 million records using Joblib on the above code. My code didnt throw any memory errors and the processing finished in 2 hrs.

0 讨论(0)

查看其它4个回答