What is the fastest and most efficient way to append rows to a DataFrame?

野趣味 2020-12-09 13:43

I have a large dataset that I need to convert to .csv format. It has 29 columns and more than a million rows. I am using Python and a pandas DataFrame to handle this job. I

1 answer
  • 2020-12-09 14:05

    As Mohit Motwani suggested, the fastest way is to collect the rows in a list of dictionaries and then load them all into a DataFrame at once. Some speed measurements follow:

    import pandas as pd
    import numpy as np
    import time
    import random
    
    end_value = 10000
    

    Timing for collecting the rows as dictionaries and loading them into a DataFrame at the end:

    start_time = time.time()
    dictionary_list = []
    for i in range(end_value):
        # One dictionary per row, with 30 random values keyed 0..29
        dictionary_data = {k: random.random() for k in range(30)}
        dictionary_list.append(dictionary_data)
    
    # Build the DataFrame once from the full list of row dictionaries
    df_final = pd.DataFrame(dictionary_list)
    
    end_time = time.time()
    print('Execution time = %.6f seconds' % (end_time-start_time))
    

    Execution time = 0.090153 seconds

    Timing for appending single-row DataFrames to a list and concatenating them at the end:

    start_time = time.time()
    appended_data = []
    for i in range(end_value):
        # One single-row DataFrame per iteration; columns are labelled 0..29
        data = pd.DataFrame(np.random.randint(0, 100, size=(1, 30)), columns=range(30))
        appended_data.append(data)
    
    # Concatenate all single-row frames in one call
    df_final = pd.concat(appended_data, axis=0)
    
    end_time = time.time()
    print('Execution time = %.6f seconds' % (end_time-start_time))
    

    Execution time = 4.183921 seconds
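Most of the cost in the list-and-concat variant comes from constructing 10000 tiny DataFrames. When the rows are already numeric arrays, one possible alternative (not one of the measurements in this answer, and not timed here) is to collect the raw NumPy arrays and construct the DataFrame once. The `n_cols` variable and the `col_0`..`col_29` labels are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

end_value = 10000
n_cols = 30  # assumed width, matching the measurements in this answer

rows = []
for i in range(end_value):
    # Keep each row as a plain NumPy array instead of a one-row DataFrame
    rows.append(np.random.randint(0, 100, size=n_cols))

# Stack the rows into one 2-D array and build the DataFrame a single time
column_names = [f"col_{k}" for k in range(n_cols)]  # hypothetical labels
df_final = pd.DataFrame(np.vstack(rows), columns=column_names)

print(df_final.shape)
```

This only applies when every row has the same length and a compatible dtype; for mixed or named fields the list of dictionaries is the more natural fit.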

    Timing for appending DataFrames one at a time:

    start_time = time.time()
    df_final = pd.DataFrame()
    for i in range(end_value):
        df = pd.DataFrame(np.random.randint(0, 100, size=(1, 30)), columns=range(30))
        # DataFrame.append returns a new frame each time. It was deprecated in
        # pandas 1.4 and removed in 2.0; there, use pd.concat([df_final, df]).
        df_final = df_final.append(df)
    
    end_time = time.time()
    print('Execution time = %.6f seconds' % (end_time-start_time))
    

    Execution time = 11.085888 seconds
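A rough way to see why appending in a loop scales badly: appending returns a new DataFrame rather than growing the old one in place, so each step copies the rows accumulated so far. The sketch below only counts rows under that assumption. It is back-of-the-envelope arithmetic with hypothetical helper names, not a model of pandas internals or a prediction of wall-clock time:

```python
def rows_copied_by_repeated_append(n_rows: int) -> int:
    """Existing rows copied in total if every append rebuilds the frame."""
    # Before the i-th append (0-based) the frame holds i rows, all of which
    # are copied into the new object: 0 + 1 + ... + (n_rows - 1).
    return n_rows * (n_rows - 1) // 2


def rows_copied_by_single_build(n_rows: int) -> int:
    """Rows copied when everything is collected first and loaded once."""
    return n_rows


if __name__ == "__main__":
    for n in (10_000, 1_000_000):
        print(n, rows_copied_by_repeated_append(n), rows_copied_by_single_build(n))
```

The count grows quadratically with the number of rows for repeated appends and linearly for a single build, which is why the gap widens at the million-row scale mentioned in the question.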

    Timing for inserting rows one at a time with loc:

    start_time = time.time()
    # Start from an empty frame with 30 columns labelled 0..29
    df = pd.DataFrame(columns=range(30))
    for i in range(end_value):
        # Setting a new index label with loc enlarges the frame by one row
        df.loc[i] = list(np.random.randint(0, 100, size=30))
    
    end_time = time.time()
    print('Execution time = %.6f seconds' % (end_time-start_time))
    

    Execution time = 21.029176 seconds
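Since the question is about producing a .csv file, here is a minimal sketch of the fastest pattern above carried through to the export step. The `rows_to_csv` helper, the `col_N` field names, and the generated sample rows are hypothetical stand-ins for the real data source; an in-memory buffer is used in place of a file path so the example stays self-contained:

```python
import io
import random

import pandas as pd


def rows_to_csv(row_dicts, target):
    """Build one DataFrame from an iterable of row dicts and write it as CSV."""
    # Materialise the rows first, then construct the DataFrame a single time
    df = pd.DataFrame(list(row_dicts))
    # index=False leaves the row index out of the CSV output
    df.to_csv(target, index=False)
    return df


# Hypothetical input: 1000 rows with 29 fields each, as in the question
rows = ({f"col_{k}": random.random() for k in range(29)} for _ in range(1000))

buffer = io.StringIO()  # stand-in for a real file path or open file handle
df = rows_to_csv(rows, buffer)
print(df.shape)
```

`target` can also be a file path string, since `DataFrame.to_csv` accepts either a path or a file-like object.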
