What is the fastest and most efficient way to append rows to a DataFrame?

前端未结


I have a large dataset that I have to convert to .csv format. It has 29 columns and more than a million rows. I am using Python and a pandas DataFrame to handle this job. I ...

1 answer

孤独总比滥情好

2020-12-09 14:05

As Mohit Motwani suggested, the fastest way is to collect the rows as dictionaries in a list and then load them all into a data frame at once. Below are some speed measurements, each for 10,000 rows of 30 columns:

import pandas as pd
import numpy as np
import time
import random

end_value = 10000

Measurement for building a list of dictionaries and loading it all into a data frame at the end:

start_time = time.time()
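dictionary_list = []
for i in range(0, end_value, 1):
    # One dictionary per row, keyed by column label 0..29
    dictionary_data = {k: random.random() for k in range(30)}
    dictionary_list.append(dictionary_data)

# Build the data frame once, from the complete list of row dictionaries
df_final = pd.DataFrame(dictionary_list)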

end_time = time.time()
print('Execution time = %.6f seconds' % (end_time-start_time))

Execution time = 0.090153 seconds
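
For the use case in the question (many rows that end up in a .csv file), the same idea can be applied in batches so that not every row has to sit in memory at once. The following is only a minimal sketch under assumptions: the function name write_rows_in_batches, the column names col_0 .. col_28, the batch size, and the output path are all hypothetical and do not come from the question.

```python
import os
import random
import tempfile

import pandas as pd


def write_rows_in_batches(rows, path, batch_size=100_000):
    """Write an iterable of row dicts to a CSV file, one batch at a time."""
    batch = []
    header_written = False
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            # Build one DataFrame per batch, never one per row.
            pd.DataFrame(batch).to_csv(
                path, mode="a", header=not header_written, index=False
            )
            header_written = True
            batch = []
    if batch:
        # Flush the final, possibly partial, batch.
        pd.DataFrame(batch).to_csv(
            path, mode="a", header=not header_written, index=False
        )


# Hypothetical data: 29 columns named col_0 .. col_28, 25 rows.
columns = [f"col_{k}" for k in range(29)]
rows = ({c: random.random() for c in columns} for _ in range(25))

csv_path = os.path.join(tempfile.mkdtemp(), "output.csv")
write_rows_in_batches(rows, csv_path, batch_size=10)

result = pd.read_csv(csv_path)
print(result.shape)  # (25, 29)
```

Because the file is opened in append mode, the target path should not already exist when the function is called; otherwise the new rows land after whatever was there before.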

Measurement for appending one-row data frames to a list and concatenating them at the end:

start_time = time.time()
appended_data = []
for i in range(0, end_value, 1):
    # Note: list('A'*30) yields 30 columns that are all named 'A' (the later snippets do the same)
    data = pd.DataFrame(np.random.randint(0, 100, size=(1, 30)), columns=list('A'*30))
    appended_data.append(data)

appended_data = pd.concat(appended_data, axis=0)

end_time = time.time()
print('Execution time = %.6f seconds' % (end_time-start_time))

Execution time = 4.183921 seconds
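
Most of the cost in the concat variant probably comes from constructing 10,000 tiny data frames, not from the final concat call. If the rows are already NumPy arrays, one possible alternative (not part of the measurements above, and not timed here) is to collect the raw arrays and stack them once. The column names c0 .. c29 and the row count in this sketch are assumptions.

```python
import numpy as np
import pandas as pd

n_rows, n_cols = 1000, 30
rng = np.random.default_rng(0)

# Collect plain NumPy rows instead of one-row DataFrames.
collected = []
for _ in range(n_rows):
    collected.append(rng.integers(0, 100, size=n_cols))

# Stack once, then wrap the 2-D array in a single DataFrame.
stacked = np.vstack(collected)
df_stacked = pd.DataFrame(stacked, columns=[f"c{k}" for k in range(n_cols)])
print(df_stacked.shape)  # (1000, 30)
```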

Measurement for appending data frames one at a time with DataFrame.append (deprecated in pandas 1.4 and removed in pandas 2.0):

start_time = time.time()
df_final = pd.DataFrame()
for i in range(0, end_value, 1):
    df = pd.DataFrame(np.random.randint(0, 100, size=(1, 30)), columns=list('A'*30))
    # Requires pandas < 2.0; every call copies the whole frame accumulated so far
    df_final = df_final.append(df)

end_time = time.time()
print('Execution time = %.6f seconds' % (end_time-start_time))

Execution time = 11.085888 seconds

Measurement for inserting rows one at a time with loc:

start_time = time.time()
df = pd.DataFrame(columns=list('A'*30))
for i in range(0, end_value, 1):
    df.loc[i] = list(np.random.randint(0, 100, size=30))


end_time = time.time()
print('Execution time = %.6f seconds' % (end_time-start_time))

Execution time = 21.029176 seconds
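
When the number of rows is known in advance and the data is numeric, growing a frame row by row with loc can be avoided by filling a preallocated NumPy array and converting it once at the end. This is a sketch of that alternative, not one of the measured variants; the column names c0 .. c29, the int64 dtype, and the row count are assumptions.

```python
import numpy as np
import pandas as pd

n_rows, n_cols = 1000, 30
rng = np.random.default_rng(0)

# Allocate the full buffer up front; every cell is overwritten in the loop.
buffer = np.empty((n_rows, n_cols), dtype=np.int64)
for i in range(n_rows):
    buffer[i] = rng.integers(0, 100, size=n_cols)

# Convert to a DataFrame only once, after all rows are in place.
df_prealloc = pd.DataFrame(buffer, columns=[f"c{k}" for k in range(n_cols)])
print(df_prealloc.shape)  # (1000, 30)
```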
