How to build a pandas dataframe (or dict) in an efficient way by selecting some lists of data from another bigger dataframe?

执念已碎 2020-12-22 05:19

I need to create a DataFrame or a dictionary. If N = 3 (the number of lists inside the outer list), the expected output is this:

d = {
    'xs0': [[7.0,

1 Answer
  • 2020-12-22 05:34

    Depending on your input and expected output (is each key really meant to hold the same pair of lists three times?), you can at least replace your for p in plots loop with:

    d = {}  # skip this line if d is already defined earlier in your code
    for n, p in enumerate(plots):
        # Keep only the columns needed for this plot
        df = bigger_df.filter([p['x'], p['y'], 'S', 'F', 'C'])
        # Filter the rows once; 123 stands in for an integer variable
        df = df[df['F'].isin([2, 3, 4, 9]) & df[p['x']].notnull() & df[p['y']].notnull() & (df['S'] == 123)]
        # Sort without inplace=True to avoid a SettingWithCopyWarning on the filtered frame
        df = df.sort_values('C')
        # Fill the dictionary with N copies of each list
        d['xs{}'.format(n)] = [list(df[p['x']]) for _ in range(N)]
        d['ys{}'.format(n)] = [list(df[p['y']]) for _ in range(N)]
    

    This at least removes the inner for index in range(3) loop, so the same filtering of bigger_df is no longer repeated three times. Measured with timeit, the run time dropped from 210 ms with your code to 70.5 ms (about a third) with this version.
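
    A minimal sketch of how such a comparison could be set up with timeit. The toy bigger_df, the plots list, S_VALUE and both helper functions are assumptions for illustration only: build_repeated is just a guess at the shape of the original loop (re-filtering for every copy), and the timings on real data will differ from anything this toy frame shows.

```python
import timeit

import pandas as pd

# Hypothetical stand-ins for the question's data (not from the original post).
bigger_df = pd.DataFrame({
    "S": [123, 123, 123, 145, 123],
    "F": [2, 3, 5, 2, 9],
    "C": [3, 1, 2, 1, 2],
    "a": [7.0, 8.0, 9.0, 1.0, None],
    "b": [0.1, 0.2, 0.3, 0.4, 0.5],
})
plots = [{"x": "a", "y": "b"}]
N = 3
S_VALUE = 123


def select(df, p):
    """Filter the rows for one plot and sort them by C."""
    mask = (
        df["F"].isin([2, 3, 4, 9])
        & df[p["x"]].notnull()
        & df[p["y"]].notnull()
        & (df["S"] == S_VALUE)
    )
    return df[mask].sort_values("C")


def build_repeated():
    """Assumed shape of the slow version: filter again for every copy."""
    d = {}
    for n, p in enumerate(plots):
        d[f"xs{n}"] = [list(select(bigger_df, p)[p["x"]]) for _ in range(N)]
        d[f"ys{n}"] = [list(select(bigger_df, p)[p["y"]]) for _ in range(N)]
    return d


def build_once():
    """Filter once per plot, then copy the resulting lists N times."""
    d = {}
    for n, p in enumerate(plots):
        df = select(bigger_df, p)
        d[f"xs{n}"] = [list(df[p["x"]]) for _ in range(N)]
        d[f"ys{n}"] = [list(df[p["y"]]) for _ in range(N)]
    return d


# Both versions must produce the same dictionary before timing means anything.
assert build_repeated() == build_once()

t_repeated = timeit.timeit(build_repeated, number=20)
t_once = timeit.timeit(build_once, number=20)
print(f"repeated: {t_repeated:.4f} s, once: {t_once:.4f} s")
```

    Checking that both functions return the same result first keeps the timing comparison honest.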

    EDIT: given how you have redefined your question, I think this might do the job you want:

    # Put this code after the definition of plots
    import pandas as pd  # if not already imported

    s_list = [123, 145, 35]
    # Create an empty DataFrame to hold the results of the loop
    df_output = pd.DataFrame(index=s_list, columns=['xs0', 'ys0', 'xs1', 'ys1', 'xs2', 'ys2'])
    for n, p in enumerate(plots):
        # Select the rows you want and sort them in the same statement
        mask = (bigger_df['F'].isin([2, 3, 4, 9]) & bigger_df[p['x']].notnull()
                & bigger_df[p['y']].notnull() & bigger_df['S'].isin(s_list))
        df_p = bigger_df[mask].sort_values('C')
        # If the isin filters on F and S are the same for all three plots, it might be
        # faster to build a df_select_FS once, before the loop, and filter that instead

        # Group by S, collect the values of column p['x'] (and p['y']) into one list
        # per group, and store them in the matching column of df_output
        df_output['xs{}'.format(n)] = df_p.groupby('S')[p['x']].apply(list)
        df_output['ys{}'.format(n)] = df_p.groupby('S')[p['y']].apply(list)
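
    The df_select_FS idea from the comments above could look roughly like this. It is only a sketch: the toy bigger_df, its columns a, b, c and the two-entry plots list are made up for illustration, and whether hoisting the filter is actually faster depends on your data.

```python
import pandas as pd

# Hypothetical stand-ins for the question's data (not from the original post).
bigger_df = pd.DataFrame({
    "S": [123, 123, 35, 35, 145],
    "F": [2, 3, 4, 9, 7],
    "C": [2, 1, 1, 2, 1],
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [10.0, 20.0, 30.0, 40.0, 50.0],
    "c": [0.5, None, 1.5, 2.5, 3.5],
})
plots = [{"x": "a", "y": "b"}, {"x": "a", "y": "c"}]
s_list = [123, 145, 35]

# The F and S conditions do not depend on the plot, so apply them (and the
# sort on C) once, before the loop.
df_select_FS = bigger_df[
    bigger_df["F"].isin([2, 3, 4, 9]) & bigger_df["S"].isin(s_list)
].sort_values("C")

columns = [f"{axis}s{n}" for n in range(len(plots)) for axis in ("x", "y")]
df_output = pd.DataFrame(index=s_list, columns=columns)

for n, p in enumerate(plots):
    # Only the per-plot null checks remain inside the loop.
    df_p = df_select_FS[
        df_select_FS[p["x"]].notnull() & df_select_FS[p["y"]].notnull()
    ]
    grouped = df_p.groupby("S")
    # Assignment aligns on the index, so values of S with no rows stay NaN.
    df_output[f"xs{n}"] = grouped[p["x"]].apply(list)
    df_output[f"ys{n}"] = grouped[p["y"]].apply(list)

print(df_output)
```

    Filtering out null rows preserves the existing row order, so sorting once before the loop gives the same order within each group as sorting inside it.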
    

    Two notes. First, if s_list contains the same value twice, this might not work the way you want. Second, where the conditions are not met (like 145 in S in your example), you get NaN in df_output.
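
    If either point is a problem, one possible workaround (an assumption on my side, not something the question asked for) is to deduplicate s_list before building df_output and to replace the NaN cells with empty lists afterwards. The lists_by_s Series below is a made-up stand-in for the result of one groupby step.

```python
import pandas as pd

s_list = [123, 145, 35, 123]  # 123 appears twice
# dict.fromkeys keeps the first occurrence of each value, in order.
unique_s = list(dict.fromkeys(s_list))

# Hypothetical result of one groupby step: no entry for S == 145.
lists_by_s = pd.Series({123: [2.0, 1.0], 35: [3.0, 4.0]})

df_output = pd.DataFrame(index=unique_s, columns=["xs0"])
df_output["xs0"] = lists_by_s  # aligned on the index; 145 becomes NaN

# Remember which rows had no data, then swap the NaN cells for empty lists.
missing = df_output["xs0"].isna()
df_output["xs0"] = df_output["xs0"].apply(
    lambda v: v if isinstance(v, list) else []
)

print(df_output)
print(missing)
```

    Keeping the missing mask around lets you tell later whether an empty list means "no matching rows" or was already empty.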
