I need to create a DataFrame or dictionary. If N = 3 (the number of lists inside the outer list), the expected output is this:
d = {
'xs0': [[7.0,
Depending on your input and your expected output (the same pair of values repeated three times in the list for each key?), you can at least replace your loop over plots (for p in plots) with:
for p in plots:
    # Select the data you want
    df = bigger_df.filter([p['x'], p['y'], 'S', 'F', 'C'])  # keep only the columns needed
    df = df[df['F'].isin([2, 3, 4, 9]) & df[p['x']].notnull() & df[p['y']].notnull() & (df.S == 123)]  # I have used 123 to simplify, actually the value is an integer variable
    df = df.sort_values('C')  # reassign instead of sorting a filtered slice in place
    # fill the dictionary
    d['xs{}'.format(n)] = [list(df[p['x']]) for _ in range(N)]
    d['ys{}'.format(n)] = [list(df[p['y']]) for _ in range(N)]
    n += 1
This at least saves you the inner loop (for index in range(3)), which repeated the same operation on bigger_df three times. Measured with timeit, the run time dropped from 210 ms with your code to 70.5 ms (about a third) with this version.
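To make the shape of the result concrete, here is a minimal self-contained sketch of the loop above on toy data. The contents of bigger_df, the column names A and B, and the single entry in plots are all made up for illustration; only the filtering and list-building logic follows the snippet.

```python
import pandas as pd

# Hypothetical toy data standing in for bigger_df (columns A and B are assumptions).
bigger_df = pd.DataFrame({
    'S': [123, 123, 123, 145],
    'F': [2, 3, 7, 2],
    'C': [2, 1, 3, 1],
    'A': [7.0, 8.0, 9.0, 1.0],
    'B': [0.5, 0.6, 0.7, 0.8],
})
plots = [{'x': 'A', 'y': 'B'}]  # hypothetical plot definition
N = 3

d = {}
for n, p in enumerate(plots):
    # Filter and sort once per plot instead of once per repetition.
    df = bigger_df[[p['x'], p['y'], 'S', 'F', 'C']]
    mask = (df['F'].isin([2, 3, 4, 9]) & df[p['x']].notnull()
            & df[p['y']].notnull() & (df['S'] == 123))
    df = df[mask].sort_values('C')
    # Each key holds N independent copies of the same filtered column.
    d['xs{}'.format(n)] = [list(df[p['x']]) for _ in range(N)]
    d['ys{}'.format(n)] = [list(df[p['y']]) for _ in range(N)]

print(d)
```

Using enumerate here replaces the manual n counter; the behaviour is otherwise the same as in the loop above.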
EDIT: given how you have redefined your question, I think this might do what you want:
# put this code after the definition of plots
s_list = [123, 145, 35]
# create an empty DF to add your results in the loop
df_output = pd.DataFrame(index=s_list, columns=['xs0', 'ys0', 'xs1', 'ys1', 'xs2', 'ys2'])
n = 0
for p in plots:
    # Select the data you want and sort them on the same line
    df_p = bigger_df[bigger_df['F'].isin([2, 3, 4, 9]) & bigger_df[p['x']].notnull() & bigger_df[p['y']].notnull() & bigger_df['S'].isin(s_list)].sort_values('C')
    # On bigger_df I would do this a bit differently: if the isin filters on F and S are the same for the three plots,
    # I would create a df_select_FS outside the loop beforehand (might be faster)
    # Now group by S and build a list of the elements in column p['x'] (and the same for p['y']),
    # then add them to the empty df_output in the right column
    df_output['xs{}'.format(n)] = df_p.groupby('S')[p['x']].apply(list)
    df_output['ys{}'.format(n)] = df_p.groupby('S')[p['y']].apply(list)
    n += 1
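As a small illustration of the groupby step on its own, the sketch below collects one list per value of S from an already sorted frame. The data and the column name A are invented for the example.

```python
import pandas as pd

# Hypothetical toy frame; the column name A is an assumption for illustration.
df_p = pd.DataFrame({
    'S': [123, 35, 123, 35],
    'C': [2, 1, 1, 2],
    'A': [7.0, 3.0, 8.0, 4.0],
}).sort_values('C')

# One list of A values per distinct S, keeping the sorted row order within each group.
lists_per_s = df_p.groupby('S')['A'].apply(list)
print(lists_per_s.to_dict())
```

The result is a Series indexed by S, which is why assigning it to a column of a DataFrame indexed by s_list lines the lists up with the right rows.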
Two notes. First, if s_list contains the same value twice, this might not work the way you want. Second, where the conditions are not met (like 145 in S in your example), you get NaN in df_output.
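The second note can be checked with a short sketch like the following, again on invented data (the column A and the values are assumptions). It shows that an S value with no matching rows ends up as NaN after the index-aligned assignment, and one way to list those values.

```python
import pandas as pd

s_list = [123, 145, 35]
# Hypothetical filtered data: 145 has no rows that satisfy the conditions.
df_p = pd.DataFrame({'S': [123, 123, 35], 'A': [8.0, 7.0, 3.0]})

df_output = pd.DataFrame(index=s_list, columns=['xs0'])
# Assignment aligns on the index, so labels absent from the grouped result stay NaN.
df_output['xs0'] = df_p.groupby('S')['A'].apply(list)

# Collect the S values that received no list.
missing = df_output.index[df_output['xs0'].isna()].tolist()
print(missing)
```

Whether to drop those rows or replace the NaN with an empty list depends on what the plotting code downstream expects.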