Question
I have two inputs in a dataframe, and I need to create an output that depends on both inputs (same row, different columns), but also on its previous value (same column, previous row).
This dataframe command will create an example of what I need:
import pandas as pd
df=pd.DataFrame([[0,0,0], [0,1,0], [0,0,0], [1,1,1], [0,1,1], [0,1,1], [0,0,0], [0,1,0], [0,1,0], [1,1,1], [1,1,1], [0,1,1], [0,1,1], [1,1,1], [0,1,1], [0,1,1], [0,0,0], [0,1,0]], columns=['input_1', 'input_2', 'output'])
The rules are simple:
- If input_1 is 1, output is 1 (input_1 is a trigger function)
- output will remain as 1 as long as input_2 is also 1. (input_2 works kind of like a memory function)
- For all the others, output will be 0
The rows go in sequence as they happen in time, I mean, row 0 output influences row 1 output, row 1 output influences row 2 output, and so on. So output depends on input_1, input_2, but also on its own previous value.
I could code it by looping through the dataframe, computing and assigning values using iloc, but that is painfully slow. I need to run this through many thousands of rows for tens of thousands of dataframes, so I am looking for the most efficient way to do it (preferably vectorized). NumPy or any other library or method you know of is fine.
I searched and found some questions about vectorization and row-looping, but I still don't see how to apply those techniques here. Example questions: "How to iterate over rows in a DataFrame in Pandas?" and "What is the most efficient way to loop through dataframes with pandas?"
I appreciate your help.
Answer 1:
As you explained in the discussion above, we have just two inputs, loaded into a pandas DataFrame:
import numpy as np
import pandas as pd
df=pd.DataFrame([[0,0], [0,1], [0,0], [1,1], [0,1], [0,1], [0,0], [0,1], [0,1], [1,1], [1,1], [0,1], [0,1], [1,1], [0,1], [0,1], [0,0], [0,1]], columns=['input_1', 'input_2'])
We have to create the output using the following rules:
#1 if input_1 is one, the output is one
#2 if both inputs are zero, the output is zero
#3 if input_1 is zero and input_2 is one, the output holds its previous value
#4 the initial output value is zero
To generate the output we can:
- copy input_1 to the output
- replace the output with the previous output wherever input_1 is zero and input_2 is one
Because of the rules above (in particular #4), the first output never needs updating.
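df['output'] = df.input_1
for idx, row in df.iterrows():
    # rule #3: hold the previous output (idx - 1 assumes the default 0..n-1 index)
    if (idx > 0) and (row.input_1 == 0) and (row.input_2 == 1):
        # .loc avoids chained assignment, which may not write back to df
        df.loc[idx, 'output'] = df.loc[idx - 1, 'output']
print(df)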
The output is:
>>> print(df)
    input_1  input_2  output
0         0        0       0
1         0        1       0
2         0        0       0
3         1        1       1
4         0        1       1
5         0        1       1
6         0        0       0
7         0        1       0
8         0        1       0
9         1        1       1
10        1        1       1
11        0        1       1
12        0        1       1
13        1        1       1
14        0        1       1
15        0        1       1
16        0        0       0
17        0        1       0
UPDATE1
A faster way is a modification of the formula proposed by @Andrej:
df['output_2'] = (df['input_1'] + df['input_2'] * 2).replace(2, np.nan).ffill().replace(3, 1).astype(int)
Without the modification, his formula produces the wrong output for the input combination [1, 0]: it holds the previous output instead of setting it to 1.
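One edge case is worth noting: if the very first row is a hold row (input_1 = 0, input_2 = 1), the forward fill has nothing to copy, the NaN survives, and astype(int) raises. Below is a minimal sketch of the same idea wrapped in a function that falls back to the initial output of 0 from rule #4. The function name latch_output is hypothetical, and the sketch assumes both input columns contain only 0 and 1.

```python
import pandas as pd

def latch_output(df: pd.DataFrame) -> pd.Series:
    """Vectorized output column (illustrative name, not from the thread)."""
    # Encode each row as 0..3: (0,0)->0, (1,0)->1, (0,1)->2, (1,1)->3.
    code = df["input_1"] + 2 * df["input_2"]
    # Code 2 means "hold": mask it so ffill copies the last decided code.
    held = code.where(code != 2).ffill()
    # A hold with no earlier decided row falls back to the initial output 0.
    held = held.fillna(0)
    # Codes 1 and 3 both mean "output is 1"; code 0 means "output is 0".
    return (held > 0).astype(int)

# Usage: the first row here is a hold row, which the one-liner cannot cast.
example = pd.DataFrame(
    [[0, 1], [1, 0], [0, 1], [0, 0], [0, 1], [1, 1], [0, 1]],
    columns=["input_1", "input_2"],
)
example["output"] = latch_output(example)
```

The only behavioral difference from the one-line formula is the fillna(0) step; for data that never starts with a hold row the two should agree.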
UPDATE2
This is just to compare the results:
df=pd.DataFrame([[0,0], [1,0], [0,1], [1,1], [0,1], [0,1], [0,0], [0,1], [0,1], [1,1], [1,1], [0,1], [0,1], [1,1], [0,1], [0,1], [0,0], [0,1]], columns=['input_1', 'input_2'])
df['output'] = df.input_1
for idx, row in df.iterrows():
    if (idx > 0) and (row.input_1 == 0) and (row.input_2 == 1):
        # .loc avoids chained assignment, which may not write back to df
        df.loc[idx, 'output'] = df.loc[idx - 1, 'output']
# output_1: the modified formula
df['output_1'] = (df['input_1'] + df['input_2'] * 2).replace(2, np.nan).ffill().replace(3, 1).astype(int)
# output_2: the original formula proposed by @Andrej
df['output_2'] = (df['input_1'] + df['input_2']).replace(1, np.nan).ffill().replace(2, 1).astype(int)
print(df)
The result is:
>>> print(df)
    input_1  input_2  output  output_1  output_2
0         0        0       0         0         0
1         1        0       1         1         0
2         0        1       1         1         0
3         1        1       1         1         1
4         0        1       1         1         1
5         0        1       1         1         1
6         0        0       0         0         0
7         0        1       0         0         0
8         0        1       0         0         0
9         1        1       1         1         1
10        1        1       1         1         1
11        0        1       1         1         1
12        0        1       1         1         1
13        1        1       1         1         1
14        0        1       1         1         1
15        0        1       1         1         1
16        0        0       0         0         0
17        0        1       0         0         0
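Since the question allows NumPy, the same rules can also be sketched directly on arrays. This is a different technique from the replace/ffill formula: it forward-fills row indices with np.maximum.accumulate and then looks up the value at the last "decided" row. The function name latch_output_np is hypothetical, and the sketch assumes both inputs are one-dimensional and contain only 0 and 1. Whether it is actually faster on your data would need to be measured.

```python
import numpy as np

def latch_output_np(input_1, input_2):
    """NumPy variant of the four rules (illustrative name, not from the thread)."""
    in1 = np.asarray(input_1)
    in2 = np.asarray(input_2)
    # A row is "decided" unless it is a hold row (input_1 == 0 and input_2 == 1).
    # On a decided row the output simply equals input_1.
    decided = (in1 == 1) | (in2 == 0)
    # Index of the latest decided row at or before each row, or -1 if none yet.
    idx = np.where(decided, np.arange(len(in1)), -1)
    last = np.maximum.accumulate(idx)
    # Prepend the initial output 0 so that "no decided row yet" (-1) maps to it.
    values = np.concatenate(([0], in1))
    return values[last + 1]

# Usage on plain lists; with a dataframe, pass df["input_1"].to_numpy() etc.
out = latch_output_np([0, 1, 0, 0, 0, 1, 0], [1, 0, 1, 0, 1, 1, 1])
```

Because the initial value is prepended explicitly, a leading hold row yields 0 rather than an error.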
Answer 2:
If I understand you correctly, you want to know how to compute the output column. You can do, for example:
df['output_2'] = (df['input_1'] + df['input_2']).replace(1, np.nan).ffill().replace(2, 1).astype(int)
print(df)
Prints:
    input_1  input_2  output  output_2
0         0        0       0         0
1         0        1       0         0
2         0        0       0         0
3         1        1       1         1
4         0        1       1         1
5         0        1       1         1
6         0        0       0         0
7         0        1       0         0
8         0        1       0         0
9         1        1       1         1
10        1        1       1         1
11        0        1       1         1
12        0        1       1         1
13        1        1       1         1
14        0        1       1         1
15        0        1       1         1
16        0        0       0         0
17        0        1       0         0
Source: https://stackoverflow.com/questions/59811683/how-to-vectorize-a-function-that-uses-both-row-and-column-elements-of-a-datafram