Question
Suppose I have n data frames df_1, df_2, df_3, ..., df_n, containing columns named SPEED1, SPEED2, SPEED3, ..., SPEEDn respectively, for instance:
import pandas as pd
import numpy as np
df_1 = pd.DataFrame({'SPEED1':np.random.uniform(0,600,100)})
df_2 = pd.DataFrame({'SPEED2':np.random.uniform(0,600,100)})
and I want to make the same changes to all of the data frames. How do I do so by defining a function along these lines?
def modify(df,nr):
    df_invalid_nr=df_nr[df_nr['SPEED'+str(nr)]>500]
    df_valid_nr=~df_invalid_nr
    Invalid_cycles_nr=df[df_invalid]
    df=df[df_valid]
    print(Invalid_cycles_nr)
    print(df)
So, when I try to run the above function
modify(df_1,1)
It returns the entire data frame without modification and the invalid cycles as an empty array. I am guessing I need to define the modification on the global dataframe somewhere in the function for this to work.
I am also not sure if I could do this another way, say just looping an iterator through all the data frames. But, I am not sure it will work.
for i in range(1,n+1):
    df_invalid_i=df_i[df_i['SPEED'+str(i)]>500]
    df_valid_i=~df_invalid_i
    Invalid_cycles_i=df[df_invalid]
    df=df[df_valid]
    print(Invalid_cycles_i)
    print(df)
How do I, in general, access df_1 using an iterator? It seems to be a problem.
Any help would be appreciated, thanks!
Answer 1:
Solution
Inputs
import pandas as pd
import numpy as np
df_1 = pd.DataFrame({'SPEED1': np.random.uniform(1, 600, 100)})
df_2 = pd.DataFrame({'SPEED2': np.random.uniform(1, 600, 100)})
Code
To my mind a better approach would be to store your dfs in a list and enumerate over it, adding a valid column to each df:
for idx, df in enumerate([df_1, df_2]):
    col = 'SPEED'+str(idx+1)
    df['valid'] = df[col] <= 500
print(df_1)
       SPEED1  valid
0  516.395756  False
1   14.643694   True
2  478.085372   True
3  592.831029  False
4    1.431332   True
You can then filter for valid or invalid rows with df_1[df_1.valid] or df_1[df_1.valid == False].
This solution fits your problem as stated; see "Another (better?) solution" for a possibly cleaner approach and "Notes" below for the explanations you need.
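As a minimal sketch of the list-based approach above, the frames can be flagged and then split into valid and invalid rows in the same loop. The names `frames`, `valid_rows`, and `invalid_rows` are illustrative and not part of the original answer, and the fixed seed is assumed only to make the run repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, assumed only for repeatability
df_1 = pd.DataFrame({'SPEED1': rng.uniform(0, 600, 100)})
df_2 = pd.DataFrame({'SPEED2': rng.uniform(0, 600, 100)})

frames = [df_1, df_2]  # hypothetical name for the list of DataFrames

valid_rows = []
invalid_rows = []
for idx, df in enumerate(frames, start=1):
    col = 'SPEED' + str(idx)
    # Adding a column mutates the object held in the list,
    # so df_1 and df_2 gain the 'valid' column as well
    df['valid'] = df[col] <= 500
    valid_rows.append(df[df['valid']])
    invalid_rows.append(df[~df['valid']])

print(invalid_rows[0].head())
```

Because the loop only adds a column to each existing object, no reassignment of df_1 or df_2 is needed; the filtered results live in the two new lists.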
Another (better?) solution
If possible, rethink your code. Each DataFrame has a single speed column, so name it SPEED:
dfs = dict(df_1=pd.DataFrame({'SPEED':np.random.uniform(0,600,100)}),
           df_2=pd.DataFrame({'SPEED':np.random.uniform(0,600,100)}))
It will allow you to use the following one-liner:
dfs = dict(map(lambda key_val: (key_val[0],
                                key_val[1].assign(valid = key_val[1]['SPEED'] <= 500)),
               dfs.items()))
print(dfs['df_1'])
        SPEED  valid
0  516.395756  False
1   14.643694   True
2  478.085372   True
3  592.831029  False
4    1.431332   True
Explanations:
dfs.items() returns the key/value pairs of the dict (keys are the names, values are the DataFrames).
map(foo, bar) applies the function foo (see this answer, and DataFrame assign) to all the elements of bar (i.e. to all the key/value pairs of dfs.items()).
dict() casts the map to a dict.
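The same transformation can also be written as a dict comprehension, which some readers may find easier to follow than dict(map(lambda ...)). This is only a sketch of an equivalent form, not the original answer's code; it assumes every frame has a column named SPEED.

```python
import numpy as np
import pandas as pd

dfs = {
    'df_1': pd.DataFrame({'SPEED': np.random.uniform(0, 600, 100)}),
    'df_2': pd.DataFrame({'SPEED': np.random.uniform(0, 600, 100)}),
}

# assign() returns a new DataFrame with the extra column;
# the comprehension rebuilds the dict with those new frames
dfs = {name: df.assign(valid=df['SPEED'] <= 500) for name, df in dfs.items()}

print(dfs['df_1'].head())
```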
Notes
About modify
Notice that your function modify does not return anything... I suggest reading more about mutability and immutability in Python. This article is interesting.
You can then test the following for instance:
def modify(df):
    df = df[df.SPEED1 < 0.5]
    # The change to df is local to the function's scope;
    # it will not modify your input, so return the df...
    return df
# ... and assign the output to apply the changes
df_1 = modify(df_1)
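To make the difference concrete, here is a small sketch contrasting a function that only rebinds its local name with one that returns the filtered frame. The function names `filter_no_return` and `filter_with_return` and the threshold of 500 are illustrative assumptions.

```python
import pandas as pd

def filter_no_return(df):
    # Rebinds the local name only; the caller's DataFrame is untouched
    df = df[df['SPEED1'] <= 500]

def filter_with_return(df):
    # Returns the filtered frame so the caller can rebind its own name
    return df[df['SPEED1'] <= 500]

df_1 = pd.DataFrame({'SPEED1': [100.0, 550.0, 300.0]})

filter_no_return(df_1)
print(len(df_1))  # still 3 rows

df_1 = filter_with_return(df_1)
print(len(df_1))  # now 2 rows
```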
About accessing df_1 using an iterator
Notice that when you do:
for i in range(1,n+1):
    df_i something
the name df_i in your loop always refers to a single object literally named df_i; the loop variable i is not substituted into the name, so df_1, df_2, etc. are never accessed.
To look up an object by its name, use globals()['df_'+str(i)] instead (assuming that df_1 to df_n are located in globals()) - from this answer.
To my mind it is not a clean approach. I don't know how you create your DataFrames, but if possible I suggest storing them in a dictionary instead of assigning them to separate variables manually:
dfs = {}
dfs['df_1'] = ...
or a bit more automatically if df_1 to df_n already exist - according to the first part of vestland's answer:
dfs = dict((var, eval(var)) for
           var in dir() if
           isinstance(eval(var), pd.core.frame.DataFrame) and 'df_' in var)
Then it would be easier for you to iterate over your DataFrames:
for i in range(1,n+1):
    dfs['df_'+str(i)]  # ... do something with this DataFrame
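If the frames can be created inside the dictionary from the start, neither globals() nor eval is needed. The following is a minimal sketch under that assumption; `n = 3` and the random data are placeholders for however the real frames are produced.

```python
import numpy as np
import pandas as pd

n = 3  # number of DataFrames, assumed for illustration

# Build the frames directly in a dict instead of creating df_1 ... df_n by hand
dfs = {
    'df_' + str(i): pd.DataFrame({'SPEED' + str(i): np.random.uniform(0, 600, 100)})
    for i in range(1, n + 1)
}

for i in range(1, n + 1):
    name = 'df_' + str(i)
    col = 'SPEED' + str(i)
    df = dfs[name]
    invalid = df[df[col] > 500]
    # Store the filtered frame back under the same key
    dfs[name] = df[df[col] <= 500]
    print(name, len(invalid), 'invalid rows')
```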
Answer 2:
You can use the globals() function, which lets you get a variable by its name.
I just add df_i = globals()["df_"+str(i)] at the beginning of the for loop:
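for i in range(1,n+1):
    df_i = globals()["df_"+str(i)]
    # Boolean mask marking the invalid rows of this DataFrame
    invalid_mask = df_i['SPEED'+str(i)] > 500
    Invalid_cycles_i = df_i[invalid_mask]
    # Note: this rebinds the local name df_i only, not the global df_1, df_2, ...
    df_i = df_i[~invalid_mask]
    print(Invalid_cycles_i)
    print(df_i)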
Answer 3:
Your code sample leaves me a little confused, but focusing on
I want to make the same changes to all of the data frames.
and
How do I, in general, access df_1 using an iterator?
you can do exactly that by organizing your dataframes (dfs) in a dictionary (dict).
Here's how:
Assuming you've got a bunch of variables in your namespace...
# Imports
import pandas as pd
import numpy as np
# A few dataframes with random numbers
# df_1
np.random.seed(123)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_1 = pd.DataFrame(np.random.randint(100,150,size=(rows, 2)), columns=['a', 'b'])
df_1 = df_1.set_index(rng)
# df_2
np.random.seed(456)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_2 = pd.DataFrame(np.random.randint(100,150,size=(rows, 2)), columns=['c', 'd'])
df_2 = df_2.set_index(rng)
# df_3
np.random.seed(789)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_3 = pd.DataFrame(np.random.randint(100,150,size=(rows, 2)), columns=['e', 'f'])
df_3 = df_3.set_index(rng)
...you can identify all that are dataframes using:
alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
If you've got a lot of different dataframes but would only like to focus on those that have a prefix like 'df_', you can identify those by...
dfNames = []
for elem in alldfs:
    if str(elem)[:3] == 'df_':
        dfNames.append(elem)
... and then organize them in a dict using:
myFrames = {}
for dfName in dfNames:
    myFrames[dfName] = eval(dfName)
From that dict of interesting dataframes, you can subset those that you'd like to do something with. Here's how you focus only on df_1 and df_2:
invalid = ['df_3']
for inv in invalid:
    myFrames.pop(inv, None)
Now you can reference ALL your valid dfs by looping through them:
for key in myFrames.keys():
    print(myFrames[key])
And that should cover the...
How do I, in general, access df_1 using an iterator?
...part of the question.
And you can of course reference a single dataframe by its name / key in the dict:
print(myFrames['df_1'])
From here you can do something with ALL columns in ALL dataframes.
for key in myFrames.keys():
    myFrames[key] = myFrames[key]*10
    print(myFrames[key])
Or, being a bit more pythonic, you can specify a lambda function and apply that to a subset of columns
# A function
decimator = lambda x: x/10
# A subset of columns:
myCols = ['SPEED1', 'SPEED2']
Apply that function to your subset of columns in your dataframes of interest:
for key in myFrames.keys():
    for col in list(myFrames[key]):
        if col in myCols:
            myFrames[key][col] = myFrames[key][col].apply(decimator)
            print(myFrames[key][col])
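For simple arithmetic such as dividing by 10, the per-element apply with a lambda can be replaced by column arithmetic, which pandas evaluates element-wise. This is a sketch of that alternative, not the answer's own code; the frames here are stand-ins built with SPEED columns so that myCols matches something.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
myFrames = {
    'df_1': pd.DataFrame(np.random.randint(100, 150, size=(12, 3)),
                         columns=['SPEED1', 'SPEED2', 'SPEED3']),
    'df_2': pd.DataFrame(np.random.randint(100, 150, size=(12, 3)),
                         columns=['SPEED1', 'SPEED2', 'SPEED3']),
}
myCols = ['SPEED1', 'SPEED2']

for frame in myFrames.values():
    for col in myCols:
        # Skip columns that a given frame does not have
        if col in frame.columns:
            # Dividing a whole column operates on every element at once
            frame[col] = frame[col] / 10

print(myFrames['df_1'].head())
```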
So, back to your function...
modify(df_1,1)
... here's my take on it wrapped in a function.
First we'll redefine the dataframes and the function.
Oh, and with this setup, you're going to have to obtain all dfs OUTSIDE your function with alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)].
Here's the datasets and the function for an easy copy-paste:
# Imports
import pandas as pd
import numpy as np
# A few dataframes with random numbers
# df_1
np.random.seed(123)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_1 = pd.DataFrame(np.random.randint(100,150,size=(rows, 3)), columns=['SPEED1', 'SPEED2', 'SPEED3'])
df_1 = df_1.set_index(rng)
# df_2
np.random.seed(456)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_2 = pd.DataFrame(np.random.randint(100,150,size=(rows, 3)), columns=['SPEED1', 'SPEED2', 'SPEED3'])
df_2 = df_2.set_index(rng)
# df_3
np.random.seed(789)
rows = 12
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_3 = pd.DataFrame(np.random.randint(100,150,size=(rows, 3)), columns=['SPEED1', 'SPEED2', 'SPEED3'])
df_3 = df_3.set_index(rng)
# A function that divides columns by 10
decimator = lambda x: x/10
# A reference to all available dataframes
alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
# A function as per your request
def modify(dfs, cols, fx):
    """ Define a subset of available dataframes and list of interesting columns, and
    apply a function on those columns.
    """
    # Subset all dataframes with names that start with df_
    dfNames = []
    for elem in alldfs:
        if str(elem)[:3] == 'df_':
            dfNames.append(elem)
    # Organize those dfs in a dict if they match the dataframe names of interest
    myFrames = {}
    for dfName in dfNames:
        if dfName in dfs:
            myFrames[dfName] = eval(dfName)
    print(myFrames)
    # Apply fx to the cols of your dfs subset
    for key in myFrames.keys():
        for col in list(myFrames[key]):
            if col in cols:
                myFrames[key][col] = myFrames[key][col].apply(fx)
# A testrun. Results in screenshots below
modify(dfs = ['df_1', 'df_2'], cols = ['SPEED1', 'SPEED2'], fx = decimator)
Here are dataframes df_1 and df_2 before manipulation: [screenshot not included]
Here are the dataframes after manipulation: [screenshot not included]
Anyway, this is how I would approach it.
Hope you'll find it useful!
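A variation on the function above passes the dictionary of frames in explicitly, so it does not depend on a global alldfs list or on eval. This is only a sketch of that design; the name `modify_frames` and the `frames` dict are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd

def modify_frames(frames, cols, fx):
    """Apply fx to each listed column of every DataFrame in the dict `frames`."""
    for frame in frames.values():
        for col in cols:
            if col in frame.columns:
                frame[col] = frame[col].apply(fx)
    return frames

np.random.seed(123)
frames = {
    'df_' + str(i): pd.DataFrame(np.random.randint(100, 150, size=(12, 3)),
                                 columns=['SPEED1', 'SPEED2', 'SPEED3'])
    for i in range(1, 4)
}

# Pick the frames of interest by key; the values are the same objects as in `frames`
selected = {name: frames[name] for name in ['df_1', 'df_2']}
modify_frames(selected, cols=['SPEED1', 'SPEED2'], fx=lambda x: x / 10)

print(frames['df_1'].head())
```

Since `selected` holds references to the same DataFrame objects, the changes are visible through `frames` as well, while df_3 and the SPEED3 columns stay untouched.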
Source: https://stackoverflow.com/questions/48536802/iterating-over-different-data-frames-using-an-iterator