Question
I am trying to write code that calculates, at each site, the cumulative value of everything upstream of it. A plain cumulative sum almost accomplishes this, but several sites can contribute to the same downstream point. Additionally, the most upstream sites (the starting points) have nothing contributing to them, so they can keep their starting values in the final cumulative DataFrame.
Let's say I have the following DataFrame for each site.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Site 1": np.random.rand(10),
    "Site 2": np.random.rand(10),
    "Site 3": np.random.rand(10),
    "Site 4": np.random.rand(10),
    "Site 5": np.random.rand(10)})
I also have a table of data that has each site and its corresponding downstream component.
df_order = pd.DataFrame({
    "Site 1": ["Site 3"],
    "Site 2": ["Site 3"],
    "Site 3": ["Site 4"],
    "Site 4": ["Site 5"],
    "Site 5": [None]})
I want to do the following:
1) Sum the upstream values to get a cumulative sum at the respective downstream site. For instance, Site 1 and Site 2 contribute to the value at Site 3, so I want to add Site 1, Site 2, and Site 3 together to get a cumulative value at Site 3.
2) Now that I have that cumulative value at Site 3, I want to save that cumulative value to Site 3 in "df". Now I want to propagate that value to Site 4, save it by updating the DataFrame, and then proceed to Site 5.
I can get close-ish using cumsum to get the cumulative value at each site, like this:
df = df.cumsum(axis=1)
However, this does not take into account that Site 1 and Site 2 are contributing to Site 3, and not each other.
Well, I can solve this manually using:
df['Site 3'] = df.loc[:, 'Site 1':'Site 3'].sum(axis=1)
df['Site 4'] = df.loc[:, 'Site 3':'Site 4'].sum(axis=1)
df['Site 5'] = df.loc[:, 'Site 4':'Site 5'].sum(axis=1)
However, my actual list of sites is much more extensive, and the manual method doesn't automatically take the provided "df_order" into account. Is there a way to link the "df_order" DataFrame in so that this can be calculated automatically? I know how to do this manually; how would I expand it to handle a larger DataFrame and ordering of sites?
Think of a larger DataFrame, potentially up to 50 sites, that looks like:
df_order = pd.DataFrame({
    "Site 1": ["Site 3"],
    "Site 2": ["Site 3"],
    "Site 3": ["Site 4"],
    "Site 4": ["Site 5"],
    "Site 5": ["Site 8"],
    "Site 6": ["Site 8"],
    "Site 7": ["Site 8"],
    "Site 8": ["Site 9"],
    "Site 9": [None]})
Answer 1:
You can use networkx to handle the relationships. First, reshape your order data into an edge-list DataFrame like:
print(df_order)
source target
0 Site 1 Site 3
1 Site 2 Site 3
2 Site 3 Site 4
3 Site 4 Site 5
4 Site 5 None
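One way to build that edge-list DataFrame from the question's mapping (this construction is my own, not part of the original answer; the mapping values are the question's example data):

```python
import pandas as pd

# Site -> downstream site, as given in the question
downstream = {"Site 1": "Site 3", "Site 2": "Site 3",
              "Site 3": "Site 4", "Site 4": "Site 5", "Site 5": None}

# Each (site, downstream) pair becomes one edge-list row
df_order = pd.DataFrame(downstream.items(), columns=["source", "target"])
```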
Create the directed graph:
import networkx as nx

G = nx.from_pandas_edgelist(df_order.dropna(),
                            source='source', target='target',
                            create_using=nx.DiGraph)
nx.draw(G, with_labels=True)
With this directed graph you want to get all of the predecessors of each site. We can do this recursively. (Your graph should be a directed acyclic graph, otherwise the recursion will not terminate.)
def all_preds(G, target):
    preds = [target]
    for p in G.predecessors(target):
        preds += all_preds(G, p)
    return preds
# Example:
all_preds(G, 'Site 4')
# ['Site 4', 'Site 3', 'Site 1', 'Site 2']
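As an aside (not part of the original answer), networkx also ships `nx.ancestors`, which returns the set of all upstream nodes directly, so the recursive helper can be avoided when the order of the result doesn't matter:

```python
import networkx as nx

# Same edges as the question's small example
G = nx.DiGraph([("Site 1", "Site 3"), ("Site 2", "Site 3"),
                ("Site 3", "Site 4"), ("Site 4", "Site 5")])

# nx.ancestors excludes the node itself, so add it back in
preds = {"Site 4"} | nx.ancestors(G, "Site 4")
```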
And we can now create your downstream sums by looping over all of your unique sites and summing the columns this function returns.
pd.concat([
    df[all_preds(G, target)].sum(1).rename(target)
    for target in df_order['source'].unique()
], axis=1)
Output using np.random.seed(42)
Site 1 Site 2 Site 3 Site 4 Site 5
0 0.374540 0.020584 1.006978 1.614522 1.736561
1 0.950714 0.969910 2.060118 2.230642 2.725819
2 0.731994 0.832443 1.856581 1.921633 1.956021
3 0.598658 0.212339 1.177359 2.126245 3.035565
4 0.156019 0.181825 0.793914 1.759546 2.018326
5 0.155995 0.183405 1.124575 1.932972 2.595495
6 0.058084 0.304242 0.562000 0.866613 1.178324
7 0.866176 0.524756 1.905167 2.002839 2.522907
8 0.601115 0.431945 1.625475 2.309708 2.856418
9 0.708073 0.291229 1.045752 1.485905 1.670759
Source: https://stackoverflow.com/questions/60344310/creating-a-custom-cumulative-sum-that-calculates-the-downstream-quantities-given