Creating a custom cumulative sum that calculates the downstream quantities given a list of locations and their order

╄→尐↘猪︶ㄣ submitted on 2021-02-08 04:41:42

Question


I am trying to write code that calculates, for each location, the cumulative value of everything upstream of it. A plain cumulative sum almost accomplishes this, but some locations contribute to the same downstream point. Additionally, the most upstream points (the starting points) have nothing contributing to them and can keep their starting values in the final cumulative DataFrame.

Let's say I have the following DataFrame for each site.

import numpy as np
import pandas as pd

# Ten random values per site
df = pd.DataFrame({
    "Site 1": np.random.rand(10),
    "Site 2": np.random.rand(10),
    "Site 3": np.random.rand(10),
    "Site 4": np.random.rand(10),
    "Site 5": np.random.rand(10)})

I also have a table of data that has each site and its corresponding downstream component.

# One column per site; the single row holds that site's downstream site
df_order = pd.DataFrame({
    "Site 1": ["Site 3"],
    "Site 2": ["Site 3"],
    "Site 3": ["Site 4"],
    "Site 4": ["Site 5"],
    "Site 5": [None]})

I want to do the following:

1) Sum the upstream values to get the cumulative sum at the respective downstream site. For instance, Site 1 and Site 2 contribute to the value at Site 3. So, I want to add Site 1, Site 2, and Site 3 together to get a cumulative value at Site 3.

2) Now that I have that cumulative value at Site 3, I want to save that cumulative value to Site 3 in "df". Now I want to propagate that value to Site 4, save it by updating the DataFrame, and then proceed to Site 5.

I can get close-ish using cumsum to get the cumulative value at each site, like this:

df = df.cumsum(axis=1)

However, this does not take into account that Site 1 and Site 2 are contributing to Site 3, and not each other.

Well, I can solve this manually using:

df['Site 3'] = df.loc[:,'Site 1':'Site 3'].sum(axis = 1)
df['Site 4'] = df.loc[:,'Site 3':'Site 4'].sum(axis = 1)
df['Site 5'] = df.loc[:,'Site 4':'Site 5'].sum(axis = 1)

However, my actual list of sites is much more extensive, and the manual method doesn't take the provided "df_order" into account. Is there a way to link the "df_order" DataFrame so that this is calculated automatically? I know how to do this manually, but how would I extend it to handle a larger DataFrame and order of sites?

Think of a larger DataFrame, potentially up to 50 sites, that looks like:

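# One column per site; the single row holds that site's downstream site
df_order = pd.DataFrame({
    "Site 1": ["Site 3"],
    "Site 2": ["Site 3"],
    "Site 3": ["Site 4"],
    "Site 4": ["Site 5"],
    "Site 5": ["Site 8"],
    "Site 6": ["Site 8"],
    "Site 7": ["Site 8"],
    "Site 8": ["Site 9"],
    "Site 9": [None]})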

Answer 1:


You can use networkx to deal with the relationships. First, make your order DataFrame like:

print(df_order)
   source  target
0  Site 1  Site 3
1  Site 2  Site 3
2  Site 3  Site 4
3  Site 4  Site 5
4  Site 5    None
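
If the relationships start out as a plain site-to-downstream mapping, as in the question, the two-column layout shown above can be built roughly like this. This is a minimal sketch; the `downstream` dict is a hypothetical name used only for illustration.

```python
import pandas as pd

# Hypothetical mapping of each site to its downstream site (None = outlet).
downstream = {
    "Site 1": "Site 3",
    "Site 2": "Site 3",
    "Site 3": "Site 4",
    "Site 4": "Site 5",
    "Site 5": None,
}

# One row per site: 'source' is the site, 'target' is the site it drains into.
df_order = pd.DataFrame(list(downstream.items()), columns=["source", "target"])
print(df_order)
```

The row with a missing target is kept here so that every site still appears in the `source` column.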

Create the directed graph:

import networkx as nx
G = nx.from_pandas_edgelist(df_order.dropna(), 
                            source='source', target='target', 
                            create_using=nx.DiGraph)

nx.draw(G, with_labels=True)


With this directed graph, you want all of the predecessors of each site, which we can collect recursively. (Your graph must be a directed acyclic graph; otherwise the recursion never terminates.)

def all_preds(G, target):
    preds=[target]
    for p in list(G.predecessors(target)):
        preds += all_preds(G, p)
    return preds

#Ex.
all_preds(G, 'Site 4')
['Site 4', 'Site 3', 'Site 1', 'Site 2']
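
As a possible alternative to the hand-written recursion, networkx provides `nx.ancestors`, which returns every node with a path to the target as a set, so each upstream site appears only once even if it could be reached by more than one route. The sketch below rebuilds the same small graph so that it runs on its own; `all_upstream` is a hypothetical helper name, and the acyclicity check is an added precaution rather than part of the original answer.

```python
import networkx as nx

# Same edges as the example above: each site points to its downstream site.
G = nx.DiGraph([
    ("Site 1", "Site 3"),
    ("Site 2", "Site 3"),
    ("Site 3", "Site 4"),
    ("Site 4", "Site 5"),
])

# Fail early if the site ordering contains a loop.
if not nx.is_directed_acyclic_graph(G):
    raise ValueError("Site ordering contains a cycle")

def all_upstream(G, target):
    """Return target plus every node with a path to it, each listed once."""
    return [target] + sorted(nx.ancestors(G, target))

print(all_upstream(G, "Site 4"))
# ['Site 4', 'Site 1', 'Site 2', 'Site 3']
```

The order of the upstream sites differs from the recursive version, which does not matter for summing columns.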

We can now create your downstream sums by looping over all of your unique sites and summing the columns returned by this function.

pd.concat([
    df[all_preds(G, target)].sum(1).rename(target)
    for target in df_order['source'].unique()
    ], axis=1)

Output using np.random.seed(42):

     Site 1    Site 2    Site 3    Site 4    Site 5
0  0.374540  0.020584  1.006978  1.614522  1.736561
1  0.950714  0.969910  2.060118  2.230642  2.725819
2  0.731994  0.832443  1.856581  1.921633  1.956021
3  0.598658  0.212339  1.177359  2.126245  3.035565
4  0.156019  0.181825  0.793914  1.759546  2.018326
5  0.155995  0.183405  1.124575  1.932972  2.595495
6  0.058084  0.304242  0.562000  0.866613  1.178324
7  0.866176  0.524756  1.905167  2.002839  2.522907
8  0.601115  0.431945  1.625475  2.309708  2.856418
9  0.708073  0.291229  1.045752  1.485905  1.670759
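
The approach above recomputes the full upstream set for every site. For a larger network, one alternative is to walk the graph once in topological order and pass each site's running total to its downstream site. The following is only a sketch of that idea, not part of the original answer, and it assumes each site drains into at most one downstream site (otherwise values would be counted more than once); `cumulative` is an illustrative name.

```python
import networkx as nx
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({f"Site {i}": np.random.rand(10) for i in range(1, 6)})

df_order = pd.DataFrame({
    "source": ["Site 1", "Site 2", "Site 3", "Site 4", "Site 5"],
    "target": ["Site 3", "Site 3", "Site 4", "Site 5", None],
})

G = nx.from_pandas_edgelist(df_order.dropna(), source="source",
                            target="target", create_using=nx.DiGraph)
G.add_nodes_from(df.columns)  # keep sites that have no edges at all

cumulative = df.copy()
# Visit sites upstream-first so each total is final before it is passed on.
for site in nx.topological_sort(G):
    for downstream_site in G.successors(site):
        cumulative[downstream_site] += cumulative[site]

print(cumulative)
```

Each edge is visited once, so the work grows with the number of sites rather than with the size of every upstream set.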


Source: https://stackoverflow.com/questions/60344310/creating-a-custom-cumulative-sum-that-calculates-the-downstream-quantities-given
