How to calculate a formula that takes different columns of a dataframe with the same suffix in the name and create a new column?

走远了吗. Submitted on 2021-02-11 12:40:52

Question


I have a data frame in R with the following column structure (the real one is larger):

  Material_code  actual_202009  actual_202010  actual_202011  pred_202009  pred_202010  pred_202011  
      111              30              44              24            25           52           27
      112              19              70              93            23           68           100

I would like to add new columns to the dataframe containing the respective error measure:

|actual - pred|/ actual * 100%
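
As a quick sanity check of the formula, here is the calculation for material 111 in period 202009 (actual 30, predicted 25) as a small R sketch. The helper name `ape` is hypothetical and only used for illustration.

```r
# Hypothetical helper: absolute percentage error for one actual/predicted pair
ape <- function(actual, pred) abs(actual - pred) / actual * 100

round(ape(30, 25), 2)
# [1] 16.67
```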

Obtaining this:

Material_code  actual_202009  actual_202010  actual_202011  pred_202009  pred_202010  pred_202011 MAPE_202009 MAPE_202010 MAPE_202011
      111              30              44              24            25           52          27     16.67%      18.18%       12.5%
      112              19              70              93            23           68          100    21.05%       2.86%        7.53%

I tried to create the new columns using ends_with() to select the existing ones, but I am not getting it right. Can you please help?

*** EDIT to include an easier way to generate the data frame

df <- data.frame(Material_code = c(111,112),
                    actual_202009 = c(30,19),
                    actual_202010 = c(44,70),
                    actual_202011 = c(24,93), 
                    pred_202009 = c(25,23),
                    pred_202010 = c(52,68),
                    pred_202011 = c(27,100))

Answer 1:


Get the names of all 'actual' and 'pred' columns; you can then perform the arithmetic on those column sets directly. Sorting both sets keeps each 'actual' column aligned with the 'pred' column for the same period.

actual_cols <- sort(grep('actual', names(df), value = TRUE))
pred_cols <- sort(grep('pred', names(df), value = TRUE))
new_cols <- sub('pred', 'MAPE', pred_cols)

df[new_cols] <- abs(df[actual_cols] - df[pred_cols])/df[actual_cols] * 100
df

#  Material_code actual_202009 actual_202010 actual_202011 pred_202009
#1           111            30            44            24          25
#2           112            19            70            93          23

#  pred_202010 pred_202011 MAPE_202009 MAPE_202010 MAPE_202011
#1          52          27        16.7       18.18       12.50
#2          68         100        21.1        2.86        7.53

data

df <- structure(list(Material_code = 111:112, actual_202009 = c(30L, 
19L), actual_202010 = c(44L, 70L), actual_202011 = c(24L, 93L
), pred_202009 = c(25L, 23L), pred_202010 = c(52L, 68L), pred_202011 = c(27L, 
100L)), class = "data.frame", row.names = c(NA, -2L))
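
The one-liner above depends on sort() putting the 'actual' and 'pred' columns in the same order. A minimal sketch of a more defensive variant follows, assuming every period has both an `actual_` and a `pred_` column. It pairs the columns explicitly by suffix, and optionally builds percent strings like those shown in the question. The names `periods` and `mape_pct` are hypothetical.

```r
df <- data.frame(Material_code = c(111, 112),
                 actual_202009 = c(30, 19), actual_202010 = c(44, 70),
                 actual_202011 = c(24, 93),
                 pred_202009 = c(25, 23), pred_202010 = c(52, 68),
                 pred_202011 = c(27, 100))

# Periods are taken from the 'actual_' columns (assumed naming scheme)
periods <- sub("^actual_", "", grep("^actual_", names(df), value = TRUE))

actual_cols <- paste0("actual_", periods)
pred_cols   <- paste0("pred_", periods)
mape_cols   <- paste0("MAPE_", periods)

# Fail early if a period has no matching prediction column
stopifnot(all(pred_cols %in% names(df)))

df[mape_cols] <- abs(df[actual_cols] - df[pred_cols]) / df[actual_cols] * 100

# Optional: percent strings for display only (these columns are character)
mape_pct <- as.data.frame(lapply(df[mape_cols],
                                 function(x) sprintf("%.2f%%", x)),
                          stringsAsFactors = FALSE)
mape_pct
```

Keeping the numeric MAPE columns in df and formatting only for display avoids doing arithmetic on strings later. Rows where an actual value is 0 produce Inf or NaN, so they may need separate handling.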



Answer 2:


A slightly more verbose approach using the tidyverse:

library(tidyverse)
df %>%
  pivot_longer(cols = -Material_code) %>%
  separate(name, into = c("type", "time"), sep = "_") %>%
  pivot_wider(names_from = type) %>%
  mutate(MAPE = abs(actual - pred)/actual*100) %>%
  pivot_wider(values_from = c(actual, pred, MAPE),
              names_from = time)

gives:

# A tibble: 2 x 10
  Material_code actual_202009 actual_202010 actual_202011 pred_202009 pred_202010 pred_202011 MAPE_202009 MAPE_202010 MAPE_202011
          <int>         <int>         <int>         <int>       <int>       <int>       <int>       <dbl>       <dbl>       <dbl>
1           111            30            44            24          25          52          27        16.7       18.2        12.5 
2           112            19            70            93          23          68         100        21.1        2.86        7.53



Answer 3:


You will help yourself a lot if you keep your data in long format, where each column holds one kind of data. Your table is in wide format, which is convenient for Excel and for reading by eye, but cumbersome to work with in code.

So the first thing to do (which is what @deschen did in their answer) is to convert your data to long format and then operate on it. A long version of your data looks like this:

Material_code    Type    Date   Value
          111  actual  202009      30
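
To make the long layout concrete, here is a minimal base R sketch that builds it by hand from the question's data, assuming every value column is named `<type>_<date>`. The names `value_cols` and `long` are hypothetical.

```r
df <- data.frame(Material_code = c(111, 112),
                 actual_202009 = c(30, 19), actual_202010 = c(44, 70),
                 actual_202011 = c(24, 93),
                 pred_202009 = c(25, 23), pred_202010 = c(52, 68),
                 pred_202011 = c(27, 100))

# Every column except the identifier holds a value
value_cols <- setdiff(names(df), "Material_code")

# unlist() stacks the value columns one after another, so the identifier is
# repeated once per column and each type/date label once per row
long <- data.frame(
  Material_code = rep(df$Material_code, times = length(value_cols)),
  Type  = rep(sub("_.*$", "", value_cols), each = nrow(df)),
  Date  = rep(sub("^.*_", "", value_cols), each = nrow(df)),
  Value = unlist(df[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

head(long, 3)
```

Each row of `long` is one observation: one material, one type, one date, one value. That is the shape the reshaping functions in the answers produce as an intermediate step.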

I will provide a data.table solution, that is basically the same as @deschen's. You may like this one for its speed on large data.

library(data.table)

# Convert the question's data frame to a data.table by reference
setDT(df)

df[, melt(.SD, 1)][,
    c("type", "date") := tstrsplit(variable, "_", fixed = TRUE)][,
    dcast(.SD, Material_code + date ~ type)][,
    mape := 100 * abs(actual - pred) / actual][]

  • melt(.SD, 1) converts your table from wide to long, keeping the first column (Material_code) as the identifier for each record.
  • c("type", "date") := tstrsplit(variable, "_", fixed = TRUE) creates the columns type and date from the values in variable (after melting, variable holds the former column names).
  • dcast(.SD, Material_code + date ~ type) converts the long table back to wide. This time, Material_code and date are kept as columns, and type is cast into the new columns actual and pred.
  • The := is an assignment operator. It creates the column mape and assigns the computed values to it.
  • The last bit, [], isn't strictly needed. It is there so that the result is printed to the screen. If you don't need to print the new table, omit it.
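
The result of the chain above has one row per material and date. If the wide layout from the question is needed, a second dcast() can spread the measures back out. This is a minimal sketch, assuming the data.table package is installed. The names `dt`, `long`, `res` and `wide` are hypothetical, and the steps are written separately rather than chained for readability.

```r
library(data.table)

dt <- data.table(Material_code = c(111, 112),
                 actual_202009 = c(30, 19), actual_202010 = c(44, 70),
                 actual_202011 = c(24, 93),
                 pred_202009 = c(25, 23), pred_202010 = c(52, 68),
                 pred_202011 = c(27, 100))

# Wide to long, then split the former column names into type and date
long <- melt(dt, id.vars = "Material_code")
long[, c("type", "date") := tstrsplit(variable, "_", fixed = TRUE)]

# One row per material and date, with actual and pred side by side
res <- dcast(long, Material_code + date ~ type, value.var = "value")
res[, mape := 100 * abs(actual - pred) / actual]

# Back to one row per material: one column per measure and date
wide <- dcast(res, Material_code ~ date,
              value.var = c("actual", "pred", "mape"))
names(wide)
```

With several value.var columns, dcast() prefixes each date with the measure name, so the error columns come out as mape_202009 and so on. Rename them if the upper-case MAPE_ prefix from the question is required.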


Source: https://stackoverflow.com/questions/65935389/how-to-calculate-a-formula-that-takes-different-columns-of-a-dataframe-with-the
