Linear regression with conditional statement in R

Question

I have a huge database and I need to run different regressions with conditional statements. I see two options to do it: 1) pass the subset inside the regression call (data = subset(data1, industrycodes==12)), or 2) cut the data beforehand to the rows where industrycodes==12. The two should give the same results, but they don't. Could somebody help me with the code? I think I have a problem with it. Here is a very basic example to explain it.

ID  roa   employees    industrycodes
1   0,5      10              12
2   0,3      20              11
3   0,8      15              12
4   0,2      12              12
5   0,7      13              11
6   0,4       8              12

So first I create the sub-dataset to compare against (the rows where the industry code is 12):

data2<-data1[data1$industrycodes==12,]

and here I run the regressions:

1) for the whole data taking only industrycodes==12 --> here I have the 6 observations

summary(lm(data1$roa~data1$employees, data=subset(data1,industrycodes==12)))

2) cutting the sample when the industrycode==12 --> here of course I have 4 observations

summary(lm(data2$roa~data2$employees),data=data2)

Any ideas of what can be wrong?? Thank you!

Answer 1:

The problem is that in the first call you specify one dataset (the one called subset(data1,industrycodes==12)) but then run lm on another one: the formula refers to data1$roa and data1$employees, which come from the original data1.

An extra comment: since you pass data=... to lm, you do not have to qualify the variables with $. Inside lm, the data argument works like an attach command.

try this:

data3<- subset(data1,industrycodes==12)

summary(lm(roa~employees, data=data3) )
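To see why the `$` form ignores the data argument, one can compare how many rows each fit actually uses. This is a minimal sketch: the data frame is retyped from the question's toy example with decimal points, and fit_all and fit_sub are illustrative names that do not appear in the original post.

```r
# Toy data from the question, with decimal points instead of commas
data1 <- data.frame(
  ID            = 1:6,
  roa           = c(0.5, 0.3, 0.8, 0.2, 0.7, 0.4),
  employees     = c(10, 20, 15, 12, 13, 8),
  industrycodes = c(12, 11, 12, 12, 11, 12)
)
data3 <- subset(data1, industrycodes == 12)

# data1$roa and data1$employees are full vectors pulled from data1,
# so nothing in the formula is looked up in the `data` argument
fit_all <- lm(data1$roa ~ data1$employees, data = data3)

# Bare column names are looked up in `data`, so only the subset is used
fit_sub <- lm(roa ~ employees, data = data3)

nobs(fit_all)  # 6: all rows of data1
nobs(fit_sub)  # 4: only the rows with industrycodes == 12
```

Checking nobs() (or the residual degrees of freedom in the summary) is a quick way to confirm that a subset was really applied.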

Hope it works

Answer 2:

Welcome to StackOverflow. I get exactly the same results for both cases; the only thing I changed was to replace the commas "," with points "." so that the decimal places in roa are read correctly.

data1

  ID roa employees industrycodes
1  1 0.5        10            12
2  2 0.3        20            11
3  3 0.8        15            12
4  4 0.2        12            12
5  5 0.7        13            11
6  6 0.4         8            12
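If the original file really uses decimal commas, the replacement does not have to be done by hand. The sketch below assumes the data are read from whitespace-separated text (the txt string is a retyped copy of the question's table, not the poster's actual file) and uses the dec argument of read.table.

```r
# Retyped copy of the question's table, decimal commas kept as posted
txt <- "ID roa employees industrycodes
1 0,5 10 12
2 0,3 20 11
3 0,8 15 12
4 0,2 12 12
5 0,7 13 11
6 0,4 8 12"

# dec = "," tells read.table to treat the comma as the decimal mark
data1 <- read.table(text = txt, header = TRUE, dec = ",")

str(data1)  # roa should now be numeric rather than character or factor
```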

summary(lm(data1$roa~data1$employees, data=subset(data1,industrycodes==12)))
summary(lm(data1$roa~data1$employees, data=data2))

First case results:

Call:
lm(formula = data1$roa ~ data1$employees, data = subset(data1, 
    industrycodes == 12))

Residuals:
       1        2        3        4        5        6 
 0.01667 -0.18333  0.31667 -0.28333  0.21667 -0.08333 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      4.833e-01  3.742e-01   1.292    0.266
data1$employees -5.918e-18  2.761e-02   0.000    1.000

Residual standard error: 0.259 on 4 degrees of freedom
Multiple R-squared:  8.039e-32, Adjusted R-squared:  -0.25 
F-statistic: 3.215e-31 on 1 and 4 DF,  p-value: 1

data2 <- data1[data1$industrycodes==12,]

Second case results:

summary(lm(data1$roa~data1$employees, data=data2))
Call:
lm(formula = data1$roa ~ data1$employees, data = data2)

Residuals:
       1        2        3        4        5        6 
 0.01667 -0.18333  0.31667 -0.28333  0.21667 -0.08333 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      4.833e-01  3.742e-01   1.292    0.266
data1$employees -5.918e-18  2.761e-02   0.000    1.000

Residual standard error: 0.259 on 4 degrees of freedom
Multiple R-squared:  8.039e-32, Adjusted R-squared:  -0.25 
F-statistic: 3.215e-31 on 1 and 4 DF,  p-value: 1

If you want to loop across all conditions you could add new columns. For example if you have two conditions:

data1$cond1 <- data1$industrycodes==12
data1$cond2 <- data1$industrycodes<=12

You can then use the loop:

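for (i in 5:6) {
  # columns 5 and 6 are cond1 and cond2; bare column names in the formula
  # make lm() take roa and employees from the subset passed as data
  print(summary(lm(roa ~ employees, data = subset(data1, data1[, i]))))
}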
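As an alternative sketch, lm() also has its own subset argument, and when the conditions are simply the distinct industry codes the data can be split and fitted group by group. The data frame is retyped from the toy example, and fit12 and fits are illustrative names, not part of the original answer.

```r
# Toy data from the question, with decimal points instead of commas
data1 <- data.frame(
  ID            = 1:6,
  roa           = c(0.5, 0.3, 0.8, 0.2, 0.7, 0.4),
  employees     = c(10, 20, 15, 12, 13, 8),
  industrycodes = c(12, 11, 12, 12, 11, 12)
)

# The subset expression is evaluated inside `data`, so no $ is needed
fit12 <- lm(roa ~ employees, data = data1, subset = industrycodes == 12)

# One regression per industry code: split() returns a named list of
# data frames, and lapply() fits the same model on each of them
fits <- lapply(split(data1, data1$industrycodes),
               function(d) lm(roa ~ employees, data = d))

sapply(fits, nobs)   # rows used per industry code
coef(fits[["12"]])   # same coefficients as coef(fit12)
```

With only two rows for code 11 in the toy data, that fit is exact and its summary is not meaningful; the point here is only the mechanics of subsetting.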

Source: https://stackoverflow.com/questions/52856152/linear-regression-with-conditional-statement-in-r

Tags

conditional

regression

subset