Force R to include 0 as a value in a regression of counts vs year

Question

Not sure whether this question would be better off at Cross Validated, but I think it is as much of a programming question as a pure statistical one.

I have a 102 x 1147 data frame in which each record is a scientific paper with a publication year (between 1960 and 2016). I count the number of papers published each year within certain topics (identified by values in specific columns), and I want to estimate the linear slope of the annual paper count against the year.

Here's my script, first the linear model, then the plot:

library(ggplot2)

# THEME 1 (POPABU)
# Count papers per year and POPABU value, then keep the POPABU == 1 rows
# (columns 1 and 3 are YR and Freq).
sub2 <- subset(as.data.frame(table(sysrev60[, c("YR", "POPABU")])),
               POPABU == 1, select = c(1, 3))
# table() turns YR into a factor; convert it back to numeric years
sub2$YR <- as.numeric(as.character(sub2$YR))

# Build a plotmath label with the fitted equation and r^2
lm_eqn <- function(df) {
  m <- lm(Freq ~ YR, data = df)  # use the argument rather than the global sub2
  eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,
                   list(a = format(coef(m)[1], digits = 2),
                        b = format(coef(m)[2], digits = 2),
                        r2 = format(summary(m)$r.squared, digits = 3)))
  as.character(as.expression(eq))
}

ggplot(sub2, aes(x = YR, y = Freq)) +
  scale_y_continuous(limits = c(0, 20), expand = c(0, 0)) +
  scale_x_continuous(breaks = seq(1960, 2015, by = 5)) +
  geom_bar(stat = "identity") +
  geom_text(x = 1960, y = 16, label = lm_eqn(sub2), size = 5, hjust = 0, parse = TRUE) +
  stat_smooth(method = "lm", col = "red") +
  xlab(" ") + ylab("No of papers") +
  annotate("text", x = 1960, y = 18, label = "THEME 1",
           family = "serif", size = 7, hjust = 0, color = "darkred")

My problem is that this procedure seems to calculate the linear relation only between the year and the counts > 0. In a number of years the count of papers is 0, and I need the regression to cover the same period (1960-2016) for all 25 topics I am studying. In other words, I need to force the regression to include a 0 for every year in which no papers were published.

I've made subsets of the large data frame, one for each topic whose publication rate I want to study. Here's the dput() output of my 'sub2' data frame:

dput(sub2)
structure(list(YR = c(1960, 1961, 1962, 1963, 1964, 1965, 1966, 
1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 
1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 
1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 
2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 
2011, 2012, 2013, 2014, 2015, 2016), Freq = c(0L, 0L, 0L, 0L, 
0L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 2L, 1L, 0L, 1L, 
3L, 0L, 1L, 0L, 2L, 0L, 3L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 2L, 
0L, 2L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 2L, 0L, 1L, 
1L, 1L, 2L, 3L, 5L)), .Names = c("YR", "Freq"), row.names = 58:114, class = "data.frame")

As you can see, there seem to be explicit 0's in my data frame, but the regression doesn't seem to care.

I have a feeling that this could be done by a small tweak of my script. How do I do that?

Answer 1:

What you have so far does take the zeros into account. We can double-check this by calculating the coefficients manually, in case you suspect lm() is doing something weird:

# Make sure zeros are there:
sub2$Freq
[1] 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 2 1 0 1 3 0 1 0 2 0 3 0 1 0 1 0 0 1 1 2 0 2
[39] 0 0 0 1 0 0 0 0 0 1 0 2 0 1 1 1 2 3 5
# Yep
X <- cbind(rep(1, nrow(sub2)), sub2$YR) # add a column of 1s for intercept
solve(t(X) %*% X) %*% t(X) %*% sub2$Freq # (X'X)^-1 X'Y -- OLS formula

            [,1]
[1,] -38.1778584
[2,]   0.0195748
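
The same check can also be run through lm() itself. The sketch below assumes sub2 is rebuilt from the dput() output in the question (the compact data.frame() call is only shorthand for that structure, and m is a name introduced here). It prints the fitted coefficients and counts how many rows, and how many zero rows, went into the fit.

```r
# Rebuild sub2 from the dput() output in the question (shorthand form).
sub2 <- data.frame(
  YR = 1960:2016,
  Freq = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 2, 1, 0, 1,
           3, 0, 1, 0, 2, 0, 3, 0, 1, 0, 1, 0, 0, 1, 1, 2, 0, 2, 0, 0,
           0, 1, 0, 0, 0, 0, 0, 1, 0, 2, 0, 1, 1, 1, 2, 3, 5)
)

m <- lm(Freq ~ YR, data = sub2)

coef(m)              # intercept and slope from lm()
nobs(m)              # number of rows used in the fit
sum(sub2$Freq == 0)  # number of zero-count years among those rows
```

nobs(m) equals nrow(sub2), so none of the zero-count years were dropped before fitting.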

Taking rounding into account, these coefficients match the equation displayed on the plot produced by your posted code.

When we use all the values, including the zeros, the intercept is about -38 and the year coefficient is about 0.02. So, there's absolutely nothing wrong there. What may be causing you to think that it's ignoring zeros is that there are no bars for the years where Freq is zero, but that's just because the plot is accurately reflecting the values -- when the height of the bar is zero, you will not be able to see a bar.
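
To see what ignoring the zeros would actually look like, the fit can be compared with one where the zero rows are deliberately dropped. This is only an illustrative sketch: sub2 is rebuilt from the dput() output in the question, and m_all and m_pos are names introduced here, not part of the original script.

```r
# Rebuild sub2 from the dput() output in the question (shorthand form).
sub2 <- data.frame(
  YR = 1960:2016,
  Freq = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 2, 1, 0, 1,
           3, 0, 1, 0, 2, 0, 3, 0, 1, 0, 1, 0, 0, 1, 1, 2, 0, 2, 0, 0,
           0, 1, 0, 0, 0, 0, 0, 1, 0, 2, 0, 1, 1, 1, 2, 3, 5)
)

m_all <- lm(Freq ~ YR, data = sub2)                    # every year, zeros included
m_pos <- lm(Freq ~ YR, data = subset(sub2, Freq > 0))  # zero-count years dropped

# Side-by-side coefficients and the number of rows behind each fit
rbind(all_years = coef(m_all), nonzero_only = coef(m_pos))
c(all_years = nobs(m_all), nonzero_only = nobs(m_pos))

# Average fitted value of each model
c(all_years = mean(fitted(m_all)), nonzero_only = mean(fitted(m_pos)))
```

An OLS line with an intercept passes through the point of means, so the first fit is anchored at the mean of all 57 counts (42/57, about 0.74), while the second is anchored at the mean of the 26 non-zero counts (42/26, about 1.62). The line on the plot corresponds to the first case.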

Source: https://stackoverflow.com/questions/47081712/force-r-to-include-0-as-a-value-in-a-regression-of-counts-vs-year

Tags

linear-regression