In R, what exactly is the problem with having variables with the same name as base R functions?

深忆病人 2020-11-30 03:48

It seems to be generally considered poor programming practice to give variables the same names as functions in base R.

For example, it is tempting to

7 answers
  • 2020-11-30 04:09

    I agree with @Gavin Simpson and @Nick Sabbe that there is not really a problem, and that this is more a question of code readability. Hence, as with many things in life, it is a question of convention and consensus.

    And I think it is a good convention to give the general advice: Do not name your variables like base R functions!

    This advice works like other good advice. For example, we all know that we should not drink too much booze or eat too much unhealthy food, but from time to time we fail to follow that advice and get drunk while eating too much junk food.

    The same is true here. It obviously makes sense to name the data argument data. It makes a lot less sense to name a data vector mean, although there may be situations in which even this seems appropriate. Try to avoid those situations for the sake of clarity.

  • 2020-11-30 04:17

    There isn't really one. When a name is used in a function call, R will not normally consider non-function objects while looking for the function:

    > mean(1:10)
    [1] 5.5
    > mean <- 1
    > mean(1:10)
    [1] 5.5
    > rm(mean)
    > mean(1:10)
    [1] 5.5
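
    The lookup rule behind the transcript above can be sketched as follows. This is a minimal illustration assuming a fresh session; value_lookup and call_lookup are hypothetical names introduced only to hold the two results.

    ```r
    mean <- 1                          # a non-function object that shares its name with base::mean

    value_lookup <- mean               # used as a value: the variable is found
    call_lookup  <- mean(1:10)         # used in call position: non-function bindings are skipped

    value_lookup
    call_lookup
    base::mean(1:10)                   # explicit namespace lookup, unaffected by the variable

    # get() with mode = "function" performs the same function-only search
    is.function(get("mean", mode = "function"))

    rm(mean)                           # remove the variable again
    ```

    Qualifying the call as base::mean() is one way to make the intent explicit when a clash cannot be avoided.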
    

    The examples shown by @Joris and @Sacha are where poor coding catches you out. One better way to write foo is:

    foo <- function(x, fun) {
        fun <- match.fun(fun)
        fun(x)
    }
    

    Which when used gives:

    > foo(1:10, mean)
    [1] 5.5
    > mean <- 1
    > foo(1:10, mean)
    [1] 5.5
    

    There are situations where this will catch you out, and @Joris's example with na.omit is one, which IIRC, is happening because of the standard, non-standard evaluation used in lm().

    Several Answers have also conflated the T vs TRUE issue with the masking of functions issue. As T and TRUE are not functions that is a little outside the scope of @Andrie's Question.

  • 2020-11-30 04:19

    The answer is simple. Well, kind of.

    The bottom line is that you should avoid confusion. Technically there is no reason to give your variables proper names, but it makes your code easier to read.

    Imagine a line of code containing something like data()[1] (this line probably doesn't make sense, but it's only an example): although it is clear to you now that you're calling the function data here, a reader who has noticed a data.frame named data nearby may be confused.

    And if you're not altruistically inclined, remember that the reader could be you in half a year, trying to figure out what you were doing with 'that old code'.

    Take it from a man who has learned to use long variable names and naming conventions: it pays back!

  • 2020-11-30 04:20

    The problem is not so much the computer as the user. In general, code can become a lot harder to debug. Typos are made very easily, so if you do:

    c <- c("Some text", "Second", "Third")
    c[3]
    c(3)
    

    You get the correct results. But if you slip somewhere in your code and type c(3) instead of c[3], no error is raised, so finding the mistake will not be that easy.

    The scoping can also lead to very confusing error reports. Take the following flawed function:

    my.foo <- function(x){
        if(x) c <- 1
        c + 1
    }
    
    > my.foo(TRUE)
    [1] 2
    > my.foo(FALSE)
    Error in c + 1 : non-numeric argument to binary operator
    

    With more complex functions, this can send you down a debugging trail that leads nowhere. If you replace c in the above function with a name that is not defined anywhere else, say y, the error will read "object 'y' not found". That will lead you to your coding error a lot faster.
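
    As a rough check of that claim, here is the same flawed function with the local variable renamed. The names my.foo2, result and msg are hypothetical and chosen only because nothing else defines them.

    ```r
    # Same flaw as my.foo, but the local variable no longer shares a name with a function
    my.foo2 <- function(x) {
        if (x) result <- 1
        result + 1
    }

    my.foo2(TRUE)

    # Capture the error message instead of stopping the script
    msg <- tryCatch(my.foo2(FALSE), error = function(e) conditionMessage(e))
    msg
    ```

    The message now names the missing object, which points straight at the branch that failed to assign it.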

    Next to that, it can lead to rather confusing code. Code like c(c+c(a,b,c)) asks more from the brain than c(d+c(a,b,d)). Again, this is a trivial example, but it can make a difference.

    And obviously, you can get errors too. When you expect a function, you won't get it, which can give rise to another set of annoying bugs (here c is still the character vector assigned above):

    my.foo <- function(x,fun) fun(x)
    my.foo(1,sum)
    [1] 1
    my.foo(1,c)
    Error in my.foo(1, c) : could not find function "fun"
    

    A more realistic (and real-life) example of how this can cause trouble:

    x <- c(1:10,NA)
    y <- c(NA,1:10)
    lm(x~y,na.action=na.omit)
    # ... correct output ...
    na.omit <- TRUE
    lm(x~y,na.action=na.omit)
    Error in model.frame.default(formula = x ~ y, na.action = na.omit, 
    drop.unused.levels = TRUE) : attempt to apply non-function
    

    Try figuring out what's wrong here if na.omit <- TRUE occurs 50 lines up in your code...
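
    One possible defence, sketched here on the same data, is to qualify the function with its namespace. The sketch assumes that na.action is evaluated like any other argument value, which is why the masking variable gets picked up; fit is a hypothetical name for the result.

    ```r
    x <- c(1:10, NA)
    y <- c(NA, 1:10)
    na.omit <- TRUE                      # the accidental masking from the example above

    # Used as a value, the name resolves to the variable, not the function
    is.function(na.omit)                 # FALSE
    is.function(stats::na.omit)          # TRUE

    # Qualifying the name bypasses the masking variable
    fit <- lm(x ~ y, na.action = stats::na.omit)
    nobs(fit)                            # number of complete cases used in the fit

    rm(na.omit)                          # remove the variable so the bare name finds the function again
    ```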

    Answer edited after comment of @Andrie to include the example of confusing error reports

  • 2020-11-30 04:26

    I think the problem arises when people reuse these function names in the global environment, which can cause frustration through unexpected errors you should not be getting. Imagine you just ran a reproducible example (maybe a pretty lengthy one) that overwrote one of the functions used in your simulation, which takes ages to get to where you want it, and then it suddenly breaks down with a funny error. Variables that reuse existing function names inside a closed environment (like a function) are removed once the function exits and should not cause harm, assuming the programmer is aware of all the consequences of such behavior.
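
    A small sketch of the local case, assuming a session in which mean has not been reassigned globally; average_of is a hypothetical helper used only for illustration.

    ```r
    average_of <- function(values) {
        mean <- sum(values) / length(values)   # local variable named like base::mean
        mean                                   # returns the local value
    }

    average_of(1:10)

    # The local binding disappeared when the function returned
    is.function(mean)
    mean(1:10)
    ```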

  • 2020-11-30 04:28

    R is very robust to this, but you can think of ways to break it. For example, consider this function:

    foo <- function(x,fun) fun(x)
    

    This simply applies fun to x. It is not the prettiest way to do this, but you might encounter it in someone's script. It works for mean():

    > foo(1:10,mean)
    [1] 5.5
    

    But if I assign a new value to mean it breaks:

    mean <- 1
    foo(1:10,mean)
    
    Error in foo(1:10, mean) : could not find function "fun"
    

    This will happen very rarely, but it might happen. It is also very confusing for people if the same name means two things:

    mean(mean)
    

    Since it is trivial to use any other name you want, why not use a different name than the base R functions? For some R objects this becomes even more important. Think of reassigning the '+' function! Another good example is the reassignment of T and F, which can break so many scripts.
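
    To illustrate the T and F point, here is a minimal sketch, assuming a session in which T has not been reassigned before; masked_sum and reserved_error are hypothetical names.

    ```r
    T <- 0                                     # legal: T is an ordinary variable, not a reserved word

    isTRUE(T)                                  # T no longer stands for TRUE
    masked_sum <- sum(c(1, NA), na.rm = T)     # na.rm now receives 0, so the NA is not removed
    masked_sum

    # TRUE itself is a reserved word, so the same assignment is rejected
    reserved_error <- tryCatch(
        eval(parse(text = "TRUE <- 0")),
        error = function(e) conditionMessage(e)
    )
    reserved_error

    rm(T)                                      # remove the variable so T resolves to TRUE again
    ```

    Writing TRUE and FALSE in full avoids the problem, since those cannot be reassigned.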
