Trying to use user-defined function to populate new column in dataframe. What is going wrong?

扶醉桌前 submitted on 2019-12-06 03:47:16

R is vectorized (that is, not row-by-row): it does not call the function repeatedly, once for each value of the arguments, but passes the entire vector at once and operates on all of it in a single call. EmployeeLocationNumber returns only a single value, so that value gets repeated for the entire data set.
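A minimal sketch of that point, using a hypothetical helper `describe_input` that is not part of the question: when a whole column is passed, the function body runs once and sees every value together.

```r
# Hypothetical helper: reports how many values it received in one call.
describe_input <- function(x) {
  length(x)
}

# Location column, copied from the example data in this thread.
locations <- c(1, 5, 6, 7, 10, 3, 4, 2, 8)

# One call, one argument of length 9; the function is not run once per row.
n_seen <- describe_input(locations)
print(n_seen)

# A vectorized operation returns one result per element from a single call.
doubled <- locations * 2
print(length(doubled))
```

A function that should fill a column therefore has to return one value per element of its input.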

Also, your example for EmployeeLocationNumber does not match your description.

> EmployeeLocationNumber(8)
[1] 3

Now, one way to vectorize a function in the manner you are thinking of (repeated calls, one per value) is to pass it through Vectorize():

TestDF$ELN<-Vectorize(EmployeeLocationNumber)(TestDF$Location)

which gives

> TestDF
  Employee Month Location ELN
1        1     1        1   1
2        1     5        5   2
3        1     6        6   3
4        1    11        7   4
5        2     4       10   1
6        2    10        3   2
7        3     1        4   1
8        3     5        2   2
9        3    10        8   3
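To illustrate what Vectorize() does, here is a sketch with a hypothetical scalar-only function, `size_class` (the original EmployeeLocationNumber is not shown in this thread, so it is not reproduced here). Vectorize() builds a wrapper that calls the function once per element and collects the results, much like an explicit sapply().

```r
# Hypothetical scalar-only function: `if` needs a single TRUE/FALSE,
# so this is only meaningful for one value at a time.
size_class <- function(x) {
  if (x > 5) "high" else "low"
}

# Vectorize() returns a wrapper that calls size_class once per element
# and combines the individual results into one vector.
size_class_v <- Vectorize(size_class)

locations <- c(1, 5, 6, 7, 10, 3, 4, 2, 8)
classes <- size_class_v(locations)
print(classes)

# sapply() makes the same element-by-element calls explicitly.
classes_sapply <- sapply(locations, size_class)
print(identical(unname(classes), unname(classes_sapply)))
```

This is convenient, but it is still one function call per element, so a truly vectorized rewrite is usually faster on large data.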

As to your other questions, I would just write it as

TestDF$ELN<-ave(TestDF$Month, TestDF$Employee, FUN=rank)

The logic is: take the months, group them by employee, and within each group return the rank of each month (where it falls in order).
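A runnable sketch of that grouping logic. The data frame is reconstructed from the TestDF printed earlier in this thread, so the exact column types are an assumption.

```r
# Data reconstructed from the printed TestDF in this thread.
TestDF <- data.frame(
  Employee = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
  Month    = c(1, 5, 6, 11, 4, 10, 1, 5, 10),
  Location = c(1, 5, 6, 7, 10, 3, 4, 2, 8)
)

# rank() is applied to the Month values of each employee separately;
# ave() puts each result back in the row it came from.
TestDF$ELN <- ave(TestDF$Month, TestDF$Employee, FUN = rank)
print(TestDF)

# The same per-group computation, spelled out for employee 3 only.
emp3_rank <- rank(TestDF$Month[TestDF$Employee == 3])
print(emp3_rank)
```

Because ave() returns a vector as long as its input, the result can be assigned straight to a new column.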

Using logical indexing, the condensed one-liner replacement for your function is:

EmployeeLocationNumber <- function(Site){
    with(TestDF[do.call(order, TestDF), ], which(Location[Employee==Employee[which(Location==Site)]] == Site))
}

Of course this isn't the most readable way, but it demonstrates the principles of logical indexing and which() in R. Then, like others have said, just wrap it up with a vectorized *ply function to apply this across your dataset.

A) TestDF$Location is a vector. Your function is not set up to return a vector, so giving it a vector will probably fail.

B) In what sense is Location:8 the "second location visited"?

C) If you want within-group ordering, then you need to pass your dataframe, split up by employee, to a function that calculates a result.

D) Conditional access of a data.frame typically involves logical indexing and/or the use of which().

If you just want the sequence of visits by employee try this: (Changed first argument to Month since that is what determines the sequence of locations)

 with(TestDF, ave(Month, Employee, FUN=seq_along))
[1] 1 2 3 4 1 2 1 2 3
 TestDF$LocOrder <-  with(TestDF, ave(Month, Employee, FUN=seq_along))

If you wanted the second location for EE:3 it would be:

subset(TestDF, LocOrder==2 & Employee==3, select= Location)
#   Location
# 8        2

Your EmployeeLocationNumber function takes a vector in and returns a single value. The assignment to create a new data.frame column therefore just gets a single value:

EmployeeLocationNumber(TestDF$Location) # returns 1

TestDF$ELN<-1 # Creates a new column with the single value 1 everywhere
  1. Assignment doesn't do any magic like that. It takes a value and puts it somewhere, in this case the value 1. If the value were a vector with the same length as the number of rows, it would work as you wanted.
  2. I'll get back to you on that :)
  3. Ditto.
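A small sketch of how column assignment treats values of different lengths. The data frame `df` and its column names are made up for illustration.

```r
# Hypothetical three-row data frame.
df <- data.frame(Employee = c(1, 1, 2))

# A length-1 value is recycled to fill every row.
df$one_value <- 1

# A vector with one element per row is stored as-is.
df$per_row <- c(10, 20, 30)

print(df)

# A length-2 value cannot fill 3 rows, so the assignment raises an error.
result <- tryCatch({
  df$bad <- c(1, 2)
  "assigned"
}, error = function(e) "error")
print(result)
```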

Update: I finally worked out some code to do it, but by then @DWin had a much better solution :(

TestDF$ELN <- unlist(lapply(split(TestDF, TestDF$Employee), function(x) rank(x$Month)))

...I guess the ave function does pretty much what the code above does. But for the record:

First I split the data.frame into sub-frames, one per employee. Then I rank the months (just in case your months are not in order). You could use order too, but rank can handle ties better. Finally I combine all the results into a vector and put it into the new column ELN.
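One caveat worth sketching: split() followed by unlist() returns the results grouped by employee, so it only lines up with the rows when the data frame is already grouped by employee, as TestDF is. The data below is hypothetical and deliberately interleaved to show the difference; unsplit() and ave() are base R alternatives that restore the original row order.

```r
# Hypothetical data where the rows are NOT grouped by employee.
df <- data.frame(
  Employee = c(1, 2, 1, 2),
  Month    = c(1, 4, 5, 10)
)

# split() + unlist() returns results in group order (employee 1, 1, 2, 2),
# which no longer lines up with the original rows.
by_group <- unlist(lapply(split(df, df$Employee), function(x) rank(x$Month)))
print(unname(by_group))

# unsplit() puts each group's result back in its original row position.
aligned <- unsplit(lapply(split(df$Month, df$Employee), rank), df$Employee)
print(aligned)

# ave() gives the same row-aligned answer in one call.
via_ave <- ave(df$Month, df$Employee, FUN = rank)
print(via_ave)
```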

Update again: regarding question 2, "What is the best way to reference a value in a dataframe?":

This depends a bit on the specific problem, but if you have a value, say Employee=3, and want to find all rows in the data.frame that match it, then simply:

TestDF$Employee == 3 # Returns logical vector with TRUE for all rows with Employee == 3
which(TestDF$Employee == 3) # Returns a vector of indices instead
TestDF[which(TestDF$Employee == 3), ] # Subsets the data.frame on Employee == 3
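The logical vector can also be used as the index directly; the practical difference from which() shows up when the column contains NA. A short sketch with made-up vectors `employee` and `value`:

```r
# Hypothetical vectors; the third employee id is missing.
employee <- c(1, 3, NA, 3)
value    <- c("a", "b", "c", "d")

# Logical indexing: an NA in the condition yields an NA in the result.
by_logical <- value[employee == 3]
print(by_logical)

# which() keeps only the positions that are TRUE, so the NA is dropped.
by_which <- value[which(employee == 3)]
print(by_which)
```

With no missing values the two forms select the same rows, so the choice mostly matters for data that may contain NA.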