replace NA in an R vector with adjacent values

不问归期 提交于 2020-01-07 08:24:13

问题


I have a dataframe which has merged player and team data for soccer seasons So for a particular player in a specific season I have data like

df <- data.frame(team=c(NA,"CRP",NA,"CRP","CRP",NA),
             player=c(NA,"Ed",NA,"Ed","Ed",NA),
             playerGame= c(NA,1,NA,2,3,NA),
             teamGame =c(1,2,3,4,5,6)) 

Where the NA's indicate that the player did not appear in that specific team game

How would I most efficiently replace the team and player NA's with "CRP" and "Ed" respectively and have a plGame output of, in this instance, 0,1,1,2,3,3


EDIT

Sorry, I wrote this when I woke up in the middle of the night and may have over-simplified my problem too much. Only one person seems to have picked up on the fact that this is a subset of a much larger set of data and even he/she did not follow it though that a straight hardcode replacement of player and team was insufficient Thanks for the replies. Dsee's hint for the na.locf in the zoo package and the first line of AK's answer appears to offer the best way forward

df$playerGame[df$teamGame == min(df$teamGame) & is.na(df$playerGame) == TRUE] <- 0
na.locf(df$playerGame)

This covers the eventuality of more than one NA to start the sequence. In my case the min(df$teamGame) will always be 1 so hardcoding that may speed things up

A more realistic example is here

library(zoo)
library(plyr)

newdf <- data.frame(team=c("CRP","CRP","CRP","CRP","CRP","CRP","TOT","TOT","TOT"),
             player=c(NA,"Ed",NA,"Bill","Bill",NA,NA,NA,"Tom"),
             playerGame= c(NA,1,NA,1,2,NA,NA,NA,1),
             teamGame =c(1,2,3,1,2,3,1,2,3))

I can now show the team for every row Each team plays three games in a season. Ed and Bill, play for CRP and appear in games 2 and 1,2 respectively. Tom plays for TOT in game 3 only. Assume that player names are unique(even in real world data)

It seems to me that I need to create another column, 'playerTeam'

newdf$playerTeam <- 0

for (i in 1:nrow(newdf)) {
newdf$playerTeam[i] <-ceiling(i/3)
}

I can then use this value to fill in the player gaps. I have used the sort functiom which omits NA

newdf <- ddply(newdf,.(playerTeam),transform,player=sort(player)[1])

I can then use the aforementioned code

newdf$playerGame[newdf$teamGame == 1 & is.na(newdf$playerGame) == TRUE] <- 0
newdf$playerGame <- na.locf(newdf$playerGame)

   team player playerGame teamGame playerTeam
1  CRP     Ed          0        1          1
2  CRP     Ed          1        2          1
3  CRP     Ed          1        3          1
4  CRP   Bill          1        1          2
5  CRP   Bill          2        2          2
6  CRP   Bill          2        3          2
7  TOT    Tom          0        1          3
8  TOT    Tom          0        2          3
9  TOT    Tom          1        3          3

I will need to build in season as well but that should not be a problem

Am I missing anything here?

I have several hundred thousand rows to process so any speed ups would be helpful. For instance I would probably want to avoid ddply and use a data.table approach or another apply function, right


回答1:


There seem to be 2 parts to what you want:

  1. You want to replace the player name & team with pre-determined values
  2. You wish to carry forward the counts of the games through the list of playerGame

For (1), you could do:

df$team[is.na(df$team)] <- 'CRP' 

Similarly you can alter the other component of the dataframe

For (2) you could do this:

if(is.na(df$playerGame[1])) {
    df$playerGame[1] <- 0
}
for(i in 2:length(df$playerGame)) { 
    if(is.na(x[i])) {
        df$playerGame[i] <- df$playerGame[i-1]
    }
} 

then df$playerGame is:

[1] 0 1 1 2 3 3

Perhaps there is a very nifty way to do this, but this is clearly readable...




回答2:


to select the NA's, for say player, use

  df$player[is.na(df$player)]

Then to assign a value to these use

  df$player[is.na(df$player)]  <- "Ed"

If you just want to assign the entire player column the same name, you do not need to select any values:

  df$player[]  <-  "Ed"   # you can omit the brackets [], which are shown just for emphasis

you can then do the same for df$team


On a side note, when you create your data frame, if you plan to add values other than those already there, you will likely want to add stringsAsFactors=FALSE
 data.frame( . , stringsAsFactors=FALSE)



回答3:


replace NA's for team and player:

df$team[is.na(df$team)] <- "CRP"
df$player[is.na(df$player)] <- "Ed"

There is probably a more efficient way to get the adjacent values, but is works.

If the first or/and the last value is NA, as in your example, I had to use two additional lines:

df$playerGame[df$teamGame == min(df$teamGame) & is.na(df$playerGame) == TRUE] <- 0
df$playerGame[df$teamGame == max(df$teamGame) & is.na(df$playerGame) == TRUE] <- max(df$playerGame, na.rm = TRUE)

For all other observations, this get the adjacent values:

df$playerGame[is.na(df$playerGame) == TRUE] <- df$playerGame[-1]

df

team player playerGame teamGame
CRP     Ed          0        1
CRP     Ed          1        2
CRP     Ed          1        3
CRP     Ed          2        4
CRP     Ed          3        5
CRP     Ed          3        6

For more than one team and/or player I would suggest to combine it with ddply (plyr).



来源:https://stackoverflow.com/questions/13514266/replace-na-in-an-r-vector-with-adjacent-values

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!