Deleting specific rows from a data frame

后端 未结 3 1989
北荒
北荒 2020-12-03 12:49

I am working with some US govt data which has a lengthy list of cities and zip codes. After some work, the data is in the following format.

dat1 = data.frame         


        
相关标签:
3条回答
  • 2020-12-03 13:16

    It helps to store the data as characters, not factors:

    dat2 <- data.frame(keyword=c("Bremen", "Brent", "50143", "Chelsea, AL", 
                                 "Bailytown, Alabama", "52348", "54023", "54024"),   
                       tag=c(rep("AlabamCity",2), rep("AlabamaCityST",2), 
                             rep("AlabamaCityState",2), rep("AlabamaZipCode",2)),
                       stringsAsFactors = FALSE) ## note this bit
    

    Now we can convert keyword to numeric, and if it isn't a number in character format, we get an NA:

    want <- with(dat2, as.numeric(keyword))
    

    which gives us this:

    > (want <- with(dat2, as.numeric(keyword)))
    [1]    NA    NA 50143    NA    NA 52348 54023 54024
    Warning message:
    In eval(expr, envir, enclos) : NAs introduced by coercion
    

    You can ignore the warning or suppress it, but don't use this casually as it can mask problems:

    suppressWarnings(want <- with(dat2, as.numeric(keyword)))
    

    The final step is to select the elements of want that are not NA and have keyword equal to "AlabamaZipCode", which we do using &:

    (!is.na(want) & (dat2$tag != "AlabamaZipCode"))
    

    That selects the rows we don't want, so we need to negate the above, turning TRUE to FALSE and vice versa:

    !(!is.na(want) & (dat2$tag != "AlabamaZipCode"))
    

    Putting this together we have:

    dat2[!(!is.na(want) & (dat2$tag != "AlabamaZipCode")), ]
    

    which gives:

    > dat2[!(!is.na(want) & (dat2$tag != "AlabamaZipCode")), ]
                 keyword              tag
    1             Bremen       AlabamCity
    2              Brent       AlabamCity
    4        Chelsea, AL    AlabamaCityST
    5 Bailytown, Alabama AlabamaCityState
    7              54023   AlabamaZipCode
    8              54024   AlabamaZipCode
    

    Full solution is:

    want <- with(dat2, as.numeric(keyword))
    dat2[!(!is.na(want) & (dat2$tag != "AlabamaZipCode")), ]
    
    0 讨论(0)
  • 2020-12-03 13:17

    Here's one (slightly convoluted) approach that you may consider. First, create an identity column for each row. This will help with the subsetting. Secondly, create a vector of those id's that match your criteria. Finally, subset those id's out of your final data.

    The data you posted defaults to factors, not character data so I've accounted for that. If that's different than your actual data, you'll have to adjust accordingly. Also, when I converted the data to numeric, NAs are generated. A warning message is generated, but we can ignore that for this bit.

    #Generate an ID column
    dat4$id <- 1:nrow(dat4)
    
    #Create a vector of the id's that match your criteria'
    outliers <- dat4[as.character(dat4$tag) != "AlabamaZipCode" & !(is.na(as.numeric(as.character(dat4$keyword)))) , "id"]
    
    subset(dat4, !(id %in% outliers), select = 1:2)
                 keyword              tag
    1             Bremen       AlabamCity
    2              Brent       AlabamCity
    4        Chelsea, AL    AlabamaCityST
    5 Bailytown, Alabama AlabamaCityState
    7              54023   AlabamaZipCode
    8              54024   AlabamaZipCode
    

    Actually, you can shorten all of this to the following and avoid generating the id.

    dat4[!(as.character(dat4$tag) != "AlabamaZipCode" & !(is.na(as.numeric(as.character(dat4$keyword))))) , ]
    
    0 讨论(0)
  • 2020-12-03 13:20

    I think two grepl expressions should do the trick:

    > dat2[ !( grepl("City", dat2$tag) &  grepl("^\\d", dat2$keyword) ) , ]
                 keyword              tag
    1             Bremen       AlabamCity
    2              Brent       AlabamCity
    4        Chelsea, AL    AlabamaCityST
    5 Bailytown, Alabama AlabamaCityState
    7              54023   AlabamaZipCode
    8              54024   AlabamaZipCode
    

    You are eliminating the rows where there are digits in keyword and "City" in tag

    0 讨论(0)
提交回复
热议问题