Group together levels with similar names R

既然无缘 2020-12-19 15:23

I have a factor variable q with many levels. Some of the levels actually refer to the same thing but were recorded with different spellings.

> length(q)
[1] 13490
> levels(q)
  [1] \"         


        
2 Answers
  • 2020-12-19 15:38

    You can use the function agrep, which searches for approximate matches. It uses the generalized Levenshtein edit distance, and you can set the maximum distance allowed for a match with the argument max.distance.
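
    As a small illustration of how max.distance changes the result, here is a sketch on a toy vector. The vector name spellings and the result names are made up for this example; the strings themselves come from the question. An integer value is read as a maximum number of edits, while a value below 1 is read as a fraction of the pattern length.

    ```r
    # Toy vector (hypothetical name), using spellings from the question.
    spellings <- c("CERAZETTE", "cerazette", "CERACETT", "CEVAZETTE", "Mirena")

    # Allow at most one edit (insertion, deletion or substitution).
    one_edit <- agrep("CERAZETTE", spellings, max.distance = 1,
                      ignore.case = TRUE, value = TRUE)

    # Allow at most two edits.
    two_edits <- agrep("CERAZETTE", spellings, max.distance = 2,
                       ignore.case = TRUE, value = TRUE)

    print(one_edit)
    print(two_edits)
    ```

    With one edit allowed, "CEVAZETTE" is picked up but "CERACETT" is not, because it needs a substitution and a deletion; allowing two edits includes it as well.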

    Taking this vector (the one that you posted except the empty string "" and "KOMMER EJ IH\xc5G"):

    x <- c("Activelle", "CERACETTE", "cerazette", "CERAZETTE", "CEVAZETTE", 
    "Cilest", "Conludag", "DEPO...", "Depo. Pro Vera", "DEPO PROVERA", 
    "DEPROVERA", "desorelle", "Diane mite", "ENDEVINA", "ETHISYLESTRA,LEVONORGESTR", 
    "EXCLUTENA", "EXLUENTA 0,5MG", "Femanest", "gastonette", "hormon", 
    "IMPLANON", "LEMINOVA", "LENONOVA", "lenova", "LENOVA", "Leonova", 
    "LEVENOVA", "Levinova", "LEVIONOVA", "levonova", "LEVONOVA", 
    "Levonova lykkja", "lindynette", "loette", "malonetta", "Meniva", 
    "Mereilom", "Microgyn", "Microgynon", "Milvane", "MINI P", "Mini-pe", 
    "MINIRA", "minulet", "minulet p-piller", "Mircne", "Mirena", 
    "mirena levonorge", "Modina p-piller", "NEOULETTA", "NORLEVO", 
    "Novaring", "Novynette", "Nuva ring", "Østradiol dlf 2", "P-plaster", 
    "RESTOVAR", "Spiral", "T-GYN", "TRIMORDIOL", "TRIONETTA 28", 
    "T-spiral", "VET EJ", "yasmin", "YASMINELL", "Yasminelle", "ZYRONA", 
    "CERACETT", "CERASETTE", "Cerazette", "CERAZETTI", "cilest", 
    "Cileste", "COPALETTA?", "Depo-Provera", "DEPOPROVERA", "depoprovin", 
    "DESOLETT", "Diane", "Divana", "Estradot", "EXKLUTENA", "EXLUTENA", 
    "femenest", "Harmonet", "Hormonspiral", "INPLANON", "LEBONOVA", 
    "lemonora", "LENOR", "Lenova", "LENOVA?", "Levanova", "LEVINA", 
    "LEVINOVA", "Levnova", "Levonova", "Levonova hormonspiral", "Lindinette", 
    "Lindynette", "lyndynette", "Marvelon", "Mercilon", "merivan", 
    "microgynon", "Mikrogyn", "MINERVA/LEVONORG.", "MINI-P", "mini-pl", 
    "MINNS EJ", "Minulet", "MIRANDA", "mirena", "MIRENA", "MIRENA LEVONORGESTREL", 
    "Mod turner: milv", "NEOVLETTA", "novynette", "nuva ring", "NUVARING", 
    "Østradiolgel", "PROVERA", "spiral", "Synfase", "triminetta sando", 
    "TRINOVUM", "TRIREGOL", "Vagifem", "yas, bayer", "Yasmin", "yasminelle")
    

    You can do:

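    groups <- list()
    i <- 1
    # Repeat until every name has been assigned to a group.
    while(length(x) > 0)
    {
      # Indices of all remaining names that approximately match the first one.
      id <- agrep(x[1], x, ignore.case = TRUE, max.distance = 0.1)
      groups[[i]] <- x[id]
      # Drop the grouped names so they are not matched again.
      x <- x[-id]
      i <- i + 1
    }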
    

    The first groups are defined as follows:

    head(groups)
    [[1]]
    [1] "Activelle"
    
    [[2]]
    [1] "CERACETTE" "cerazette" "CERAZETTE" "CERACETT"  "CERASETTE" "Cerazette"
    
    [[3]]
    [1] "CEVAZETTE"
    
    [[4]]
    [1] "Cilest"  "cilest"  "Cileste"
    
    [[5]]
    [1] "Conludag"
    
    [[6]]
    [1] "DEPO..."
    

    Be aware that the above code removes the elements in x. When the loop is finished the vector x will be empty.
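
    To get from such groups back to a factor with merged levels, one possible sketch is shown here. The helper group_similar, the toy factor q and the choice of the first spelling in each group as its label are assumptions for illustration, not part of the original code. It works on a copy, so the input is left untouched, and it assumes there are no empty strings among the levels.

    ```r
    # Hypothetical helper: greedy grouping as in the loop above, but on a copy.
    # Returns a named character vector: names are spellings, values are labels.
    group_similar <- function(values, max.distance = 0.1) {
      remaining <- values
      lookup <- character(0)
      while (length(remaining) > 0) {
        id <- agrep(remaining[1], remaining, ignore.case = TRUE,
                    max.distance = max.distance)
        id <- union(1L, id)  # always consume the seed so the loop terminates
        # Label every member of the group with the group's first spelling.
        lookup[remaining[id]] <- remaining[1]
        remaining <- remaining[-id]
      }
      lookup
    }

    # Toy factor; levels are fixed in order of appearance to keep the result
    # independent of the locale's sort order.
    vals <- c("CERACETTE", "cerazette", "Cilest", "cilest", "Mirena", "CERAZETTE")
    q <- factor(vals, levels = unique(vals))

    lookup <- group_similar(levels(q))
    # Assigning duplicated level names merges the corresponding levels.
    levels(q) <- unname(lookup[levels(q)])
    print(levels(q))
    ```

    The grouping is greedy and depends on the order of the levels, so it is worth inspecting the lookup vector by hand before overwriting the real data.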

  • 2020-12-19 15:49

    One solution could be to use grep and/or grepl:

    x <- c("toto", "CERACETT","CERASETTE","Cerazette","CERAZETTE","CEVAZETTE", "youpi")
    # List the values that match the misspelling pattern.
    grep("ce[vr]a[csz]ette?", x, ignore.case = TRUE, value = TRUE)
    # Overwrite every matching value with a single name.
    x[grepl("ce[vr]a[csz]ette?", x, ignore.case = TRUE)] <- "replacement_string"
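
    Since the question is about a factor, the same idea can be applied to its levels instead of the values. This is only a sketch: the toy factor q, the target spelling "Cerazette" and the slightly widened pattern (it also accepts a "c" in the middle and a missing final "e") are assumptions for illustration.

    ```r
    # Toy factor standing in for the real variable.
    q <- factor(c("toto", "CERACETT", "CERASETTE", "Cerazette", "CEVAZETTE", "youpi"))

    # Assumed pattern covering the known misspellings.
    pattern <- "ce[vr]a[csz]ette?"

    # Find the levels that match, ignoring case.
    hits <- grepl(pattern, levels(q), ignore.case = TRUE)

    # Renaming several levels to the same name merges them into one level.
    levels(q)[hits] <- "Cerazette"
    print(table(q))
    ```

    Working on levels(q) rather than on the values keeps q a factor and touches each distinct spelling only once, which matters little here but helps with 13490 observations.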
    