What can R do about a messy data format?

时光说笑 2020-11-28 08:27

Sometimes I see data posted in a Stack Overflow question formatted like in this question. This is not the first time, so I have decided to ask a question about it, and

6 Answers
  •  长情又很酷
    2020-11-28 08:54

    ## read whitespace-separated tokens; border lines start with "+" and are skipped as comments
    md_table <- scan(text = "
    +------------+------+------+----------+--------------------------+
    |    Date    | Emp1 | Case | Priority | PriorityCountinLast7days |
    +------------+------+------+----------+--------------------------+
    | 2018-06-01 | A    | A1   |        0 |                        0 |
    | 2018-06-03 | A    | A2   |        0 |                        1 |
    | 2018-06-03 | A    | A3   |        0 |                        2 |
    | 2018-06-03 | A    | A4   |        1 |                        1 |
    | 2018-06-03 | A    | A5   |        2 |                        1 |
    | 2018-06-04 | A    | A6   |        0 |                        3 |
    | 2018-06-01 | B    | B1   |        0 |                        1 |
    | 2018-06-02 | B    | B2   |        0 |                        2 |
    | 2018-06-03 | B    | B3   |        0 |                        3 |
    +------------+------+------+----------+--------------------------+",
    what = "", sep = "", comment.char = "+", quiet = TRUE)
    
    ## the header row has 5 columns, so drop the "|" tokens and fill a 5-column matrix row by row
    mat <- matrix(md_table[md_table != "|"], ncol = 5, byrow = TRUE)
    #      [,1]         [,2]   [,3]   [,4]       [,5]                      
    # [1,] "Date"       "Emp1" "Case" "Priority" "PriorityCountinLast7days"
    # [2,] "2018-06-01" "A"    "A1"   "0"        "0"                       
    # [3,] "2018-06-03" "A"    "A2"   "0"        "1"                       
    # [4,] "2018-06-03" "A"    "A3"   "0"        "2"                       
    # [5,] "2018-06-03" "A"    "A4"   "1"        "1"                       
    # [6,] "2018-06-03" "A"    "A5"   "2"        "1"                       
    # [7,] "2018-06-04" "A"    "A6"   "0"        "3"                       
    # [8,] "2018-06-01" "B"    "B1"   "0"        "1"                       
    # [9,] "2018-06-02" "B"    "B2"   "0"        "2"                       
    #[10,] "2018-06-03" "B"    "B3"   "0"        "3"
    

    ## a data frame with all character columns
    dat <- setNames(data.frame(mat[-1, ], stringsAsFactors = FALSE), mat[1, ])
    #        Date Emp1 Case Priority PriorityCountinLast7days
    #1 2018-06-01    A   A1        0                        0
    #2 2018-06-03    A   A2        0                        1
    #3 2018-06-03    A   A3        0                        2
    #4 2018-06-03    A   A4        1                        1
    #5 2018-06-03    A   A5        2                        1
    #6 2018-06-04    A   A6        0                        3
    #7 2018-06-01    B   B1        0                        1
    #8 2018-06-02    B   B2        0                        2
    #9 2018-06-03    B   B3        0                        3
    

    ## convert every column to its natural type; as.is = TRUE keeps strings as character, not factor
    dat[] <- lapply(dat, type.convert, as.is = TRUE)
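
The steps above hardcode `ncol = 5` and rely on no cell containing a space or being empty, because `scan` splits on whitespace. As a rough sketch of a variant that avoids both assumptions, the rows can be split on `|` instead. The function name `read_ascii_table` and the variable `txt` are hypothetical names chosen for this illustration, and the function assumes every data row starts and ends with `|`.

```r
## Hypothetical helper: parse a "+---+" bordered ASCII table into a data frame.
read_ascii_table <- function(text) {
  lines <- trimws(strsplit(text, "\n", fixed = TRUE)[[1]])
  ## keep only rows that start with "|"; this drops borders and blank lines
  lines <- lines[startsWith(lines, "|")]
  ## strip the outer pipes, split on the inner ones, and trim each cell
  cells <- strsplit(gsub("^\\||\\|$", "", lines), "|", fixed = TRUE)
  cells <- lapply(cells, trimws)
  ## one row per line; the column count comes from the data itself
  mat <- do.call(rbind, cells)
  dat <- setNames(data.frame(mat[-1, , drop = FALSE], stringsAsFactors = FALSE),
                  mat[1, ])
  dat[] <- lapply(dat, type.convert, as.is = TRUE)
  dat
}

txt <- "
+------------+------+------+----------+
|    Date    | Emp1 | Case | Priority |
+------------+------+------+----------+
| 2018-06-01 | A    | A1   |        0 |
| 2018-06-03 | A    | A2   |        1 |
+------------+------+------+----------+"

dat2 <- read_ascii_table(txt)
str(dat2)
```

The `Date` column is still character after this; `dat2$Date <- as.Date(dat2$Date)` converts it if real dates are needed.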
    
