What can R do about a messy data format?

时光说笑 2020-11-28 08:27

Sometimes I see data posted in a Stack Overflow question formatted like in this question. This is not the first time, so I have decided to ask a question about it, and

6 Answers
  •  天涯浪人
    2020-11-28 09:19

    Using data.table::fread:

    x = '
    +------------+------+------+----------+--------------------------+
    |    Date    | Emp1 | Case | Priority | PriorityCountinLast7days |
    +------------+------+------+----------+--------------------------+
    | 2018-06-01 | A    | A1   |        0 |                        0 |
    | 2018-06-03 | A    | A2   |        0 |                        1 |
    | 2018-06-03 | A    | A3   |        0 |                        2 |
    | 2018-06-03 | A    | A4   |        1 |                        1 |
    | 2018-06-03 | A    | A5   |        2 |                        1 |
    | 2018-06-04 | A    | A6   |        0 |                        3 |
    | 2018-06-01 | B    | B1   |        0 |                        1 |
    | 2018-06-02 | B    | B2   |        0 |                        2 |
    | 2018-06-03 | B    | B3   |        0 |                        3 |
    +------------+------+------+----------+--------------------------+
    '
    
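    library(data.table)

    # Strip the horizontal rule lines ("+-----+..."), then let fread split on "|".
    # Columns 1 and 7 are empty artifacts of the leading and trailing "|".
    fread(gsub('\\+.+\\n', '', x, perl = TRUE), drop = c(1, 7))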
    
    #          Date Emp1 Case Priority PriorityCountinLast7days
    # 1: 2018-06-01    A   A1        0                        0
    # 2: 2018-06-03    A   A2        0                        1
    # 3: 2018-06-03    A   A3        0                        2
    # 4: 2018-06-03    A   A4        1                        1
    # 5: 2018-06-03    A   A5        2                        1
    # 6: 2018-06-04    A   A6        0                        3
    # 7: 2018-06-01    B   B1        0                        1
    # 8: 2018-06-02    B   B2        0                        2
    # 9: 2018-06-03    B   B3        0                        3
    

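    The gsub call deletes the horizontal rule lines, i.e. every line that starts with "+". Because each remaining row begins and ends with a "|" delimiter, fread sees an empty first and last column; drop = c(1, 7) discards those two.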
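
    If data.table is not available, the same two steps (drop the rule lines, ignore the outer delimiters) can be sketched in base R. This is not part of the original answer: the helper name read_ascii_table is hypothetical, and the sketch assumes every header and data row starts and ends with "|" and that no cell contains a "|" itself. The sample uses only the first three rows of the table above.

    ```r
    # Hypothetical helper (an illustration, not the answerer's code):
    # parse an ASCII-art table using base R only.
    read_ascii_table <- function(txt) {
      lines <- trimws(strsplit(txt, "\n", fixed = TRUE)[[1]])
      # Keep only header/data rows, which start with "|";
      # this drops the "+----+" rule lines and any blank lines.
      lines <- lines[startsWith(lines, "|")]
      # Remove the leading and trailing "|" so no empty columns are created.
      lines <- sub("^\\|", "", sub("\\|$", "", lines))
      read.table(text = lines, sep = "|", header = TRUE,
                 strip.white = TRUE, stringsAsFactors = FALSE)
    }

    x <- '
    +------------+------+------+----------+--------------------------+
    |    Date    | Emp1 | Case | Priority | PriorityCountinLast7days |
    +------------+------+------+----------+--------------------------+
    | 2018-06-01 | A    | A1   |        0 |                        0 |
    | 2018-06-03 | A    | A2   |        0 |                        1 |
    | 2018-06-03 | A    | A3   |        0 |                        2 |
    +------------+------+------+----------+--------------------------+
    '

    df <- read_ascii_table(x)
    str(df)
    ```

    Stripping the outer delimiters before parsing avoids hard-coding column positions such as 1 and 7, so the sketch should carry over to tables with a different number of columns, as long as the layout assumptions above hold.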
