How to read data when some numbers contain commas as thousand separator?

情书的邮戳 2020-11-22 02:29

I have a csv file where some of the numerical values are expressed as strings with commas as thousand separator, e.g. "1,513" instead of 1513. Wh

11 Answers
  •  傲寒 (original poster)
    2020-11-22 02:45

    This question is several years old, but I stumbled upon it, which means maybe others will.

    The readr package has some nice features. One of them is a convenient way to interpret "messy" columns like these.

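    library(readr)
    # col_number() drops grouping marks such as "," while parsing
    # (early readr releases called this parser col_numeric()).
    read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5",
             col_types = list(col_number()))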
    

    This yields

    Source: local data frame [4 x 1]

      numbers
        (dbl)
    1   800.0
    2  1800.0
    3  3500.0
    4     6.5
    

    An important point when reading in files: you either have to pre-process, like the comment above regarding sed, or you have to process while reading. Often, if you try to fix things after the fact, there are some dangerous assumptions made that are hard to find. (Which is why flat files are so evil in the first place.)
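
    As a rough sketch of the "process while reading" route, the column type can be declared by name and the parse checked right away. The name csv_text is only illustrative, and the locale() call just spells out readr's default marks so that the assumption about "," being the grouping mark is visible:

    ```r
    library(readr)

    # Illustrative literal data; the same values as the example above.
    csv_text <- "numbers\n800\n\"1,800\"\n\"3500\"\n6.5"

    dat <- read_csv(
      csv_text,
      # Declare the column by name so a wrong guess cannot slip through.
      col_types = cols(numbers = col_number()),
      # Assumption: "," groups thousands and "." marks decimals (readr's defaults).
      locale = locale(grouping_mark = ",", decimal_mark = ".")
    )

    # Fail loudly if the column did not come back numeric.
    stopifnot(is.numeric(dat$numbers))

    # List any rows readr could not parse (none are expected here).
    problems(dat)
    ```

    For files that use "." as the grouping mark and "," as the decimal mark, the two locale() arguments would be swapped.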

    For instance, if I had not specified col_types, I would have gotten this:

    > read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5")
    Source: local data frame [4 x 1]
    
      numbers
        (chr)
    1     800
    2   1,800
    3    3500
    4     6.5
    

    (Notice that it is now a chr (character) instead of a numeric.)
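
    If the column has already come back as character, one possible after-the-fact repair is sketched below, subject to the caution above about fixing things late. The vector raw is a hypothetical stand-in for such a column:

    ```r
    library(readr)

    # Hypothetical character column, as produced when col_types is not given.
    raw <- c("800", "1,800", "3500", "6.5")

    # readr: parse_number() ignores the grouping mark while parsing.
    fixed_readr <- parse_number(raw)

    # Base R alternative: strip the commas, then convert.
    fixed_base <- as.numeric(gsub(",", "", raw, fixed = TRUE))

    # Both routes should agree on these values.
    stopifnot(all(fixed_readr == fixed_base))
    ```

    Note that parse_number() also discards other non-numeric characters around the number (currency symbols, for example), so it can hide bad values that the base R version would turn into NA.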

    Or, more dangerously, if it were long enough and most of the early elements did not contain commas:

    > set.seed(1)
    > tmp <- as.character(sample(c(1:10), 100, replace=TRUE))
    > tmp <- c(tmp, "1,003")
    > tmp <- paste(tmp, collapse="\"\n\"")
    

    (such that the last few elements look like:)

    \"5\"\n\"9\"\n\"7\"\n\"1,003"
    

    Then you'll have trouble reading that comma at all!

    > tail(read_csv(tmp))
    Source: local data frame [6 x 1]
    
         3"
      (dbl)
    1 8.000
    2 5.000
    3 5.000
    4 9.000
    5 7.000
    6 1.003
    Warning message:
    1 problems parsing literal data. See problems(...) for more details. 
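
    For comparison, here is a minimal sketch of the same kind of data read with an explicit column type. It differs from the snippet above in two assumed ways: every field gets both an opening and a closing quote, and a header row named numbers is added (both are illustrative choices, not part of the example above):

    ```r
    library(readr)

    set.seed(1)
    vals <- c(as.character(sample(1:10, 100, replace = TRUE)), "1,003")

    # Quote every field on both sides and add a header row,
    # so the comma in "1,003" stays inside a single field.
    csv_text <- paste(c("numbers", paste0('"', vals, '"')), collapse = "\n")

    # Declaring col_number() up front means the type does not depend on
    # what the first rows happen to look like.
    dat <- read_csv(csv_text, col_types = cols(numbers = col_number()))

    # The last value should now be read as one thousand and three.
    tail(dat$numbers, 1)
    ```

    With the type fixed in advance, the late "1,003" is parsed with the same rules as every other row instead of being guessed from the early ones.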
    
