How can I convert a matrix of strings into a tibble?

北城以北 提交于 2019-12-11 06:28:54

问题


I'm given some data in an .rData file. The format is an xts object in character mode. (I realise this is an unusual format, but I have no control over it)

> head(trades)
                    SYMBOL EX  PRICE    SIZE    COND BID    BIDSIZ   OFR   
2012-05-04 09:30:00 "BAC"  "T" "7.89"   "38538" "F"  "7.89" "523"    "7.9" 
2012-05-04 09:30:01 "BAC"  "Z" "7.885"  "288"   "@"  "7.88" "61033"  "7.9" 
2012-05-04 09:30:03 "BAC"  "X" "7.89"   "1000"  "@"  "7.88" "1974"   "7.89"
2012-05-04 09:30:07 "BAC"  "T" "7.89"   "19052" "F"  "7.88" "1058"   "7.89"
2012-05-04 09:30:08 "BAC"  "Y" "7.89"   "85053" "F"  "7.88" "108101" "7.9" 
2012-05-04 09:30:09 "BAC"  "D" "7.8901" "10219" "@"  "7.89" "268"    "7.9" 

> mode(trades)
'character'

I would like to process this data by converting into a saner format, namely a tibble, so that I can store the columns as datetimes, doubles and integers.

I've managed to achieve this with the following code:

> trades_ =  bind_cols(data_frame(DATE=index(trades)), as_data_frame(coredata(trades))) %>%
    mutate_at(as.numeric, .cols=vars(PRICE, BID, OFR)) %>%
    mutate_at(as.integer, .cols=vars(SIZE, BIDSIZ, OFRSIZ))

> head(trades_)
# A tibble: 6 × 10
                 DATE SYMBOL    EX  PRICE  SIZE  COND   BID BIDSIZ   OFR
               <dttm>  <chr> <chr>  <dbl> <int> <chr> <dbl>  <int> <dbl>
1 2012-05-04 09:30:00    BAC     T 7.8900 38538     F  7.89    523  7.90   
2 2012-05-04 09:30:01    BAC     Z 7.8850   288     @  7.88  61033  7.90
3 2012-05-04 09:30:03    BAC     X 7.8900  1000     @  7.88   1974  7.89   
4 2012-05-04 09:30:07    BAC     T 7.8900 19052     F  7.88   1058  7.89   
5 2012-05-04 09:30:08    BAC     Y 7.8900 85053     F  7.88 108101  7.90
6 2012-05-04 09:30:09    BAC     D 7.8901 10219     @  7.89    268  7.90

I'm wondering if there's already a built-in function for this. Something that looks at each column of the trades matrix and figures out whether it's a column of integers, doubles, etc and converts it to the appropriate type.

This is the sort of thing that a csv parser would do.


回答1:


This is far from an authoritative answer, but I ended up doing this:

smarter_type_convert = function (vector) {
    converted_vector = type.convert(vector)
    if (is.numeric(converted_vector)) {
        int_vector = as.integer(converted_vector)
        if (isTRUE(all.equal(int_vector, converted_vector, check.attributes=FALSE))) {
            int_vector
        } else {
            converted_vector
        }
    } else {
        converted_vector
    }
}

trades %>% coredata %>% as_data_frame %>% mutate_all(smarter_type_convert)



回答2:


You're right that a data frame is the right approach since you are working with multiple classes in the same frame. xts doesn't allow multiple classes so coercion rules force you to work with characters instead of numbers.

Here's a solution so you don't have to specifically call out each column. I use the tidyquant package which is made for working with quantitative data within the "tidyverse" (i.e. using "tidy" data frames). It also has some nice functions to convert to and from xts, matrix and other time series classes that contain row names.

First, I recreate the data.

> trades_xts
                    SYMBOL EX  PRICE    SIZE    COND BID    BIDSIZ   OFR    
2012-05-04 09:30:00 "BAC"  "T" "7.8900" "38538" "F"  "7.89" "523"    "7.90" 
2012-05-04 09:30:01 "BAC"  "Z" "7.8850" "288"   "@"  "7.88" "61033"  "7.90" 
2012-05-04 09:30:03 "BAC"  "X" "7.8900" "1000"  "@"  "7.88" "1974"   "7.89" 
2012-05-04 09:30:07 "BAC"  "T" "7.8900" "19052" "F"  "7.88" "1058"   "7.89"
2012-05-04 09:30:08 "BAC"  "Y" "7.8900" "85053" "F"  "7.88" "108101" "7.90" 
2012-05-04 09:30:09 "BAC"  "D" "7.8901" "10219" "@"  "7.89" "268"    "7.90" 

Next, I use tidy functions to clean the data. It's a little longer than your script, but you don't need to worry about which columns have which data type (with the exception of the xts index). Note that I'm using the tidyquant::as_tibble() functions to convert the xts row names to a column. I use mutate_each to apply the type.convert function to each column. Unfortunately base R loves the factor class so I add an additional step to convert to character. The last two steps just clean up the date-time column using dplyr::rename and lubridate::as_datetime, which tidyquant loads for you.

> library(tidyquant)
> trades_xts %>%
    as_tibble(preserve_row_names = TRUE) %>%
    mutate_each(funs(type.convert)) %>%
    mutate_if(is.factor, as.character) %>%
    rename(DATE = row.names) %>%
    mutate(DATE = as_datetime(DATE, tz = Sys.timezone())) 
    # A tibble: 6 × 9
                 DATE SYMBOL    EX  PRICE  SIZE  COND   BID BIDSIZ   OFR
               <dttm>  <chr> <chr>  <dbl> <int> <chr> <dbl>  <int> <dbl>
1 2012-05-04 09:30:00    BAC     T 7.8900 38538     F  7.89    523  7.90
2 2012-05-04 09:30:01    BAC     Z 7.8850   288     @  7.88  61033  7.90
3 2012-05-04 09:30:03    BAC     X 7.8900  1000     @  7.88   1974  7.89
4 2012-05-04 09:30:07    BAC     T 7.8900 19052     F  7.88   1058  7.89
5 2012-05-04 09:30:08    BAC     Y 7.8900 85053     F  7.88 108101  7.90
6 2012-05-04 09:30:09    BAC     D 7.8901 10219     @  7.89    268  7.90


来源:https://stackoverflow.com/questions/42475874/how-can-i-convert-a-matrix-of-strings-into-a-tibble

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!