typeof returns integer for something that is clearly a factor

悲哀的现实 2020-12-03 14:51

Create a variable:

a_variable <- c("a","b","c")

Check type:

typeof(a_variable)

I want a factor -

2 Answers
  • 2020-12-03 15:30

    More on str: what surprised me is that the name is an abbreviation of "structure", not "string". The bottommost example shows that str describes the object more readably than dput, labelling it "Factor w/ N levels":

    > str(head(abalone$Age, 5))
     Factor w/ 3 levels "Mid","Old","Yng": 2 3 1 1 3

    Thank you for asking this question. I have found data types in R confusing, and I ran into the same issue while processing the Abalone dataset from the UCI Machine Learning Repository. I continued the research following the reply by 42-, which eventually helped me understand the typing and will hopefully help someone else. I found this resource helpful for understanding R data types: R-supp-data-structures

    What I observed while processing the data.frame built from the Abalone dataset:

    1. Building the "Age" column with lapply yields a "list" of "character" objects, because lapply always returns a list, even when the result could be an atomic vector.
    2. Applying unlist to the "Age" column turns it into an "atomic vector" of type "character".
    3. After encoding that vector as a factor, we get an object of class "factor".

    The code example:

    #
    # Understanding data types while processing the Abalone dataset
    #
    download.file('http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data', 'abalone.data')
    abalone <- read.table("abalone.data", header = FALSE, sep = ",", na.strings = "*")
    
    # name the columns of the data.frame
    colnames(abalone) <- c('Sex', 'Length', 'Diameter', 'Height', 'Whole w.', 'Shucked w.', 'Viscera w.', 'Shell w.', 'Rings')
    dput(head(abalone, 1))
    
    # discretize the numeric ring count into three abalone age ranges
    additiveRingsToAgeConst <- 1.5
    
    abalone$Age <- lapply(abalone[, 'Rings'] + additiveRingsToAgeConst, function(x) {
      if (x > 11.5)     {"Old"}
      else if (x > 9.5) {"Mid"}
      else              {"Yng"}
    })
    
    # 1. building the "Age" column with lapply yields a "list" of "character" objects
    dput(head(abalone$Age, 5))
    str(head(abalone$Age, 5))
    # 2. applying unlist to the "Age" column turns it into an "atomic vector" of type "character"
    abalone$Age <- unlist(abalone$Age)
    dput(head(abalone$Age, 5))
    str(head(abalone$Age, 5))
    # 3. after encoding the vector as a factor we get an object of class "factor"
    abalone$Age <- as.factor(abalone$Age)
    dput(head(abalone$Age, 5))
    str(head(abalone$Age, 5))
    

    Code execution results:

    > # 1. building the "Age" column with lapply yields
      #    a "list" of "character" objects
    > dput(head(abalone$Age, 5))
    list("Old", "Yng", "Mid", "Mid", "Yng")
    > str(head(abalone$Age, 5))
    List of 5
     $ : chr "Old"
     $ : chr "Yng"
     $ : chr "Mid"
     $ : chr "Mid"
     $ : chr "Yng"
    
    > # 2. applying unlist to the "Age" column turns it into
      #    an "atomic vector" of type "character"
    > abalone$Age <- unlist(abalone$Age)
    > dput(head(abalone$Age, 5))
    c("Old", "Yng", "Mid", "Mid", "Yng")
    > str(head(abalone$Age, 5))
     chr [1:5] "Old" "Yng" "Mid" "Mid" "Yng"
    
    > # 3. after encoding the vector as a factor we get an object of class "factor"
    > abalone$Age <- as.factor(abalone$Age)
    > dput(head(abalone$Age, 5))
    structure(c(2L, 3L, 1L, 1L, 3L), .Label = c("Mid", "Old", "Yng"
    ), class = "factor")
    > str(head(abalone$Age, 5))
     Factor w/ 3 levels "Mid","Old","Yng": 2 3 1 1 3
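
    As an aside, the lapply / unlist / as.factor chain can be collapsed into one step with base R's cut(), which returns a factor directly. This is a different technique from the one used above, shown only as a sketch: the rings vector is a small illustrative stand-in for abalone$Rings, and the break points are chosen to mirror the thresholds in the code above.

    ```r
    # Illustrative ring counts standing in for abalone$Rings (assumption)
    rings <- c(15L, 7L, 9L, 10L, 7L)

    # Same thresholds as above: x > 11.5 is "Old", x > 9.5 is "Mid", else "Yng".
    # cut() uses right-closed intervals by default, so 9.5 and 11.5 fall into
    # the lower bin, exactly like the strict ">" comparisons.
    age <- cut(rings + 1.5,
               breaks = c(-Inf, 9.5, 11.5, Inf),
               labels = c("Yng", "Mid", "Old"))

    class(age)         # "factor"
    typeof(age)        # "integer"
    as.character(age)  # "Old" "Yng" "Mid" "Mid" "Yng"
    levels(age)        # "Yng" "Mid" "Old"
    ```

    Note that the levels here follow the order of the labels argument, whereas as.factor sorts them alphabetically ("Mid", "Old", "Yng"), so the underlying integer codes differ even though the labels are the same.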
    
  • 2020-12-03 15:53

    This is a language feature that confused me as well in my early days of R programming. The typeof function reports information at a "lower" level of abstraction: how the object is stored. Factor variables are stored as integers (Dates are also stored as plain numbers, typically doubles). Learn to use class or str rather than typeof (or mode); they give more useful information. You can look at the full "structure" of a factor variable with dput:

    dput( factor( rep( letters[1:5], 2) ) )
    # structure(c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L),
    #           .Label = c("a", "b", "c", "d", "e"), class = "factor")
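
    To make the difference between the storage type and the class concrete, here is a minimal sketch that queries the same objects with typeof, mode, and class. The variable names f and d are only illustrative.

    ```r
    f <- factor(c("a", "b", "c"))
    d <- as.Date("2020-12-03")

    # Storage-level view
    typeof(f)     # "integer"
    mode(f)       # "numeric"
    typeof(d)     # "double"

    # Class-level view, which is usually what you want to test
    class(f)      # "factor"
    class(d)      # "Date"
    is.factor(f)  # TRUE
    inherits(d, "Date")  # TRUE
    ```

    When code needs to branch on "is this a factor?", is.factor() or inherits() is a safer test than comparing the result of typeof.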
    

    The character values that are usually thought of as the factor values are actually stored in an attribute (shown as .Label in the dput output, and returned by levels), while the "main" part of the variable is a set of integer indices pointing to the various levels. That is why mode returns "numeric" and typeof returns "integer". For this reason one usually needs to use as.character to coerce a factor to what most people think of as its values, namely the character representations.
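
    A short sketch of what that split between integer codes and levels means in practice. The factor below is made up for illustration; it uses number-like labels because that is where confusing the codes with the values usually bites.

    ```r
    f <- factor(c("10", "20", "10", "30"))

    levels(f)      # "10" "20" "30"  (the character labels)
    as.integer(f)  # 1 2 1 3         (the stored integer codes)

    # Recovering the labels: both forms give "10" "20" "10" "30"
    as.character(f)
    levels(f)[f]

    # Converting the labels to numbers must go through character first
    as.numeric(as.character(f))  # 10 20 10 30
    as.numeric(f)                # 1 2 1 3, the codes and not the values
    ```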
