Column referencing: [[i]] vs [,i] for matrix, dataframe, and data.table

Posted by 倾然丶 夕夏残阳落幕 on 2020-01-13 06:32:00

Question


Could someone please explain to me the difference in column referencing between matrix, data.frame, and data.table? I'm getting my head around which syntax to use for each class, but I don't understand how/why they're different.

Take a 10x10 matrix

foo <- matrix( nrow = 10, ncol = 10 )

I'll just fill the 2nd column to demonstrate:

foo[,2] <- rnorm(10)
head( foo, 3 )

        [,1]       [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
  [1,]   NA -0.4688874   NA   NA   NA   NA   NA   NA   NA    NA
  [2,]   NA -1.0273370   NA   NA   NA   NA   NA   NA   NA    NA
  [3,]   NA -0.3981627   NA   NA   NA   NA   NA   NA   NA    NA

Now I can reference the 2nd column with foo[,2], but foo[[2]] returns only 1 cell, which in this case is NA:

foo[,2]
 [1]  0.18340527  0.46511236 -2.43277107  0.13260218  0.20227436 -0.57518392 -0.62211864  2.00239088  -0.09561907  0.67536428

foo[[2]]
  [1] NA

If I change the matrix to a dataframe, both referencing methods work:

foo <- data.frame( foo )
foo[,2]
  [1] -0.4688874 -1.0273370 -0.3981627 -0.2207062  0.5711004  1.1085851 -1.3343338  0.2337622  -1.0632469 -0.9783714


foo[[2]]
  [1] -0.4688874 -1.0273370 -0.3981627 -0.2207062  0.5711004  1.1085851 -1.3343338  0.2337622  -1.0632469 -0.9783714

Now if I convert to a data.table only the second method works, and the first method returns the value 2 (which isn't in the table at all):

foo[,2]
  [1] 2
foo[[2]]
  [1] -0.4688874 -1.0273370 -0.3981627 -0.2207062  0.5711004  1.1085851 -1.3343338  0.2337622  -1.0632469 -0.9783714

So my question is, why the different syntax for different classes? And is there a particular syntax that would work for all 3 classes, or do we need to know/check the tabular class before knowing how to call a reference?

EDIT: also interesting here is that row referencing is more consistent across classes.

For matrix, dataframe, and data.table respectively:

foo[2,]
 [1]        NA 0.4651124        NA        NA        NA        NA        NA        NA        NA        NA

foo <- data.frame( foo )
foo[2,]
  X1        X2 X3 X4 X5 X6 X7 X8 X9 X10
2 NA 0.4651124 NA NA NA NA NA NA NA  NA

setDT( foo )
foo[2,]
  X1        X2 X3 X4 X5 X6 X7 X8 X9 X10
1: NA 0.4651124 NA NA NA NA NA NA NA  NA

Answer 1:


A few things I've learned since posting this question here (with a lot of help from the commenters!), which have helped me to understand the differences I mentioned. If anyone could please clarify anything I'm getting wrong here, I'd appreciate it:

  • Objects of class data.frame and data.table are lists under the hood, whereas a matrix is an atomic vector with a dim attribute. That difference is what drives the behaviour of [[.

  • Each column of a data.frame or data.table object is an element of the list "under its hood", meaning that a column can be extracted in the same way as a list element would be, hence foo[[2]] works great for calling the second column in both of those classes.

  • A matrix differs in that every cell is an element of the underlying vector, meaning that foo[[2]] will only retrieve one cell, rather than a column (which brings us to...).

  • The elements making up the matrix are stored column-wise (top-to-bottom, then left-to-right), so the call foo[[2]] retrieves the second element, which here resides in row 2 of column 1.

  • Since the matrix also has dimensions, foo[,2] is accepted as referring to the second column, just as it is for a data.frame object.

  • A data.table object (up until recently, see next point) evaluated the j argument of foo[,2] as an expression rather than as a column index, so the literal 2 simply evaluated to 2, regardless of the data in the table. Selecting a column by number that way required with = FALSE.

  • As of fairly recent updates to the data.table package (as of 1.9.8 I think? Thank you maintainers!) the syntax foo[,2] now selects the second column, much as it does for a data.frame (although the result is a one-column data.table rather than a plain vector), so some of the confusion which led to my question has been superseded!
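To make the storage difference concrete, here is a minimal sketch in base R (no data.table needed). The object names m and df and the seed are my own choices for illustration, not part of the original question. A plain matrix is an atomic vector with a dim attribute, while a data.frame is a list of columns, and that is why [[2]] behaves differently on each.

```r
# Base R only: inspect how a matrix and a data.frame are stored.
set.seed(1)                                   # arbitrary seed, for repeatability
m <- matrix(NA_real_, nrow = 10, ncol = 10)   # 10x10 numeric matrix of NAs
m[, 2] <- rnorm(10)                           # fill the 2nd column
df <- data.frame(m)                           # columns become X1 ... X10

is.list(m)    # FALSE: a matrix is an atomic vector plus a dim attribute
is.list(df)   # TRUE: a data.frame is a list with one element per column
dim(m)        # 10 10

# [[2]] on the matrix is the 2nd element in column-major order,
# i.e. row 2 of column 1 (still NA here).
identical(m[[2]], m[2, 1])    # TRUE

# [[2]] on the data.frame is the whole 2nd column.
identical(df[[2]], m[, 2])    # TRUE
```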

So in conclusion:

  1. Both data.table and data.frame objects are actually lists (which means I now get @N8TRO's joke in the comments, to which I was naively oblivious before), containing a list element for each column, whereas a matrix is an atomic vector containing one element for each cell (this makes the [[ call make sense to me now).

  2. All of the objects mentioned have dimensions, which means (as of recent data.table package updates) the foo[,2] syntax selects the second column for all 3 classes. YAY!
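If you need one call that returns a column as a plain vector whatever the class (including on older data.table versions), one option is a small wrapper. This is only a sketch: get_col() is a hypothetical helper of my own, not a function from base R or data.table, and it assumes the input is a matrix, a data.frame, or a data.table.

```r
# get_col() is a hypothetical helper, not part of base R or data.table.
get_col <- function(x, j) {
  if (is.matrix(x)) {
    x[, j]     # matrix: index by dimension
  } else {
    x[[j]]     # data.frame / data.table: extract the list element
  }
}

m  <- matrix(1:6, nrow = 3)
df <- data.frame(m)

get_col(m, 2)    # 4 5 6
get_col(df, 2)   # 4 5 6

# Optional: only runs if data.table happens to be installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(df)
  get_col(dt, 2)
}
```

The helper relies on [[ for the list-based classes because that form does not depend on which data.table version is installed.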

Thank you so much to the commenters for pointing me in the right direction (and making jokes that I now get). I hope this might help someone in the future who comes across the same confusion I did.



Source: https://stackoverflow.com/questions/38263734/column-referencing-i-vs-i-for-matrix-dataframe-and-data-table
