Should I use a data.frame or a matrix?

前端 未结 6 1578
我在风中等你
我在风中等你 2020-11-28 00:59

When should one use a data.frame, and when is it better to use a matrix?

Both keep data in a rectangular format, so sometimes it\'s unclear

6条回答
  •  再見小時候
    2020-11-28 01:30

    I cannot stress out more the efficiency difference between the two! While it is true that DFs are more convenient in some especially data analysis cases, they also allow heterogeneous data, and some libraries accept them only, these all is really secondary unless you write a one-time code for a specific task.

    Let me give you an example. There was a function that would calculate the 2D path of the MCMC method. Basically, this means we take an initial point (x,y), and iterate a certain algorithm to find a new point (x,y) at each step, constructing this way the whole path. The algorithm involves calculating a quite complex function and the generation of some random variable at each iteration, so when it run for 12 seconds I thought it is fine given how much stuff it does at each step. That being said, the function collected all points in the constructed path together with the value of an objective function in a 3-column data.frame. So, 3 columns is not that large, and the number of steps was also more than reasonable 10,000 (in this kind of problems paths of length 1,000,000 are typical, so 10,000 is nothing). So, I thought a DF 10,000x3 is definitely not an issue. The reason a DF was used is simple. After calling the function, ggplot() was called to draw the resulting (x,y)-path. And ggplot() does not accept a matrix.

    Then, at some point out of curiosity I decided to change the function to collect the path in a matrix. Gladly the syntax of DFs and matrices is similar, all I did was to change the line specifying df as a data.frame to one initializing it as a matrix. Here I need also to mention that in the initial code the DF was initialized to have the final size, so later in the code of the function only new values were recorded into already allocated spaces, and there was no overhead of adding new rows to the DF. This makes the comparison even more fair, and it also made my job simpler as I did not need to rewrite anything further in the function. Just one line change from the initial allocation of a data.frame of the required size to a matrix of the same size. To adapt the new version of the function to ggplot(), I converted the now returned matrix to a data.frame to use in ggplot().

    After I rerun the code I could not believe the result. The code run in a fraction of a second! Instead of about 12 seconds. And again, the function during the 10,000 iterations only read and wrote values to already allocated spaces in a DF (and now in a matrix). And this difference is also for the reasonable (or rather small) size 10000x3.

    So, if your only reason to use a DF is to make it compatible with a library function such as ggplot(), you can always convert it to a DF at the last moment -- work with matrices as far as you feel convenient. If on the other hand there is a more substantial reason to use a DF, such as you use some data analysis package that would require otherwise constant transforming from matrices to DFs and back, or you do not do any intensive calculations yourself and only use standard packages (many of them actually internally transform a DF to a matrix, do their job, and then transform the result back -- so they do all efficiency work for you), or do a one-time job so you do not care and feel more comfortable with DFs, then you should not worry about efficiency.

    Or a different more practical rule: if you have a question such as in the OP, use matrices, so you would use DFs only when you do not have such a question (because you already know you have to use DFs, or because you do not really care as the code is one-time etc.).

    But in general keep this efficiency point always in mind as a priority.

提交回复
热议问题