Why is reading rows faster than reading columns?

粉色の甜心 2021-02-04 12:19

I am analysing a dataset with 200 rows and 1200 columns, stored in a .CSV file. To process it, I read this file using R's read.csv().

2 Answers
  •  耶瑟儿~
    2021-02-04 12:31

    Wide data sets are typically slower to read into memory than long data sets (i.e., the transposed layout). This affects many programs that read data, such as R, Python, and Excel, though the points below are most pertinent to R:

    • R needs to allocate memory for each cell, even if it is NA. This means that every column has at least as many cells as the number of rows in the CSV file, whereas in a long dataset you can potentially drop the NA values and save some space.
    • R has to guess the data type for each value and make sure it's consistent with the data type of the column, which also introduces overhead; the more columns there are, the more per-column guessing work there is (see the sketch after this list).
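
    For illustration, here is a minimal sketch of the effect. The file names wide.csv and long.csv are made up for the example; they simply hold the same 200 x 1200 numbers in the original and transposed layouts:

        # Build one 200 x 1200 numeric table and write it in both orientations
        set.seed(1)
        m <- matrix(rnorm(200 * 1200), nrow = 200, ncol = 1200)
        write.csv(m,    "wide.csv", row.names = FALSE)   # 200 rows, 1200 columns
        write.csv(t(m), "long.csv", row.names = FALSE)   # 1200 rows, 200 columns

        # Reading the wide file means 1200 columns to allocate and type-check,
        # versus only 200 for the transposed file
        system.time(read.csv("wide.csv"))
        system.time(read.csv("long.csv"))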

    Since your dataset doesn't appear to contain any NA values, my hunch is that you're seeing the speed improvement because of the second point. You can test this theory by passing colClasses = rep('numeric', 1200) to read.csv or fread for the 1200-column data set, or rep('numeric', 200) for the transposed 200-column one, which should decrease the overhead of guessing data types.
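
    For example, a rough way to run that test, reusing the made-up wide.csv and long.csv files from the sketch above (fread comes from the data.table package):

        # Declaring every column as numeric skips the type-guessing step
        system.time(read.csv("wide.csv", colClasses = rep("numeric", 1200)))
        system.time(read.csv("long.csv", colClasses = rep("numeric", 200)))

        # The same argument works with data.table::fread
        library(data.table)
        system.time(fread("wide.csv", colClasses = rep("numeric", 1200)))

    If the gap between the wide and long timings shrinks once colClasses is supplied, most of the original difference came from per-column type guessing rather than from the raw amount of data.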
