Reading in multiple CSVs with different numbers of lines to skip at start of file

怎甘沉沦 submitted on 2019-12-03 05:58:59

The fread function from the data.table package automatically detects how many rows to skip at the start of a file. Note that the function was still under development at the time of writing.

Here is an example:

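library(data.table)

# Create three sample files, each with a different number of junk lines before the header
cat("blah\nblah\nblah\nVARIABLE,X1,X2\nA,1,2\n", file="myfile1.csv")
cat("blah\nVARIABLE,A1,A2\nA,1,2\n", file="myfile2.csv")
cat("blah\nblah\nVARIABLE,Z1,Z2\nA,1,2\n", file="myfile3.csv")

# fread works out where the table starts in each file
lapply(list.files(pattern = "myfile.*.csv"), fread)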
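
If the automatic detection ever picks the wrong line, fread's skip argument also accepts a string, in which case reading starts at the first line containing that string. Here is a minimal sketch, assuming every header line contains the text VARIABLE as in the sample files. The temporary directory and the names demo_dir, files and tables are only for illustration:

```r
library(data.table)

# Write two sample files into a temporary directory (illustrative only)
demo_dir <- tempfile("csvdemo")
dir.create(demo_dir)
cat("blah\nblah\nblah\nVARIABLE,X1,X2\nA,1,2\n", file = file.path(demo_dir, "myfile1.csv"))
cat("blah\nVARIABLE,A1,A2\nA,1,2\n", file = file.path(demo_dir, "myfile2.csv"))

files <- list.files(demo_dir, pattern = "^myfile.*\\.csv$", full.names = TRUE)

# skip = "VARIABLE" starts reading at the first line that contains "VARIABLE"
tables <- lapply(files, fread, skip = "VARIABLE")
names(tables) <- basename(files)
```

Naming the list by file makes it easier to tell afterwards which table came from which file.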

Here's a minimal example of one approach that can be taken.

First, let's make up some csv files similar to the ones you describe:

cat("blah\nblah\nblah\nVARIABLE,X1,X2\nA,1,2\n", file="myfile1.csv")
cat("blah\nVARIABLE,A1,A2\nA,1,2\n", file="myfile2.csv")
cat("blah\nblah\nVARIABLE,Z1,Z2\nA,1,2\n", file="myfile3.csv")

Second, identify where the data start:

linesToSkip <- sapply(list.files(pattern = "myfile.*.csv"), 
                      function(x) grep("^VARIABLE", readLines(x))-1)

Third, use that information to read your files into a single list:

lapply(names(linesToSkip), 
       function(x) read.csv(file=x, skip = linesToSkip[x]))
# [[1]]
#   VARIABLE X1 X2
# 1        A  1  2
# 
# [[2]]
#   VARIABLE A1 A2
# 1        A  1  2
# 
# [[3]]
#   VARIABLE Z1 Z2
# 1        A  1  2

Edit #1

An alternative to reading the data twice is to read it once into a list, and then perform the same type of processing:

myRawData <- lapply(list.files(pattern = "myfile.*.csv"), readLines)
lapply(myRawData, function(x) {
  linesToSkip <- grep("^VARIABLE", x)-1
  read.csv(text = x, skip = linesToSkip)
})

Or, for that matter:

lapply(list.files(pattern = "myfile.*.csv"), function(x) {
  temp <- readLines(x)
  linesToSkip <- grep("^VARIABLE", temp)-1
  read.csv(text = temp, skip = linesToSkip)
})
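
One thing to watch for with this approach: if a file has no line starting with VARIABLE, grep returns a zero-length result and read.csv fails with a message that doesn't say which file was at fault. A minimal sketch of a wrapper that checks for this, using only base R. The function name read_after_header, its header_pattern argument and the temporary directory are hypothetical names chosen for this example:

```r
# Hypothetical helper: read a CSV whose header line matches `header_pattern`
read_after_header <- function(file, header_pattern = "^VARIABLE") {
  lines <- readLines(file)
  header_line <- grep(header_pattern, lines)
  if (length(header_line) == 0) {
    stop("No header line matching '", header_pattern, "' in ", file)
  }
  # Use the first match in case the pattern also appears in the data rows
  read.csv(text = lines, skip = header_line[1] - 1)
}

# Sample files in a temporary directory (illustrative only)
demo_dir <- tempfile("csvdemo")
dir.create(demo_dir)
cat("blah\nblah\nblah\nVARIABLE,X1,X2\nA,1,2\n", file = file.path(demo_dir, "myfile1.csv"))
cat("blah\nVARIABLE,A1,A2\nA,1,2\n", file = file.path(demo_dir, "myfile2.csv"))

files <- list.files(demo_dir, pattern = "^myfile.*\\.csv$", full.names = TRUE)
result <- lapply(files, read_after_header)
names(result) <- basename(files)
```

Each file is still read only once, and a file without a header stops with an error that names it.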

Edit #2

As @PaulHiemstra notes, you can use the n argument of readLines to read only the first few lines of each file into memory, rather than the whole file. So if you know for certain that no file has more than 20 lines of "rubbish", and you are using the first approach described, you can use:

linesToSkip <- sapply(list.files(pattern = "myfile.*.csv"), 
                      function(x) grep("^VARIABLE", readLines(x, n = 20))-1)
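
If the 20-line assumption turns out to be wrong for some file, grep finds nothing, sapply quietly returns a list instead of a vector, and the later read.csv call fails. A minimal sketch that makes the limit explicit and stops early instead. The helper name lines_to_skip, its arguments and the temporary directory are hypothetical, made up for this example:

```r
# Hypothetical helper: number of lines to skip, looking only at the first `max_lines`
lines_to_skip <- function(file, max_lines = 20, header_pattern = "^VARIABLE") {
  header_line <- grep(header_pattern, readLines(file, n = max_lines))
  if (length(header_line) == 0) {
    stop("No header found in the first ", max_lines, " lines of ", file)
  }
  header_line[1] - 1
}

# Sample files in a temporary directory (illustrative only)
demo_dir <- tempfile("csvdemo")
dir.create(demo_dir)
cat("blah\nblah\nblah\nVARIABLE,X1,X2\nA,1,2\n", file = file.path(demo_dir, "myfile1.csv"))
cat("blah\nVARIABLE,A1,A2\nA,1,2\n", file = file.path(demo_dir, "myfile2.csv"))

files <- list.files(demo_dir, pattern = "^myfile.*\\.csv$", full.names = TRUE)

# vapply guarantees exactly one number per file
skips <- vapply(files, lines_to_skip, numeric(1))

# Read each file, skipping its own number of junk lines
tables <- Map(function(f, s) read.csv(f, skip = s), files, skips)
```

Map pairs each file with its own skip count, so nothing depends on looking values up by name.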