Reading in multiple CSVs with different numbers of lines to skip at start of file

Submitted by 雨燕双飞 on 2020-01-12 04:45:14

Question


I have to read in about 300 individual CSVs. I have managed to automate the process using a loop and structured CSV names. However, each CSV has 14-17 lines of rubbish at the start, and the count varies randomly, so hard-coding a 'skip' parameter in the read.table command won't work. The column names and number of columns are the same for each CSV.

Here is an example of what I am up against:

QUICK STATISTICS:

      Directory: Data,,,,
           File: Final_Comp_Zn_1
      Selection: SEL{Ox*1000+Doma=1201}
         Weight: None,,,
     ,,Variable: AG,,,

Total Number of Samples: 450212  Number of Selected Samples: 277


Statistics

VARIABLE,Min slice Y(m),Max slice Y(m),Count,Minimum,Maximum,Mean,Std.Dev.,Variance,Total Samples in Domain,Active Samples in Domain
AG,   6780.00,   6840.00,         7,    3.0000,   52.5000,   23.4143,   16.8507,  283.9469,        10,        10
AG,   6840.00,   6900.00,         4,    4.0000,    5.5000,    4.9500,    0.5766,    0.3325,        13,        13
AG,   6900.00,   6960.00,        16,    1.0000,   37.0000,    8.7625,    9.0047,   81.0848,        29,        29
AG,   6960.00,   7020.00,        58,    3.0000,   73.5000,   10.6931,   11.9087,  141.8172,       132,       132
AG,   7020.00,   7080.00,        23,    3.0000,  104.5000,   15.3435,   23.2233,  539.3207,        23,        23
AG,   7080.00,   7140.00,        33,    1.0000,   15.4000,    3.8152,    2.8441,    8.0892,        35,        35

Basically, I want to start reading from the line VARIABLE,Min slice Y(m),Max slice Y(m),.... I can think of a few solutions, but I don't know how I would go about programming them. Is there any way I can:

  1. Read the CSV first, somehow work out how many lines of rubbish there are, and then re-read it specifying the correct number of lines to skip? Or
  2. Tell read.table to start reading when it finds the column names (since these are the same for each CSV) and ignore everything prior to that?

I think solution (2) would be the most appropriate, but I am open to any suggestions!


Answer 1:


The function fread from the data.table package automatically detects the number of rows to skip. (At the time of writing, the function is still in development.)

Here is an example code:

require(data.table)

# Create three example files with varying numbers of junk lines before the header
cat("blah\nblah\nblah\nVARIABLE,X1,X2\nA,1,2\n", file="myfile1.csv")
cat("blah\nVARIABLE,A1,A2\nA,1,2\n", file="myfile2.csv")
cat("blah\nblah\nVARIABLE,Z1,Z2\nA,1,2\n", file="myfile3.csv")

# fread detects the header row and skips the preceding junk automatically
lapply(list.files(pattern = "myfile.*.csv"), fread)



Answer 2:


Here's a minimal example of one approach that can be taken.

First, let's make up some CSV files similar to the ones you describe:

cat("blah\nblah\nblah\nVARIABLE,X1,X2\nA,1,2\n", file="myfile1.csv")
cat("blah\nVARIABLE,A1,A2\nA,1,2\n", file="myfile2.csv")
cat("blah\nblah\nVARIABLE,Z1,Z2\nA,1,2\n", file="myfile3.csv")

Second, identify where the data start:

# Find the header line in each file; subtract 1 to get the number of lines to skip
linesToSkip <- sapply(list.files(pattern = "myfile.*.csv"), 
                      function(x) grep("^VARIABLE", readLines(x))-1)

Third, use that information to read your files into a single list:

lapply(names(linesToSkip), 
       function(x) read.csv(file=x, skip = linesToSkip[x]))
# [[1]]
#   VARIABLE X1 X2
# 1        A  1  2
# 
# [[2]]
#   VARIABLE A1 A2
# 1        A  1  2
# 
# [[3]]
#   VARIABLE Z1 Z2
# 1        A  1  2
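
Since your real files all share the same column names, you could go one step further and stack the list into a single data frame. A minimal sketch reusing linesToSkip from above (note that the toy files here deliberately have different column names, so this applies to files with matching columns, like yours):

myData <- lapply(names(linesToSkip),
                 function(x) read.csv(file = x, skip = linesToSkip[x]))
# Stack the per-file data frames; assumes identical columns across files
allData <- do.call(rbind, myData)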

Edit #1

An alternative to reading the data twice is to read it once into a list, and then perform the same type of processing:

# Read every file fully into memory as a character vector of lines
myRawData <- lapply(list.files(pattern = "myfile.*.csv"), readLines)
lapply(myRawData, function(x) {
  # Locate the header line, then parse the remaining lines as CSV
  linesToSkip <- grep("^VARIABLE", x)-1
  read.csv(text = x, skip = linesToSkip)
})

Or, for that matter:

lapply(list.files(pattern = "myfile.*.csv"), function(x) {
  temp <- readLines(x)
  linesToSkip <- grep("^VARIABLE", temp)-1
  read.csv(text = temp, skip = linesToSkip)
})

Edit #2

As @PaulHiemstra notes, you can use the argument n to read only the first few lines of each file into memory rather than the whole file. Thus, if you know for certain that each file has no more than 20 lines of "rubbish" and you are using the first approach described above, you can use:

linesToSkip <- sapply(list.files(pattern = "myfile.*.csv"), 
                      function(x) grep("^VARIABLE", readLines(x, n = 20))-1)
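
Putting the pieces together, a sketch of the complete two-pass read with the capped readLines (again assuming no more than 20 lines of rubbish per file):

files <- list.files(pattern = "myfile.*.csv")
myData <- lapply(files, function(x) {
  # Scan only the first 20 lines to find the header row
  linesToSkip <- grep("^VARIABLE", readLines(x, n = 20)) - 1
  read.csv(file = x, skip = linesToSkip)
})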


Source: https://stackoverflow.com/questions/15332195/reading-in-multiple-csvs-with-different-numbers-of-lines-to-skip-at-start-of-fil
