How to skip extra lines before the header of a tab-delimited file in R

[亡魂溺海] submitted on 2019-12-06 06:58:05

Question


The software I am using produces log files with a variable number of lines of summary information followed by lots of tab-delimited data. I am trying to write a function that reads the data from these log files into a data frame, ignoring the summary information. The summary information never contains a tab, so the following function works:

read.parameters <- function(file.name, ...){
  lines <- scan(file.name, what="character", sep="\n")
  first.line <- min(grep("\\t", lines))
  return(read.delim(file.name, skip=first.line-1, ...))
}

However, these log files are quite big, so reading each file twice is very slow. Surely there is a better way?

Edited to add:

Marek suggested using a textConnection object. The approach in his answer fails on a big file, but the following works:

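read.parameters <- function(file.name, ...){
  conn <- file(file.name, "r")
  on.exit(close(conn))
  repeat{
    line <- readLines(conn, 1)
    # Stop at end of file so a log with no tab in it cannot loop forever
    if (length(line) == 0) stop("no tab-delimited line found in ", file.name)
    # The first line containing a tab is the header: push it back and stop
    if (grepl("\t", line, fixed=TRUE)) {
      pushBack(line, conn)
      break
    }
  }
  # read.delim continues from the current position of the open connection
  df <- read.delim(conn, ...)
  return(df)
}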

Edited again: Thanks Marek for further improvement to the above function.
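
To see why pushing the header back works, here is a minimal sketch on a small temporary file. The file contents and the variable names are made up for illustration; only readLines, pushBack and read.delim come from the function above.

```r
# Build a tiny log: one summary line, then tab-delimited data (made-up content)
tmp <- tempfile(fileext = ".log")
writeLines(c("summary line", "x\ty", "1\t2"), tmp)

conn <- file(tmp, "r")
summary.line <- readLines(conn, 1)  # consumes the summary line
header <- readLines(conn, 1)        # consumes the header line "x\ty"
pushBack(header, conn)              # put the header back on the connection
df <- read.delim(conn)              # reads the header again, then the data
close(conn)

print(df)
```

Because the connection stays open between calls, each line of the file is read from disk only once.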


Answer 1:


You don't need to read the file twice. Pass the lines you have already read to textConnection.

read.parameters <- function(file.name, ...){
  # the question had "tmp.log" here; file.name is presumably what was meant
  lines <- scan(file.name, what="character", sep="\n")
  first.line <- min(grep("\\t", lines))
  conn <- textConnection(lines)
  on.exit(close(conn))  # release the text connection when the function exits
  return(read.delim(conn, skip=first.line-1, ...))
}



Answer 2:


If you can be sure that the header info won't be more than N lines, e.g. N = 200, then try:

scan(..., nlines = N)

That way you won't re-read more than N lines.
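
A minimal sketch of this idea, assuming the header always appears within the first N lines. The function name read.parameters.bounded and the argument max.header are hypothetical. It uses readLines(n = N) in place of scan(nlines = N), because scan drops blank lines by default (blank.lines.skip = TRUE), which would shift the line count passed to skip.

```r
# Hypothetical helper: look for the header in at most max.header lines,
# then let read.delim skip straight to it.
read.parameters.bounded <- function(file.name, max.header = 200, ...) {
  # Only the first max.header lines are read in this first pass
  head.lines <- readLines(file.name, n = max.header)
  tab.lines <- grep("\t", head.lines, fixed = TRUE)
  if (length(tab.lines) == 0)
    stop("no tab-delimited line in the first ", max.header, " lines")
  # skip counts physical lines, so skip everything before the header
  read.delim(file.name, skip = tab.lines[1] - 1, ...)
}

# Usage on a made-up log with a blank line inside the summary
tmp <- tempfile(fileext = ".log")
writeLines(c("Run summary", "", "version 1.2", "a\tb", "1\t2", "3\t4"), tmp)
df <- read.parameters.bounded(tmp)
print(df)
```

The file is still opened twice, but the first pass touches at most max.header lines, so its cost no longer grows with the size of the log.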



Source: https://stackoverflow.com/questions/3053095/how-to-skip-extra-lines-before-the-header-of-a-tab-delimited-delimited-file-in-r
