R: how to find/select files in a folder based on matching a specific column title

Submitted by 一世执手 on 2021-01-28 13:34:39

Question


Sorry for the generic question. I'm looking for pointers on sorting out a data folder that contains numerous .txt files. All of them have different titles, and the vast majority have the same dimensions, that is, the same number of columns. However, the pain is that some of the files, despite having the same number of columns, have different column names; in those files, some other variables were measured.

I want to weed out these files, and I cannot do so simply by comparing column counts. Is there any method where I can pass the name of a column and check which files in the directory have that column, so that I can move them into a different folder?

UPDATE:

I have created a dummy folder with files that reflect the problem; please see the link below to access the files on my Google Drive. In this folder, I have included 4 files that have the problem columns.

https://drive.google.com/drive/folders/1IDq7BwfQNkGb9y3RvwlLE3FeMQc38taD?usp=sharing

The problem is that the code seems to be able to find files matching the selection criteria, i.e. the actual names of the problem columns, but I cannot extract the real index of such files in the list. Any pointers?

library(data.table)

#read in the example file that has the problem columns
df_var <- read.delim("ctrl_S3127064__3S_DMSO_00_none.TXT", header = T, sep = "\t")

#read in a file that I want to use as reference
df_standard <- read.delim("ctrl__S162465_20190111_T8__3S_2DG_3mM_none.TXT", header = T, sep = "\t")

#get the names of columns of each file
standar.names <- names(df_standard)
var.names <- names(df_var)

same.titles <- var.names %in% standar.names

dff.titles <- !var.names %in% standar.names

#confirm that the only problem columns are columns 129, 130 and 131
mismatched.names <- colnames(df_var[129:131])

#visual check the names of the problematic columns
mismatched.names


# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)

# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                         sep = "\t",
                         header = T,
                         nrows = 2)
}

# get column names of all files
column_names <- lapply(l_files, names)

# get the unique problem column names
unique_names <- unique(mismatched.names)
unique_names[1]
# decide which files to remove
#here "to_keep" returns an integer vector that I don't understand
#I thought the numbers should represent the ID/index of the elements
#but I have fewer than 10 files, yet the numbers in to_keep are around 1000
#this is probably because it's matching against the index of the unlisted list
#but if I use to_keep <- which(column_names %in% unique_names[1]) it returns an empty vector

to_keep <- which(unlist(column_names)%in% unique_names[1])


#now if I slice files_in_wd using to_keep, files_to_keep returns NA NA NA
files_to_keep <- files_in_wd[to_keep]

#once I have a list of targeted files, I can move them into a new folder using file.move
library(filesstrings)
file.move(files_to_keep, "C:/Users/mli/Desktop/weeding/need to reanalysis" )

Answer 1:


If you can distinguish the files you'd like to keep from those you'd like to drop based on the column names, you could use something along these lines:

# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files")

# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)

# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                             sep = ';',
                             header = T,
                             nrows = 2)
}

# get column names of all files
column_names <- lapply(l_files, names)
# get the unique sets of column names across files
unique_names <- unique(column_names)
# decide which files to keep
to_keep <- which(column_names %in% unique_names[1])

files_to_keep <- files_in_wd[to_keep]

If you have many files, you should probably avoid the loop or only read in the header of each file.
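For example, rather than reading the full tables, you could read only the first line of each file with readLines() and split it into column names (a minimal sketch; header_of is a hypothetical helper and it assumes tab-separated files with the column names on line 1):

# minimal sketch: read only the header line of each file instead of the full table
header_of <- function(path) {
  first_line <- readLines(path, n = 1)          # read just the first line
  strsplit(first_line, "\t", fixed = TRUE)[[1]] # split it into column names
}

column_names <- lapply(files_in_wd, header_of)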

edit after your comment:

  • by adding nrows = 2 the code only reads the first 2 rows plus the header.
  • I assume that the first file in the folder has the structure you'd like to keep; that's why column_names is checked against unique_names[1].
  • files_to_keep contains the names of the files you'd like to keep.
  • you could try running this on a subset of your data to see if it works, and worry about efficiency later. A vectorized approach might work better, I think (see the sketch after this list).
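For instance, once column_names holds one character vector of names per file, the whole comparison can be done in a single vapply() call (a sketch, assuming the first file has the header you want to keep; reference and same_header are hypothetical names):

# sketch of a vectorized check: compare every file's header to the reference
# (assumes column_names is a list with one character vector of column names
# per file, and that file #1 has the desired header)
reference <- column_names[[1]]

same_header <- vapply(column_names,
                      function(x) identical(x, reference),
                      logical(1))

files_to_keep <- files_in_wd[same_header]   # files whose header matches
files_to_move <- files_in_wd[!same_header]  # files with a different header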

edit: This code works with your dummy-data.

library(filesstrings)

# set working directory to folder with generic text files
setwd("C:/Users/tester/Desktop/generic-text-files/dummyset")

# get current working directory and list all files in this directory
wd <- getwd()
files_in_wd <- list.files(wd)

# create an empty list and read in all files from wd
l_files <- list()
for(i in seq_along(files_in_wd)){
  l_files[[i]] <- read.delim(file = files_in_wd[i],
                             sep = "\t",
                             header = T,
                             nrows = 2,
                             encoding = "UTF-8",
                             check.names = FALSE
                            )
}

# get column names of all files
column_names <- lapply(l_files, names)
# decide which files to keep
to_keep <- column_names[[1]] # e.g. column names of file #1 are ok

# check if the other files have the same header:
df_filehelper <- data.frame('fileindex' = seq_along(files_in_wd),
  'filename' = files_in_wd,
  'keep' = NA)

for(i in 2:length(files_in_wd)){
  df_filehelper$keep[i] <- identical(to_keep, column_names[[i]])
}

df_filehelper$keep[1] <- TRUE # keep the original file used for selecting the right columns

# move files out of the current folder:
files_to_move <- df_filehelper$filename[!df_filehelper$keep] # selects files that are not to be kept

file.move(files_to_move, "C:/Users/tester/Desktop/generic-text-files/dummyset/testsubfolder/")



Answer 2:


Due to the large number and size of files it might be worth looking at alternatives to R, e.g. in bash:

for f in ctrl*.txt
do
  # compare the header (first line) of each file to the header of the known-good file
  # note: `md5` is the macOS command; on Linux, use `md5sum` instead
  if [[ "$(head -1 ctrl__S162465_20190111_T8__3S_2DG_3mM_none.txt | md5)" != "$(head -1 "$f" | md5)" ]]
    then echo "$f"
  fi
done

This command compares the column names of the 'good file' to the column names of every file and prints out the names of files that do not match.



Source: https://stackoverflow.com/questions/64506027/r-how-to-find-select-files-in-a-folder-based-on-matching-specific-column-title
