Subset data in a large list based on filename of the dataframes in the list

Question

I'm working with a large list that contains 450 data frames. Here is an example of their names:

ALL_SM51_SE1_hourly, ALL_SM201_SE1_hourly, ALL_SM501_SE1_hourly
ALL_SM51_SE2_hourly, ALL_SM201_SE2_hourly, ALL_SM501_SE2_hourly
...................................................................
ALL_SM51_SE150_hourly, ALL_SM201_SE150_hourly, ALL_SM501_SE150_hourly

The data frames contain measured soil moisture data at different depths (5 cm, 20 cm, 50 cm, represented by "SM51", "SM201", "SM501" in the filenames) from 150 sensors (represented by "SE1", "SE2", "SE3", ... in the filenames), which is why I have 450 data frames stored in a list.

What I would like to do: I want to create a new list (a subset) for each sensor, each containing 3 elements. So I want one list for each of SE1, SE2, SE3, ..., SE150 with the corresponding measuring depths.

I already searched for an appropriate answer to my question, but I only found answers that subset data by specific values, whereas I want to subset by the filenames.

Does anyone know how to do this?

Answer 1:

Using a regular expression you can extract the sensor number of each element (un.se) and paste "SE" in front of the unique values to get new.names. The original list lst can then be split by sensor, and within each sensor the elements are ordered by depth and converted into data.frames.

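# Sensor number of each element, e.g. "ALL_SM51_SE1_hourly" -> "1"
un.se <- gsub(".*SE(\\d+).*", "\\1", names(lst))
new.names <- paste0("SE", unique(un.se))
# Pin the factor levels so the groups stay in order of appearance (1, 2, ..., 150)
# instead of being sorted as strings ("1", "10", "100", ...)
tmp <- setNames(split(lst, factor(un.se, levels = unique(un.se))), new.names)
res <- lapply(tmp, function(x) {
  # Depth code as a number (51, 201, 501) so that the ordering is numeric
  nm <- as.numeric(gsub(".*SM(\\d+).*", "\\1", names(x)))
  o <- order(nm)
  # Reorder data and names together: 51 -> "d5", 201 -> "d20", 501 -> "d50"
  setNames(lapply(x[o], data.frame), paste0("d", gsub("1$", "", nm[o])))
})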

Explanation of the gsub regex:

In the regex, .* matches any run of characters, here everything up to the literal SE. The parentheses ( ) then define a group, in which \\d+ matches one or more digits. In the second gsub argument, \\1 is a back-reference to that first group; because the pattern matches the whole string, the whole string is replaced by the digits alone. The resulting un.se is therefore the number found after SE in each name (see: https://regex101.com/r/zuO8Ts/1; and note that we need double escapes \\ in R).
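As a small illustration of the back-reference, the sketch below applies the same two patterns to three names taken from the question. The variable names nms, sensor and depth.code are made up for this example.

```r
# Example names in the format used in the question
nms <- c("ALL_SM51_SE1_hourly", "ALL_SM201_SE2_hourly", "ALL_SM501_SE150_hourly")

# The pattern matches the whole string; "\\1" keeps only the digits after "SE"
sensor <- gsub(".*SE(\\d+).*", "\\1", nms)
sensor
# [1] "1"   "2"   "150"

# Same idea for the depth code after "SM"
depth.code <- gsub(".*SM(\\d+).*", "\\1", nms)
depth.code
# [1] "51"  "201" "501"
```

Both results are character vectors, not numbers, which matters as soon as they are sorted.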

The result is a list with one sublist per sensor, each holding one data frame per depth.
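One point worth checking when there are many sensors: the extracted numbers are character strings, and both split() and order() sort strings character by character rather than numerically. The sketch below uses made-up vectors (se, vals, codes) to show the effect and two possible ways around it; it is an illustration, not part of the answer's code.

```r
# Hypothetical sensor numbers as character strings, in order of appearance
se <- c("1", "2", "10", "1", "2", "10")
vals <- as.list(seq_along(se))

# Grouping by a plain character vector: groups are sorted as strings,
# so "10" typically lands before "2"
names(split(vals, se))

# Pinning the factor levels keeps the order of first appearance
names(split(vals, factor(se, levels = unique(se))))

# The same applies to the depth codes: compare string and numeric ordering
codes <- c("51", "201", "501")
order(codes)              # string ordering, depends on character-wise comparison
order(as.numeric(codes))  # numeric ordering: 51 < 201 < 501
```

Whenever names are built from unique() while the data are reordered by split() or order(), it is safer to reorder names and data with the same index.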

Result

res
# $SE1
# $SE1$d5
#   x1 x2 x3
# 1  1  2  3
# 
# $SE1$d20
#   x1 x2 x3
# 1  1  2  3
# 
# $SE1$d50
#   x1 x2 x3
# 1  1  2  3
# 
# 
# $SE2
# $SE2$d5
#   x1 x2 x3
# 1  1  2  3
# 
# $SE2$d20
#   x1 x2 x3
# 1  1  2  3
# 
# $SE2$d50
#   x1 x2 x3
# 1  1  2  3

Toy data

lst <- list(
  ALL_SM51_SE1_hourly  = list(x1 = 1, x2 = 2, x3 = 3),
  ALL_SM201_SE1_hourly = list(x1 = 1, x2 = 2, x3 = 3),
  ALL_SM501_SE1_hourly = list(x1 = 1, x2 = 2, x3 = 3),
  ALL_SM51_SE2_hourly  = list(x1 = 1, x2 = 2, x3 = 3),
  ALL_SM201_SE2_hourly = list(x1 = 1, x2 = 2, x3 = 3),
  ALL_SM501_SE2_hourly = list(x1 = 1, x2 = 2, x3 = 3)
)
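Because all elements of the toy data are identical, it cannot reveal a mislabelled or misplaced element. To check the grouping at the full size of 450 elements, a larger stand-in list can be generated. This is only a sketch: the names big, grid, nms, se and by.sensor are invented here, and the element contents are placeholders rather than real soil moisture data.

```r
# Hypothetical stand-in for the real data: 3 depth codes x 150 sensors
grid <- expand.grid(sm = c(51L, 201L, 501L), se = 1:150)
nms <- sprintf("ALL_SM%d_SE%d_hourly", grid$sm, grid$se)

# Each element gets distinct values so that mix-ups would be visible
big <- setNames(
  lapply(seq_along(nms), function(i) list(x1 = i, x2 = i, x3 = i)),
  nms
)
length(big)

# Group by sensor number, keeping the order of first appearance
se <- gsub(".*SE(\\d+).*", "\\1", names(big))
by.sensor <- split(big, factor(se, levels = unique(se)))
names(by.sensor) <- paste0("SE", names(by.sensor))

# Every sensor should hold exactly three elements, all from that sensor
all(lengths(by.sensor) == 3)
names(by.sensor[["SE150"]])
```

A single sensor can then be pulled out with by.sensor$SE1 or by.sensor[["SE150"]].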

Source: https://stackoverflow.com/questions/60926815/subset-data-in-a-large-list-based-on-filename-of-the-dataframes-in-the-list

Tags

list

subset

filenames