Question
I have a data frame:
df <- data.frame(
Otherspp = c("suck SD", "BT", "SD RS", "RSS"),
Dominantspp = c("OM", "OM", "RSS", "CH"),
Commonspp = c(" ", " ", " ", "OM"),
Rarespp = c(" ", " ", "SD", "NP"),
NP = rep("northern pikeminnow|NORTHERN PIKEMINNOW|np|NP|npm|NPM", 4),
OM = rep("steelhead|STEELHEAD|rainbow trout|RAINBOW TROUT|st|ST|rb|RB|om|OM", 4),
RSS = rep("redside shiner|REDSIDE SHINER|rs|RS|rss|RSS", 4),
suck = rep("suckers|SUCKERS|sucker|SUCKER|suck|SUCK|su|SU|ss|SS", 4)
)
I need to use the columns populated with common fish codes/names (NP, OM, RSS, suck) to evaluate the expressions in the first four columns and output a 1/0 based on each of those columns, if the expression is met EXACTLY. The code I have below does not match full words (only partial) and provides incorrect data (see resulting tibble below).
df %>%
rowwise() %>%
transmute_at(vars(NP, OM, RSS, suck),
funs(case_when(
grepl(., Dominantspp) ~ "1",
grepl(., Commonspp) ~ "1",
grepl(., Rarespp) ~ "1",
grepl(., Otherspp) ~ "1",
TRUE ~ "0"))) %>%
ungroup()
Result: see that in row three, both "suck" and "RSS" receive a "1".
# A tibble: 4 x 4
NP OM RSS suck
<chr> <chr> <chr> <chr>
1 0 1 0 1
2 0 1 0 0
3 0 0 1 1
4 1 1 1 1
Desired output:
NP OM RSS suck
1 0 1 0 1
2 0 1 0 0
3 0 0 1 0
4 1 1 1 0
Answer 1:
The quickest fix that keeps your current approach is to wrap each regex in word boundaries, written as \\b in an R string, at the beginning and end:
df <- data.frame(
Otherspp = c("suck SD", "BT", "SD RS", "RSS"),
Dominantspp = c("OM", "OM", "RSS", "CH"),
Commonspp = c(" ", " ", " ", "OM"),
Rarespp = c(" ", " ", "SD", "NP"),
NP = rep("\\b(northern pikeminnow|NORTHERN PIKEMINNOW|np|NP|npm|NPM)\\b", 4),
OM = rep("\\b(steelhead|STEELHEAD|rainbow trout|RAINBOW TROUT|st|ST|rb|RB|om|OM)\\b", 4),
RSS = rep("\\b(redside shiner|REDSIDE SHINER|rs|RS|rss|RSS)\\b", 4),
suck = rep("\\b(suckers|SUCKERS|sucker|SUCKER|suck|SUCK|su|SU|ss|SS)\\b", 4),
stringsAsFactors = FALSE
)
With the boundaries in place, the regular expressions match only whole words, so your existing transmute_at() code returns the desired output.
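To see why the boundaries matter, here is a minimal base R sketch comparing the unanchored and anchored versions of the suck pattern. The variable names pattern_plain and pattern_bounded are only illustrative and do not appear in the original code.

```r
# Unanchored pattern, as in the question
pattern_plain <- "suckers|SUCKERS|sucker|SUCKER|suck|SUCK|su|SU|ss|SS"

# Same alternatives, grouped and wrapped in word boundaries
pattern_bounded <- paste0("\\b(", pattern_plain, ")\\b")

grepl(pattern_plain, "RSS")        # TRUE: "SS" matches inside "RSS"
grepl(pattern_bounded, "RSS")      # FALSE: "SS" is not a whole word here
grepl(pattern_bounded, "suck SD")  # TRUE: "suck" is a whole word
```

Note that the closing \\b has to sit outside the parentheses. Placed inside, it would apply only to the last alternative.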
Having said that, I don't think this is necessarily the best way to approach the problem (rowwise() is rarely recommended today, and this approach won't scale well to many fish codes). I think you'd have an easier time working with this data if you standardized it to a tidy format, with one row per combination of row and code:
library(dplyr)
library(tidyr)
library(tidytext)
row_codes <- df %>%
select(Otherspp:Rarespp) %>%
mutate(row = row_number()) %>%
gather(type, codes, -row) %>%
unnest_tokens(code, codes, token = "regex", pattern = " ")
This would result in:
row type code
1 1 Dominantspp om
2 1 Otherspp suck
3 1 Otherspp sd
4 2 Dominantspp om
5 2 Otherspp bt
6 3 Dominantspp rss
7 3 Otherspp sd
8 3 Otherspp rs
9 3 Rarespp sd
10 4 Commonspp om
11 4 Dominantspp ch
12 4 Otherspp rss
13 4 Rarespp np
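At this point, the codes are much easier to work with, and you don't need regular expressions anymore. Note that unnest_tokens() lowercases the tokens by default, which is why the codes appear in lowercase. For example, you could inner_join this table to a table of the fish codes.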
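As a rough sketch of that join, assuming a lookup table of fish codes that the original post does not provide: fish_codes below is hypothetical, covers only single-token lowercase codes, and row_codes is typed in by hand from the output shown above so the example runs on its own.

```r
library(dplyr)
library(tidyr)

# Tokenized codes, copied from the output shown above
row_codes <- data.frame(
  row  = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
  type = c("Dominantspp", "Otherspp", "Otherspp", "Dominantspp", "Otherspp",
           "Dominantspp", "Otherspp", "Otherspp", "Rarespp",
           "Commonspp", "Dominantspp", "Otherspp", "Rarespp"),
  code = c("om", "suck", "sd", "om", "bt", "rss", "sd", "rs", "sd",
           "om", "ch", "rss", "np"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: one row per accepted lowercase code
fish_codes <- data.frame(
  code    = c("np", "npm", "om", "st", "rb", "rs", "rss", "suck", "su", "ss"),
  species = c("NP", "NP", "OM", "OM", "OM", "RSS", "RSS", "suck", "suck", "suck"),
  stringsAsFactors = FALSE
)

result <- row_codes %>%
  inner_join(fish_codes, by = "code") %>%  # exact matches only, no regex
  distinct(row, species) %>%               # one flag per row and species
  mutate(present = 1) %>%
  spread(species, present, fill = 0) %>%   # wide 1/0 layout
  select(row, NP, OM, RSS, suck)

result
```

On this data the result matches the desired output in the question. Two caveats apply: a row with no recognized code is dropped by the inner join, and multi-word names such as "northern pikeminnow" would need a different tokenization than splitting on spaces.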
Source: https://stackoverflow.com/questions/47933639/transmute-new-columns-based-on-exact-match-of-multiple-words-in-string