I have the following data frame from which I would like to extract rows based on matching strings.
> GEMA_EO5
gene_symbol fold_EO p_value
A different approach is to recognize the duplicate entries in RefSeq_ID as an attempt to represent two data base tables in a single data frame. So if the original table is csv, then normalize the data into two tables
Anno <- cbind(key = seq_len(nrow(csv)), csv[,names(csv) != "RefSeq_ID"])
key0 <- strsplit(csv$RefSeq_ID, ",")
RefSeq <- data.frame(key = rep(seq_along(key0), sapply(key0, length)),
ID = unlist(key0))
and recognize that the query is a subset (select) on the RefSeq table, followed by a merge (join) with Anno
l <- c( "NM_013433", "NM_001386", "NM_020385")
merge(Anno, subset(RefSeq, ID %in% l))[, -1]
leading to
> merge(Anno, subset(RefSeq, ID %in% l))[, -1]
gene_symbol fold_EO p_value BH_p_value ID
1 REXO4 3.245317 1.78e-27 2.281367e-24 NM_020385
2 TNPO2 4.707600 1.60e-23 1.538000e-20 NM_013433
3 DPYSL2 5.097382 1.29e-22 1.062868e-19 NM_001386
Perhaps the goal is to merge with a `Master' table, then
Master <- cbind(key = seq_len(nrow(csv)), csv)
merge(Master, subset(RefSeq, ID %in% l))[,-1]
or similar.