How to calculate percentage of cells in data frame that start with sequence in R?

Question

I have data that looks like:

Row1      Row2      Row3      Row4      Row5      Row6      Row7
abc89     abc62     67        abc513    abc512    abc81     abc10
abc6      pop       abc11     abc4      giant     13        abc15
abc90     abc16     abc123    abc33     abc22     abc08     9
111       abc15     abc72     abc36     abc57     abc9      abc55

I would like to calculate the percentage of cells in the data frame that begin with "abc". For example, there are 28 cells in total here, which can be obtained with prod(dim(df)). So I need the number of cells that start with "abc", divided by prod(dim(df)). Here the answer would be 22/28 ≈ 0.786. How can this be done in R?

Answer 1:

I would use:

> mean(grepl("^abc",unlist(dat)))
[1] 0.7857143

Using mean means you don't have to compute the numerator and denominator separately. grepl is the logical version of grep: it returns TRUE for each element that matches "^abc" (i.e., a string beginning with abc). Recall that the mean of a Bernoulli (0/1) vector is the proportion of successes; multiply by 100 if you want a percentage.
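To make the numerator/denominator point concrete, here is a minimal sketch. It assumes dat is built as in the Data block of the second answer; the names hits, prop and pct are only illustrative.

```r
# Rebuild the example data (same values as in the question).
dat <- read.table(text = "Row1 Row2 Row3 Row4 Row5 Row6 Row7
abc89 abc62 67 abc513 abc512 abc81 abc10
abc6 pop abc11 abc4 giant 13 abc15
abc90 abc16 abc123 abc33 abc22 abc08 9
111 abc15 abc72 abc36 abc57 abc9 abc55", header = TRUE)

# One TRUE/FALSE per cell: does the cell start with "abc"?
hits <- grepl("^abc", unlist(dat))

sum(hits)      # numerator: number of matching cells
length(hits)   # denominator: total number of cells, same as prod(dim(dat))

# mean() of a logical vector is sum(hits) / length(hits)
prop <- mean(hits)
pct  <- 100 * prop   # the same value expressed as a percentage
```

Here sum(hits) is 22 and length(hits) is 28, so prop matches the 0.7857143 shown above.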

If you wanted to do this by row or by column you'd use apply, e.g. apply(dat, 1, function(x) mean(grepl("^abc", x))) to get the row-wise means; pass 2 instead of 1 as the margin to get the column-wise means.
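A short sketch of both margins, assuming the same dat as in the Data block of the second answer. The helper name starts_abc is hypothetical and not part of the original answer.

```r
# Rebuild the example data (same values as in the question).
dat <- read.table(text = "Row1 Row2 Row3 Row4 Row5 Row6 Row7
abc89 abc62 67 abc513 abc512 abc81 abc10
abc6 pop abc11 abc4 giant 13 abc15
abc90 abc16 abc123 abc33 abc22 abc08 9
111 abc15 abc72 abc36 abc57 abc9 abc55", header = TRUE)

# Hypothetical helper: proportion of elements of x that start with "abc".
starts_abc <- function(x) mean(grepl("^abc", x))

row_prop <- apply(dat, 1, starts_abc)  # one value per row
col_prop <- apply(dat, 2, starts_abc)  # one value per column
```

With this data the row proportions are 6/7, 4/7, 6/7 and 6/7, and every column gives 0.75 except Row4, where all four cells match.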

Answer 2:

You can use grep to search for the pattern of interest (a string starting with "abc"):

length(grep("^abc", as.character(unlist(dat)))) / prod(dim(dat))
# [1] 0.7857143

You can get row counts with:

(row.counts <- apply(dat, 1, function(x) length(grep("^abc", as.character(x)))))
# [1] 6 4 6 6
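As an alternative to calling apply, the counts can also be sketched by building one logical matrix and summing over it. This is a variation on the answer rather than its original code, and the names m, hits, row_counts and col_counts are illustrative.

```r
# Rebuild the example data (same values as in the question).
dat <- read.table(text = "Row1 Row2 Row3 Row4 Row5 Row6 Row7
abc89 abc62 67 abc513 abc512 abc81 abc10
abc6 pop abc11 abc4 giant 13 abc15
abc90 abc16 abc123 abc33 abc22 abc08 9
111 abc15 abc72 abc36 abc57 abc9 abc55", header = TRUE)

m <- as.matrix(dat)  # character matrix, 4 x 7

# grepl() returns a plain vector, so restore the matrix shape.
# Both as.matrix() and matrix() are column-major, so cells stay aligned.
hits <- matrix(grepl("^abc", m), nrow = nrow(m), dimnames = dimnames(m))

row_counts <- rowSums(hits)  # matches per row
col_counts <- colSums(hits)  # matches per column

# Overall proportion from the same matrix.
overall <- sum(hits) / length(hits)
```

The row counts agree with the apply result above (6 4 6 6), and the same hits matrix gives column counts and the overall proportion without searching the data again.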

Data:

dat = read.table(text="Row1     Row2     Row3     Row4     Row5     Row6     Row7
 abc89     abc62     67        abc513    abc512    abc81     abc10
 abc6      pop       abc11     abc4      giant     13        abc15
 abc90     abc16     abc123    abc33     abc22     abc08     9
 111       abc15     abc72     abc36     abc57     abc9      abc55", header=TRUE)

Source: https://stackoverflow.com/questions/31775978/how-to-calculate-percentage-of-cells-in-data-frame-that-start-with-sequence-in-r

Tags

dataframe

percentage