Fast count of digits in a string, in R

Submitted by 狂风中的少年 on 2020-01-24 17:14:13

Question


Is there a more efficient way to count the occurrences of the most frequently appearing digit in a string? My R code below calls gsub() 10 times for each string, and I have gazillions of strings to process.

> txt = 'wow:011 test 234567, abc=8951111111111aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
> max(vapply(0:9, function(i) nchar(gsub(paste0('[^',i,']'), '', txt)), integer(1L)))
[1] 12

I don't care about the digit itself. I just want the count of the most frequent one.

I would prefer to use R's core packages, unless some external package offers a significant outperformance. I use x64 R version 3.4.1 (2017-06-30) on Windows 10.

UPDATE:

Here is an apples-to-apples performance comparison of the excellent suggestions below.

> microbenchmark(
+     original = max(vapply(0:9, function(i) nchar(gsub(paste0('[^',i,']'), '', s)), integer(1L))),
+     strsplit = max(table(unlist(strsplit(gsub("\\D+", "", s), "")))),
+     gregexpr = max(vapply(0:9, function(d) sum(unlist(gregexpr(d, s)) > 0), integer(1L))),
+     stringi = max(vapply(0:9, function(x) stri_count_fixed(s, x), integer(1L))),
+     raw=max(vapply(0x30:0x39, function(x) sum(charToRaw(s)==x), integer(1L))),
+     tabulate = max(tabulate(as.integer(charToRaw(paste('a',s))))[48:57]),
+     times=1000L)
Unit: microseconds
     expr     min       lq      mean   median       uq      max neval
 original 476.172 536.9770 567.86559 554.8600 580.0530 8054.805  1000
 strsplit 366.071 422.3660 448.69815 445.3810 469.6410  798.389  1000
 gregexpr 302.622 345.2325 423.08347 360.3170 378.0455 9082.416  1000
  stringi 112.589 135.2940 149.82411 144.6245 155.1990 3910.770  1000
      raw  58.161  71.5340  83.57614  77.1330  82.1090 6249.642  1000
 tabulate  18.039  29.8575  35.20816  36.3890  40.7430   72.779  1000

Why the weird calculation?

This odd formula helps identify some plain-looking fake identifiers entered by users. For example, some uncreative users (I'm guilty of this as well) fill in the same digit over and over for their phone numbers. In data analysis, it is often better to have no phone number at all than a fake phone number that changes from one dataset to another. Naturally, if the identifier has a check digit, that would be an additional easy validation.


Answer 1:


Using charToRaw to count digits in a string:

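# To count only digits, keep the ASCII codes for the characters 0 to 9, which are 48 to 57 (see https://ascii.cl/).
# na.rm = TRUE is needed because tabulate() only creates bins up to the largest byte value present,
# so [48:57] yields NA entries when the string has no byte with code 57 or higher (e.g. '22222222222').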
txt = 'wow:011 test 234567, abc=8951111111111aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
max(tabulate(as.integer(charToRaw(txt)))[48:57], na.rm = TRUE)
#[1] 12

txt='22222222222'
max(tabulate(as.integer(charToRaw(txt)))[48:57], na.rm = TRUE)
#[1] 11
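
As a possible alternative to na.rm = TRUE, the bin count can be fixed with tabulate()'s nbins argument, so that positions 48 to 57 always exist and byte values above 57 are simply ignored. The following is only a sketch; the function name max_digit_count is made up for illustration and does not come from the answer.

```r
# Hypothetical helper (name is an assumption, not from the original answer).
# Fixing nbins at 57 guarantees bins 48..57 exist, so no NA can appear,
# and bytes with codes above 57 (letters, UTF-8 continuation bytes) are ignored.
max_digit_count <- function(s) {
  codes  <- as.integer(charToRaw(s))      # one integer per byte of the string
  counts <- tabulate(codes, nbins = 57L)  # counts for byte values 1..57
  max(counts[48:57])                      # 48..57 are the ASCII digits '0'..'9'
}

txt <- 'wow:011 test 234567, abc=8951111111111aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
max_digit_count(txt)            # 12, matching the result above
max_digit_count('22222222222')  # 11
max_digit_count('no digits')    # 0
```

With this variant a string containing no digits at all yields 0 rather than NA or a warning.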

@Andrew has already run a benchmark, which shows that the charToRaw approach is the fastest of the methods compared for counting digits in a string.

If you want the count of the most frequent character of any kind, not just digits, simply drop the [48:57] filter on the ASCII codes.

txt = 'wow:011 test 234567, abc=8951111111111aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
max(tabulate(as.integer(charToRaw(txt))))
#[1] 32

txt='22222222222'
max(tabulate(as.integer(charToRaw(txt))))
#[1] 11



Answer 2:


max(table(unlist(strsplit(gsub("\\D+", "", txt), ""))))
#OR
max(sapply(0:9, function(d) sum(unlist(gregexpr(d, txt)) > 0)))
#[1] 12

Or, if you do care about the digit:

with(rle(sort(unlist(strsplit(gsub("\\D+", "", txt), "")))),
     setNames(c(max(lengths)), values[which.max(lengths)]))
# 1 
#12 
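
The same "which digit" question can probably also be answered with the raw-byte counting used in the other answers. The sketch below is an assumption-laden illustration, not part of this answer: the function name most_frequent_digit is hypothetical, ties are resolved in favour of the smallest digit (the behaviour of which.max), and a string without digits returns a count of 0 labelled "0".

```r
# Hypothetical helper (name is an assumption, not from the original answers).
# Returns the highest digit count as a length-1 integer vector named by the digit.
most_frequent_digit <- function(s) {
  # nbins = 57 ensures bins 48..57 always exist, even for short strings
  counts <- tabulate(as.integer(charToRaw(s)), nbins = 57L)[48:57]
  names(counts) <- 0:9           # bin 48 is '0', ..., bin 57 is '9'
  counts[which.max(counts)]      # first maximum wins on ties
}

txt <- 'wow:011 test 234567, abc=8951111111111aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
most_frequent_digit(txt)
#  1
# 12
```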

library(microbenchmark)
set.seed(42)
t = paste(sample(c(letters, 0:9), 1e5, TRUE), collapse = "")
microbenchmark(original = max(sapply(0:9, function(i) nchar(gsub(paste0('[^',i,']'), '', t)))),
               strsplit = max(table(unlist(strsplit(gsub("\\D+", "", t), "")))),
               gregexpr = max(sapply(0:9, function(d) sum(unlist(gregexpr(d, t)) > 0))))
#Unit: milliseconds
#     expr        min         lq       mean     median         uq       max neval cld
# original 215.371764 220.862807 233.368696 228.757529 239.809292 308.94393   100   c
# strsplit  11.224226  11.856327  12.956749  12.320586  12.893789  30.61072   100  b 
# gregexpr   7.542871   7.958818   8.680391   8.302971   8.728735  13.79921   100 a  



Answer 3:


Building on Santosh's approach, this is significantly faster than the other options...

# 48:57 picks out the ASCII digits; pass na.rm = TRUE to max() if a string may contain no byte with code 57 or higher
max(tabulate(as.integer(charToRaw(txt)))[48:57])

library(microbenchmark)
set.seed(42)
t = paste(sample(c(letters, 0:9), 1e5, TRUE), collapse = "")
microbenchmark(original = max(sapply(0:9, function(i) nchar(gsub(paste0('[^',i,']'), '', t)))),
               strsplit = max(table(unlist(strsplit(gsub("\\D+", "", t), "")))),
               gregexpr = max(sapply(0:9, function(d) sum(unlist(gregexpr(d, t)) > 0))),
               tabulate = max(tabulate(as.integer(charToRaw(t)))[48:57]))

Unit: milliseconds
     expr        min         lq        mean     median          uq       max neval
 original 807.947235 860.112901 1169.744733 935.169003 1154.057709 3513.1401   100
 strsplit  34.100444  36.453163   55.457896  42.881400   58.208820  390.1453   100
 gregexpr  27.205510  29.333569   42.616817  33.146572   49.840566  246.9001   100
 tabulate   1.189702   1.208321    2.150022   1.226319    1.297068   37.4300   100 
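
Since the question mentions a very large number of strings, the tabulate approach would typically be applied once per element of a character vector. A minimal sketch follows, assuming the strings are held in a character vector; the name max_digit_counts is hypothetical, and nbins = 57L is used here (instead of na.rm = TRUE) so that short or digit-free strings give 0.

```r
# Hypothetical wrapper (name is an assumption): one result per input string.
max_digit_counts <- function(strings) {
  vapply(strings, function(s) {
    # Count bytes with codes 1..57, then keep only the ASCII digits (48..57)
    max(tabulate(as.integer(charToRaw(s)), nbins = 57L)[48:57])
  }, integer(1L), USE.NAMES = FALSE)
}

ids <- c("a1b11", "", "2222", "no digits")
max_digit_counts(ids)
# [1] 3 0 4 0
```

Because charToRaw works on bytes, multi-byte UTF-8 characters do not interfere: their bytes all have values of 128 or more and therefore never fall into the 48 to 57 range.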


Source: https://stackoverflow.com/questions/47516752/fast-count-of-digits-in-a-string-in-r
