Regular Expression in R - Spaces before and after the text

Submitted by 强颜欢笑 on 2021-01-29 13:06:07

Question


I have a stats file with lines like this: "system.l2.compressor.compression_size::1 0 # Number of blocks that compressed to fit in 1 bits"

In this case, 0 is the value I care about. The amount of whitespace between the statistic and the text before and after it varies from line to line.

My code for extracting the statistic looks something like this:

if (grepl("system.l2.compressor.compression_size::1", line))
    {
      matches <- regmatches(line, gregexpr("[[:digit:]]+\\.*[[:digit:]]", line))
      compression_size_1 = as.numeric(unlist(matches))[1]
    }

The reason I have this regular expression

[[:digit:]]+\\.*[[:digit:]]

is that in other cases the statistic is a decimal number. I don't expect decimals in lines like the example above, but it would be nice to have a "fail safe" regex that captures that case too.

In this case I get "2." "1" "0" "1" as answers. How can I restrict it so that I can get only the true stat as the answer?

I tried using something like this

"[:space:][[:digit:]]+\\.*[[:digit:]][:space:]"

or other variations, but either I get back NA, or the same numbers but with spaces surrounding them.


Answer 1:


Here are a few base R possibilities, depending on how your data is set up. In the future, it helps to provide a reproducible example; definitely provide one if these don't work. If the pattern works, it will probably be faster to adapt it to a stringr or stringi function. Good luck!

# The digits that come after "::", a run of non-space characters, and then whitespace
gsub(".*::\\S+\\s+(\\d+).*", "\\1", strings)
[1] "58740" "58731" "70576"

# Getting the digit(s) following a space and preceding a space and pound sign
gsub(".*\\s+(\\d+)\\s+#.*", "\\1", strings)
[1] "58740" "58731" "70576"

# Combining the two (this is the most restrictive)
gsub(".*::\\S+\\s+(\\d+)\\s+#.*", "\\1", strings)
[1] "58740" "58731" "70576"

# Extracting the first digits surrounded by spaces (least restrictive)
gsub(".*?\\s+(\\d+)\\s+.*", "\\1", strings)
[1] "58740" "58731" "70576"

# Or, using stringr for the last pattern:
as.numeric(stringr::str_extract(strings, "\\s+\\d+\\s+"))
[1] 58740 58731 70576
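
The patterns above capture integers only, while the question also asks for a pattern that survives decimal values. The following is a minimal sketch of one way to allow for that, assuming the statistic is always the first whitespace-delimited number after the stat name and is followed by "#". The names stat_lines and extract_stat, and the second sample line, are hypothetical and not part of the original data.

```r
# Hypothetical sample lines; the second one is invented to show a decimal value.
stat_lines <- c(
  "system.l2.compressor.compression_size::1     0   # Number of blocks that compressed to fit in 1 bits",
  "system.l2.example_ratio::total  0.125 # hypothetical decimal statistic"
)

# Capture digits, an optional decimal point, and optional trailing digits,
# but only when the number sits between the stat name and the "#" comment.
extract_stat <- function(x) {
  pattern <- "^\\S+\\s+([0-9]+\\.?[0-9]*)\\s+#.*$"
  as.numeric(sub(pattern, "\\1", x))
}

extract_stat(stat_lines)
```

If a line does not match the pattern, sub() returns it unchanged and as.numeric() yields NA with a warning, which makes unexpected lines easy to spot.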

EDIT: Explanation of the second one:

gsub(".*\\s+(\\d+)\\s+#.*", "\\1", strings)
  • .* - .=any character (with perl = TRUE, any character except \n); *=any number of times
  • \\s+ - \\s =whitespace; +=at least one instance (of the whitespace)
  • (\\d+) - ()=capture group, which you can reference later by its position (i.e., "\\1" returns the text matched by the first group); \\d=digit; +=at least one instance (of a digit)
  • \\s+# - \\s =whitespace; +=at least one instance (of the whitespace); # a literal pound sign
  • .* - .=any character (with perl = TRUE, any character except \n); *=any number of times

Because the pattern matches the entire line, replacing the match with "\\1" leaves only the captured digits.
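
To make the capture-group idea concrete, here is a small sketch that pulls the group out directly with regexec() and regmatches() instead of replacing the whole line. The variable names stat_line, m, whole_match and group_one are illustrative assumptions; the sample line comes from the data used in this answer.

```r
# One line from the sample data.
stat_line <- "system.l2.compressor.compression_size::256 58740 # Number of blocks that compressed to fit in 256 bits"

# regexec() reports the whole match plus each capture group;
# regmatches() turns those positions into strings.
m <- regmatches(stat_line, regexec("\\s+(\\d+)\\s+#", stat_line))[[1]]

whole_match <- m[1]  # the full match, including the surrounding whitespace and "#"
group_one   <- m[2]  # the text that "\\1" refers to in gsub()

as.numeric(group_one)
```

This avoids the leading and trailing .* entirely, since only the captured group is kept rather than the rest of the line being replaced away.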

Data:

strings <- c("system.l2.compressor.compression_size::256 58740 # Number of blocks that compressed to fit in 256 bits",
             "system.l2.compressor.encoding::Base*.8_1 58731 # Number of data entries that match encoding Base8_1",
             "system.l2.overall_hits::.cpu.data 70576 # number of overall hits")

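Since the statistic is the second whitespace-separated field in each of these lines, a regex-free alternative may also work. This is only a sketch under that assumption (it would break if the stat name itself contained spaces); the names stat_lines, fields and values are hypothetical.

```r
# The same three lines as in the data above, repeated so the snippet runs on its own.
stat_lines <- c(
  "system.l2.compressor.compression_size::256 58740 # Number of blocks that compressed to fit in 256 bits",
  "system.l2.compressor.encoding::Base*.8_1 58731 # Number of data entries that match encoding Base8_1",
  "system.l2.overall_hits::.cpu.data 70576 # number of overall hits"
)

# Split each line on runs of whitespace and keep the second field.
fields <- strsplit(trimws(stat_lines), "\\s+")
values <- as.numeric(vapply(fields, function(f) f[2], character(1)))

values
```

Because as.numeric() parses the field as-is, this handles decimal statistics without any change to the code.
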

Source: https://stackoverflow.com/questions/57384597/regular-expression-in-r-spaces-before-and-after-the-text
