How to use Regexp.union to match a character at the beginning of my string

▼魔方 西西 提交于 2019-12-24 06:04:07

问题


I'm using Ruby 2.4. I want to match an optional "a" or "b" character, followed by an arbitrary amount of white space, and then one or more numbers, but my regex's are failing to match any of these:

2.4.0 :017 > MY_TOKENS = ["a", "b"]
 => ["a", "b"]
2.4.0 :018 > str = "40"
 => "40"
2.4.0 :019 > str =~ Regexp.new("^[#{Regexp.union(MY_TOKENS)}]?[[:space:]]*\d+[^a-z^0-9]*$")
 => nil
2.4.0 :020 > str =~ Regexp.new("^#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+[^a-z^0-9]*$")
 => nil
2.4.0 :021 > str =~ Regexp.new("^#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+$")
 => nil

I'm stumped as to what I'm doing wrong.


回答1:


I believe you want to match a string that may contain any of the alternatives you defined in the MY_TOKENS, then 0+ whitespaces and then 1 or more digits up to the end of the string.

Then you need to use

Regexp.new("\\A#{Regexp.union(MY_TOKENS)}?[[:space:]]*\\d+\\z").match?(s)

or

/\A#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+\z/.match?(s)

When you use a Regexp.new, you should rememeber to double escape backslashes to define a literal backslash (e.g. "\d" is a digit matching pattern). In a regex literal notation, you may use a single backslash (/\d/).

Do not forget to match the start of a string with \A and end of string with \z anchors.

Note that [...] creates a character class that matches any char that is defined inside it: [ab] matches an a or b, [program] will match one char, either p, r, o, g, r, a or m. If you have multicharacter sequences in the MY_TOKENS, you need to remove [...] from the pattern.

To make the regex case insensitive, pass a case insensitive modifier to the pattern and make sure you use .source property of the Regex.union created regex to remove flags (thanks, Eric):

Regexp.new("(?i)\\A#{Regexp.union(MY_TOKENS).source}?[[:space:]]*\\d+\\z")

or

/\A#{Regexp.union(MY_TOKENS).source}?[[:space:]]*\d+\z/i

The regex created is /(?i-mx:\Aa|b?[[:space:]]*\d+\z)/ where (?i-mx) means the case insensitive mode is on and multiline (dot matches line breaks and verbose modes are off).




回答2:


If they are single characters, just use MY_TOKENS.join inside the character class:

MY_TOKENS = ["a", "b"]
str = "40"
first_regex = /^[#{MY_TOKENS.join}]?[[:space:]]*\d+[^a-z0-9]*$/
# /^[ab]?[[:space:]]*\d+[^a-z0-9]*$/ 
puts str =~ first_regex
# 0

You can also integrate the Regexp.union, it might lead to some unexpected bugs though, because the flags of the outer regexp won't apply to the inner one :

second_regex = /^#{Regexp.union(MY_TOKENS)}?[[:space:]]*\d+[^a-z0-9]*$/
# /^(?-mix:a|b)?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ second_regex
# 0

The above regex looks a lot like what you did, but using // instead of Regexp.new prevents you from having to escape the backslashes.

You could use Regexp#source to avoid this behaviour :

third_regex = /^(?:#{Regexp.union(MY_TOKENS).source})?[[:space:]]*\d+[^a-z0-9]*$/
# /^(?:a|b)?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ third_regex
# 0

or simply build your regex union :

fourth_regex = /^(?:#{MY_TOKENS.join('|')})?[[:space:]]*\d+[^a-z0-9]*$/
# /^(?:a|b)?[[:space:]]*\d+[^a-z0-9]*$/
puts str =~ fourth_regex
# 0

The 3 last examples should work fine if MY_TOKENS has words instead of just characters.

first_regex, third_regex and fourth_regex should all work fine with /i flag.

As an example :

first_regex = /^[#{MY_TOKENS.join}]?[[:space:]]*\d+[^a-z0-9]*$/i
"A 40" =~ first_regex
# 0


来源:https://stackoverflow.com/questions/43054542/how-to-use-regexp-union-to-match-a-character-at-the-beginning-of-my-string

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!