regex for N (1 to 3 digit) numbers in square brackets with commas+spaces between them

Posted by 有些话、适合烂在心里 on 2019-12-12 01:37:53

Question


I'm going to try to phrase this clearly... (I'm pretty new at regex). I'm working on a PDF document, with a program called AutoBookmark (from Evermap). I'm trying to set it up to link numbered citations to numbered references in a bibliography.

The goal is to match each numbered citation within brackets, and return that number within brackets, alone. In other words, if I have [85], I'd just return [85]. If I have [85, 93], I'd return both [85] and [93]. If there are more numbers in brackets, up to N numbers, I'd return N of them (in brackets). If there is a range, i.e., [85-93], I only need to return the first.

So it seems to me I'm asking this: match the number (1 to 3 digits), only if preceded by EITHER an opening bracket, OR another number followed by a comma and a space, but only if that number is preceded by an opening bracket OR by a number followed by a comma and a space, but only if... you get the picture. Iterate until you hit a bracket (then return the number) or a non-number, in which case, don't return the number. Is this something even reasonable to ask of a regular expression? Or, since I'm doing this in a PDF, must I write a JavaScript routine? (Which, BTW, I also don't know how to do!) Thanks! I know I'm a newbie at this, and I'm grateful for any thoughts.
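The "iterate until you hit a bracket" logic described above can also be written as two steps outside the regex: first find a whole bracketed group, then split it. The following is only a minimal Python sketch of that idea, not something AutoBookmark is known to support. The function name `extract_citations` and the exact citation format (1 to 3 digit numbers separated by a comma and a space, optionally written as a range) are assumptions for illustration.

```python
import re

# Step 1: a whole bracketed group such as [85], [85, 93] or [85-93].
# Assumed format: 1 to 3 digit numbers separated by ", ", each optionally
# followed by "-" and a range end.
GROUP = re.compile(r"\[(\d{1,3}(?:-\d{1,3})?(?:, \d{1,3}(?:-\d{1,3})?)*)\]")


def extract_citations(text):
    """Return every cited number as its own bracketed string (hypothetical helper)."""
    results = []
    for group in GROUP.finditer(text):
        # Step 2: split the group's contents on the comma+space separator.
        for item in group.group(1).split(", "):
            first = item.split("-")[0]  # for a range, keep only the first number
            results.append(f"[{first}]")
    return results


print(extract_citations("See [85], [85, 93] and [85-93], but not 12, 13] or [abc]."))
# → ['[85]', '[85]', '[93]', '[85]']
```

Because the opening bracket is part of the group match, a stray "12, 13]" without an opening bracket is never returned.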


Answer 1:


I have no experience with this program, but this should work in JavaScript, and therefore in other feature-minimal regex implementations.

\[?\s*(\d+)\s*(?=(?:,\s*\d+)+|\])(?=[^\[]*\]).

\[?          # Literal [, zero or 1 times
\s*          # Any number (*) of whitespace characters
(\d+)        # Any number of digits, one or more (+)
\s*          # Any number (*) of whitespace characters
(?=          # Positive lookahead; support for positive lookahead is key to this regex
  (?:        # Open non-capturing group
    ,\s*\d+  # Literal ",", any number of whitespace characters,
               # one or more digits
  )+         # Close non-capturing group, repeated one or more times (+)
|            # or
  \]         # Literal "]"
)            # Close positive lookahead
(?=          # Open another positive lookahead
  [^\[]*\]   # Any number of characters that are not "[", as long as they're followed by "]".
               # This is only a validation check, those characters won't be caught
)            # Close positive lookahead
.            # Match any character except newline (this consumes the "," or "]" after the number)
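To see what the expression captures, here is a small sketch that runs it with Python's `re` module. Python stands in for JavaScript or AutoBookmark here, which is an assumption: the constructs used (lookaheads, non-capturing groups) behave the same way in Python, but AutoBookmark's engine may differ. The helper name `cited_numbers` is made up for the example.

```python
import re

# The expression from above, unchanged.
PATTERN = re.compile(r"\[?\s*(\d+)\s*(?=(?:,\s*\d+)+|\])(?=[^\[]*\]).")


def cited_numbers(text):
    """Return the contents of capture group 1 for every match (hypothetical helper)."""
    return [m.group(1) for m in PATTERN.finditer(text)]


print(cited_numbers("as shown in [85, 93, 97] and [12]"))
# → ['85', '93', '97', '12']

# Without a lookbehind nothing checks for the opening bracket,
# so a list that merely ends in "]" is matched too.
print(cited_numbers("12, 13]"))
# → ['12', '13']
```

Note that a range such as [85-93] yields its end number, 93, rather than the first number, so ranges would need separate handling.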

If this program supports variable-length lookbehinds*, you can use this, which only adds a lookbehind to make sure the number is prefixed by valid characters as well.

\[?\s*(?<=\[[,\d ]*)(\d+)\s*(?=(?:,\s*\d+)+|\])(?=[^\[]*\]).

If your citation format is 100% reliable ([1], [12], [13, 14, 21], etc.), you can use a simpler version:

\[?\s*(\d+)(?=(?:, \d+)|\])(?=[^\[]*\]).

or this, if your program supports variable-length lookbehinds:

\[(?<=\[[,\d ]*)(\d+)(?=(?:, \d+)|\])(?=[^\[]*\]).

With any of these expressions, you can change the last character, ., to \]? so that the commas are not consumed and the citations stay separated by commas: [1],[15],[22].
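The effect of swapping the final character is easiest to see with a replacement. This is a sketch in Python's `re`, assuming a replacement of the form `[\1]`; AutoBookmark's link action works differently, so treat it as an illustration of the matching only.

```python
import re

SAMPLE = "[1, 15, 22]"

# First expression as given, ending in "." (consumes the "," or "]").
WITH_DOT = r"\[?\s*(\d+)\s*(?=(?:,\s*\d+)+|\])(?=[^\[]*\])."
# Same expression with the final "." changed to "\]?".
WITH_OPTIONAL_BRACKET = r"\[?\s*(\d+)\s*(?=(?:,\s*\d+)+|\])(?=[^\[]*\])\]?"

joined = re.sub(WITH_DOT, r"[\1]", SAMPLE)
separated = re.sub(WITH_OPTIONAL_BRACKET, r"[\1]", SAMPLE)

print(joined)     # → [1][15][22]
print(separated)  # → [1],[15],[22]
```

With `\]?` the comma lies outside every match, so it survives the replacement.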

* In many flavors of regular expressions, lookbehinds, if supported at all, must be fixed-length, with no quantifiers and with all alternatives the same width. For instance, (?<=a|1) will work, but (?<=a|12), (?<=a|1+) and (?<=a+) will fail, as will a quantifier applied to the lookbehind itself: (?<=a)+.
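Python's built-in `re` module is one flavor with this fixed-width restriction, so it can be used to check quickly which lookbehinds are accepted. The helper `compiles` below is a made-up name for this sketch; other engines may accept or reject different patterns.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(compiles(r"(?<=a|1)\d"))           # alternatives of equal width → True
print(compiles(r"(?<=a|12)\d"))          # alternatives of different width → False
print(compiles(r"(?<=a+)\d"))            # quantifier inside the lookbehind → False
print(compiles(r"(?<=\[[,\d ]*)(\d+)"))  # the lookbehind used above → False
```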

Edit: Thanks to Rawing for the input.




Answer 2:


Thanks for the suggestions! Here's what happens. Apparently, Evermap doesn't understand variable-length lookbehinds, so I tried your other ones. They give some results, but not all. They match simple numbers in brackets, and they match the last number in a series within brackets.

AutoBookmark does offer a "multiple rule" way of searching for text patterns, so I could look for "[35]", "[35", ", 35]", ", 35," or "35-", each individually.

Right now, I'm using three rules:

(\[)(\d{1,3})(\]|,)

\[?\s*(\d+)(?=(?:, \d+)|\])(?=[^\[]*\]).

(\[|\s)(\d{1,3})\-

For each of these, the 'replace', or what the program calls 'link action', is the extracted number, or \2.

This gets me most of what I want, but if there are more than two numbers in a series, separated by comma+space, it doesn't match the middle numbers. I would do that by hand, I suppose, if I can't find a better way.
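As a cross-check outside AutoBookmark, the three rules can be run in Python's `re`. This is only a sketch: it assumes capture groups are numbered the usual way (left to right by opening parenthesis) and says nothing definite about how AutoBookmark applies the rules.

```python
import re

# The three rules, copied as given above.
RULES = [
    r"(\[)(\d{1,3})(\]|,)",
    r"\[?\s*(\d+)(?=(?:, \d+)|\])(?=[^\[]*\]).",
    r"(\[|\s)(\d{1,3})\-",
]

# Number of capture groups in each rule.
group_counts = [re.compile(rule).groups for rule in RULES]
print(group_counts)  # → [3, 1, 2]

# What the second rule captures in group 1 for a three-number series.
SAMPLE = "[85, 93, 97]"
series = [m.group(1) for m in re.finditer(RULES[1], SAMPLE)]
print(series)  # → ['85', '93', '97']
```

Two things stand out. The second rule has only one capture group, so the number sits in \1 there rather than \2. And in Python that rule does pick up the middle number, so the gap may lie in how the rule is applied rather than in the pattern itself.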

I know I'm stumbling around here... Thanks for helping, and thanks for being patient with a newbie! (If I work this out so it's fully automated, I'll be a god at work...)



Source: https://stackoverflow.com/questions/42603140/regex-for-n-1-to-3-digit-numbers-in-square-brackets-with-commasspaces-between
