Extracting a numerical value from a paragraph based on preceding words

问题

I'm working with some big text fields in columns. After some cleanup I have something like below:

truth_val: ["5"]
xerb Scale: ["2"]
perb Scale: ["1"]

I want to extract the number 2. I'm trying to match the string "xerb Scale" and then extract 2. I tried capturing the group including 2 as (?:xerb Scale:\s\[\")\d{1} and tried to exclude the matched group through a negative look ahead but had no luck.

This is going to be in a SQL query and I'm trying to extract the numerical value through a REGEXP_EXTRACT() function. This query is part of a pipeline that loads this information into the database.

Any help would be much appreciated!

回答1:

You should match what you do not need to obtain in order to set the context for your match, and you need to match and capture what you need to extract:

xerb Scale:\s*\["(\d+)"]
                 ^^^^^

See the regex demo. In Presto, use REGEXP_EXTRACT to get the first match:

SELECT regexp_extract(col, 'xerb Scale:\s*\["(\d+)"]', 1); -- 2
                                                      ^^^

Note the 1 argument:

regexp_extract(string, pattern, group) → varchar
Finds the first occurrence of the regular expression pattern in string and returns the capturing group number group

回答2:

I'm sure there are many, many ways to do this. One way that would work in bash (based on the test data you've provided) is:

awk -F':' '/xerb Scale/ {print $2}' file | tr -cd '[:alnum:]._-'

There are obvious caveats to this approach - if you provide more info you will likely get a better answer :)

来源：https://stackoverflow.com/questions/60536150/extracting-a-numerical-value-from-a-paragraph-based-on-preceding-words

标签

regex

presto