Extracting a numerical value from a paragraph based on preceding words

别来无恙 posted on 2020-07-31 04:09:26

Question


I'm working with some big text fields in columns. After some cleanup I have something like below:

truth_val: ["5"]
xerb Scale: ["2"]
perb Scale: ["1"]

I want to extract the number 2: match the string "xerb Scale" and then pull out the 2 that follows it. I tried the pattern (?:xerb Scale:\s\[\")\d{1} and also tried to exclude the matched prefix with a negative lookahead, but had no luck.

This is going to be in a SQL query and I'm trying to extract the numerical value through a REGEXP_EXTRACT() function. This query is part of a pipeline that loads this information into the database.

Any help would be much appreciated!


Answer 1:


Match the text you do not want returned so that it sets the context for the match, and wrap only the part you want to extract in a capturing group:

xerb Scale:\s*\["(\d+)"]
                 ^^^^^  

In Presto, use REGEXP_EXTRACT to get the first match:

SELECT regexp_extract(col, 'xerb Scale:\s*\["(\d+)"]', 1); -- 2
                                                      ^^^

Note the 1 argument:

regexp_extract(string, pattern, group) → varchar
Finds the first occurrence of the regular expression pattern in string and returns the capturing group number group
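
The effect of the group argument can be checked outside the database. The following is a minimal sketch in Python, used here only as a stand-in: its re module handles this particular pattern the same way, but it is not Presto. The sample text is copied from the question, and the variable names are illustrative assumptions.

```python
import re

# Sample text copied from the question; in the real pipeline this
# would be the value of the text column (assumption for illustration).
text = 'truth_val: ["5"]\nxerb Scale: ["2"]\nperb Scale: ["1"]'

# Same pattern as in the answer: the label and brackets are matched
# to set the context, and only the digits are captured.
pattern = re.compile(r'xerb Scale:\s*\["(\d+)"]')

match = pattern.search(text)   # first occurrence only
whole_match = match.group(0)   # full matched text, context included
captured = match.group(1)      # just the digits inside the parentheses

print(whole_match)
print(captured)
```

Group 0 is the whole match, label included, while group 1 holds only the digits, which is why the answer passes 1 as the third argument.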




Answer 2:


I'm sure there are many, many ways to do this. One way that would work in bash (based on the test data you've provided) is:

awk -F':' '/xerb Scale/ {print $2}' file | tr -cd '[:alnum:]._-'

There are obvious caveats to this approach - if you provide more info you will likely get a better answer :)
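
One such caveat can be made concrete with a rough Python imitation of the pipeline above. This is a sketch under assumptions, not a replacement for the command: the function name awk_tr_equivalent is hypothetical, and the character set assumes an ASCII locale for [:alnum:].

```python
import re

def awk_tr_equivalent(text: str) -> str:
    """Imitate: awk -F':' '/xerb Scale/ {print $2}' | tr -cd '[:alnum:]._-'"""
    fields = []
    for line in text.splitlines():
        if "xerb Scale" in line:
            parts = line.split(":")
            # awk prints an empty string if there is no second field
            fields.append(parts[1] if len(parts) > 1 else "")
    # awk emits one field per line
    joined = "".join(field + "\n" for field in fields)
    # tr -cd deletes every character outside the set, newlines included
    return re.sub(r"[^A-Za-z0-9._-]", "", joined)

# Data from the question: a single matching line.
single = 'truth_val: ["5"]\nxerb Scale: ["2"]\nperb Scale: ["1"]'
# Hypothetical input with two matching lines.
double = 'xerb Scale: ["2"]\nxerb Scale: ["3"]'

print(awk_tr_equivalent(single))
print(awk_tr_equivalent(double))
```

Because tr also deletes the newlines, values from several matching lines run together into one string, so the approach is only safe when the label appears once per input.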



Source: https://stackoverflow.com/questions/60536150/extracting-a-numerical-value-from-a-paragraph-based-on-preceding-words
