Snowflake SQL Regex ~ Extracting Multiple Vals

问题

I am trying to identify a value that is nested in a string using Snowflakes regexp_substr()

The values that I want to access are in quotes:

...
Type:
  a:
    - !<string>
     val: "A"
    - !<string>
     val: "B"
    - !<string>
     val: "C"
...

*There is a lot of text above and below this.

I want to extract A, B, and C for all columns. But I am unsure how. I have tried using regexp_substr() but haven't been able to isolate past the first value. I have tried:

REGEXP_SUBSTR(col, 'Type\\W+(\\w+)\\W+\\w.+\\W+\\w.+')

which yields:

Type: a: - !<string> val: "A"

and while that gives the first portion of the string with "A", I just want a way to access "A", "B", and "C" individually.

回答1:

You have to extract the values in two stages;

Extract the section of the document below Type: a: containing all the val: "data".
Extract the "data" as an array or use REGEXP_SUBSTR() + index n to extract the nth element

SELECT
  'Type:\\s+\\w+:((\\s+- !<string>\\s+val:\\s+"[^"]")+)' type_section_rx
  REGEXP_SUBSTR(col, type_section_rx, 1, 1, 'i', 1) vals,
  PARSE_JSON('[0' || REPLACE(vals, REGEXP_SUBSTR(vals, '[^"]+'), ', ') || ']') raw_array,
  ARRAY_SLICE(raw_array, 1, ARRAY_SIZE(raw_array)) val_array,
  val_array[1] B
FROM INPUT_STRING

The result is an array where you can access the first value with the index [0] etc.
The first regexp can be shortened down to a "least effort" 'Type:\\s+\\w+:(([^"]+"[^"]+")+)'.

回答2:

This select statement will give you what you want ... sorta. You should notice that it will look for the a specific occurence of "val" and then give you the next word character after that.

REGEX to my knowledge evaluates to the first occurence of the expression, so once the pattern is found it's done. You may want to look at the Snowflake JavaScript Stored Procedure to see if you can take the example below and iterate through, incrementing the appropriate value to produce the expected output.

SELECT REGEXP_SUBSTR('Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"','val\\W+(\\w+)', 1, 1, 'e', 1) as A,
       REGEXP_SUBSTR('Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"','val\\W+(\\w+)', 1, 2, 'e', 1) as B,
       REGEXP_SUBSTR('Type: a:- !<string>val: "A" - !<string> val: "B" - !<string> val: "C"','val\\W+(\\w+)', 1, 3, 'e', 1) as C;

回答3:

One more angle -- Use javascript regex capabilities in a UDF.

For example:

create or replace function my_regexp(S text)
  returns array
  language javascript
as
$$
  const re = /(\w+)/g
  return [...S.match(re)]
$$
;

Invoked this way:

set S = '
Type:
  a:
    - !<string>
     val: "A"
    - !<string>
     val: "B"
    - !<string>
     val: "C"
';

select my_regexp($S);

Yields:

[ "Type", "a", "string", "val", "A", "string", "val", "B", "string", "val", "C" ]

Implementing your full regex is a little more work but as you can see, this gets around the single value limitation.

That said, if performance is your priority, I would expect Snowflake native regex support to outperform, even though you specify the regex multiple times, though I haven't tested this.

来源：https://stackoverflow.com/questions/59188198/snowflake-sql-regex-extracting-multiple-vals

标签

sql

regex

snowflake-data-warehouse