Question
I'm looking for a Postgres (actually Redshift) equivalent to Hive's parse_url(..., 'HOST').
Postgres docs say it has a URL parser as part of its full text search. This blog post has a regex which may or may not be bulletproof. What is best?
Answer 1:
If you weren't using Redshift, I'd say "use PL/Perlu, PL/Python, or one of the other procedural languages to get a regular URL parser". Since you're on a proprietary fork of PostgreSQL 8.0.2, I suspect you're going to have to settle for a hacky regexp.
There is no way to access the full-text search URL parser from the SQL level. You could write a C extension to expose the function to SQL quite easily, but of course you can't install the extension in Redshift, so again it won't do you any good.
Time to abuse regular expressions.
(BTW, thanks for actually saying you're on Redshift; too many people say "PostgreSQL" when they mean "a vaguely PostgreSQL-based hosted version of ParAccel".)
Answer 2:
Until Redshift starts supporting the regular expression functions of PostgreSQL, if you want to get the host out of an HTTP/S URL in Redshift SQL you'll have to do something like:
select split_part(url, '/', 3) as host from my_table
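This works because splitting an HTTP/S URL on '/' gives the scheme ('https:'), then an empty string, then the host. A minimal sketch with literal values, assuming the URLs always carry a scheme; the sample URLs are made up for illustration:

```sql
-- Hypothetical sample URLs; split_part(string, delimiter, part) counts parts from 1.
select split_part('https://www.example.com/a/b?x=1', '/', 3) as host;
-- www.example.com

-- A port stays attached to the host part.
select split_part('http://www.example.com:8080/a', '/', 3) as host;
-- www.example.com:8080

-- Splitting the result again on ':' drops the port.
select split_part(split_part('http://www.example.com:8080/a', '/', 3), ':', 1) as host;
-- www.example.com
```

A URL stored without a scheme (for example 'www.example.com/a') has no third part, so the same call returns an empty string for it.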
Answer 3:
Redshift now has a REGEXP_SUBSTR function.
It searches for the regular expression in the string and returns the first substring that matches. One example of a regex to extract the host:
select REGEXP_SUBSTR(url, '[^/]+\\.[^/:]+') from my_table;
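As a rough check of that pattern, here it is applied to a few literal values. The sample URLs are made up, and the results in the comments follow from how the pattern reads: a run of non-slash characters, a literal dot, then a run of characters that are neither '/' nor ':'.

```sql
-- Hypothetical sample URLs, same pattern as above.
select regexp_substr('https://www.example.com/a/b', '[^/]+\\.[^/:]+') as host;
-- www.example.com

-- The trailing [^/:]+ stops at the colon, so the port is dropped.
select regexp_substr('http://www.example.com:8080/a', '[^/]+\\.[^/:]+') as host;
-- www.example.com

-- Caveat: a host with no dot is skipped, and the first dotted path segment matches instead.
select regexp_substr('http://localhost/index.html', '[^/]+\\.[^/:]+') as host;
-- index.html
```

So the regex is not bulletproof: it assumes the host contains at least one dot, which is worth checking against your data before relying on it.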
Source: https://stackoverflow.com/questions/17310972/how-to-parse-host-out-of-a-string-in-redshift