How can I efficiently transform a two-column range into an expanded table?

Question

I'm trying to use geo IP data in Snowflake. This involves several things:

1) A source table with a CIDR IP range, a geoname_ID, and its lat/long coordinates.

2) I've used the parse_ip function to extract the range_start and range_end values as simple integer columns in the IPv4 range of 0 to about 4.29 billion (2^32 - 1). Some ranges consist of 1 IP; some may have as many as 16.7 million.

So the data in the 3.1-million-row intermediary table looks something like this:

RANGE_START  RANGE_END  GEONAME_ID  LATITUDE     LONGITUDE
214690946    214690946  4556793     39.84980011  -75.37470245
214690947    214690947  6252001     37.75099945  -97.82199860
214690948    214690951  6252001     37.75099945  -97.82199860
214690952    214690959  6252001     37.75099945  -97.82199860
214690960    214690975  6252001     37.75099945  -97.82199860

As you can see, a geoname ID can have multiple ranges associated with it.

The problem is that joining an IP (parsed into an integer value) with this table requires non-equality joins, which are painfully slow in Snowflake at the moment (about 1000x slower, empirically). So I would like to expand the table above to have one row per IP in each range, i.e. the last row, with the range 214690960 to 214690975, would turn into 16 rows, while preserving the geoname ID and lat/long for each of the new rows.

The only way I could think of to do this was a non-equi join to a generator table, but this took 30 minutes on a 3XL warehouse for 1000 input rows, generating about 1.2 million result rows. I have 3.1 million rows of ranges to flatten, so that won't work.

Any ideas, anyone? Here is what I tried so far:

create OR REPLACE table GENERATOR_TABLE (IP INT);

INSERT INTO GENERATOR_TABLE  SELECT ROW_NUMBER() over (ORDER BY NULL)  AS IP FROM TABLE(GENERATOR(ROWCOUNT => 4228250627)) ORDER BY IP;

create or replace table GEO_INTERMEDIARY as 
(select network_parsed:ipv4_range_start::number as range_start, network_parsed:"ipv4_range_end"::number range_end, geoname_id, latitude, longitude from  GEO_SOURCE order by range_start, range_end);

CREATE OR REPLACE TABLE EXPANDED_GEO AS 
select * from (select * from GEO_INTERMEDIARY order by geoname_id limit 1000 offset 0) A
JOIN GENERATOR_TABLE B ON B.IP >= A.RANGE_START AND B.IP <= A.RANGE_END
ORDER BY IP;

Answer 1:

For a pattern like this you could indeed try using a generator, but I usually end up using JavaScript UDTFs.

Here's an example function and usage on your data:

create or replace table x(
RANGE_START int,
RANGE_END int,
GEONAME_ID int,
LATITUDE double,
LONGITUDE double
) as 
select * from values
(214690946,214690946,4556793,39.84980011,-75.37470245),
(214690947,214690947,6252001,37.75099945,-97.82199860),
(214690948,214690951,6252001,37.75099945,-97.82199860);

create or replace function magic(
  range_start double,
  range_end double,
  geoname_id double,
  latitude double,
  longitude double
) 
returns table (
  ip double,
  geoname_id double,
  latitude double,
  longitude double
) language javascript as 
$$
{
  processRow: function(row, rowWriter, context) {
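    // Argument names appear upper-cased on the row object.
    // Emit one output row per IP in the inclusive range [start, end].
    let start = row.RANGE_START;
    const end = row.RANGE_END;

    while (start <= end) {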
      rowWriter.writeRow({
        IP: start,
        GEONAME_ID: row.GEONAME_ID,
        LATITUDE: row.LATITUDE,
        LONGITUDE: row.LONGITUDE,
      });
      start++;
    }
  }
}
$$;
select m.* from x, 
  table(magic(range_start::double, range_end::double, 
              geoname_id::double, latitude, longitude)) m;
-----------+------------+-------------+--------------+
    IP     | GEONAME_ID |  LATITUDE   |  LONGITUDE   |
-----------+------------+-------------+--------------+
 214690946 | 4556793    | 39.84980011 | -75.37470245 |
 214690947 | 6252001    | 37.75099945 | -97.8219986  |
 214690948 | 6252001    | 37.75099945 | -97.8219986  |
 214690949 | 6252001    | 37.75099945 | -97.8219986  |
 214690950 | 6252001    | 37.75099945 | -97.8219986  |
 214690951 | 6252001    | 37.75099945 | -97.8219986  |
-----------+------------+-------------+--------------+
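
To sanity-check the loop outside Snowflake, the same processRow body can be exercised in plain Node.js. This is only a sketch: the stub rowWriter and the expand helper are hypothetical stand-ins for what Snowflake provides at run time, and the sample rows are the three rows from table x above.

```javascript
// Same row-expansion logic as the UDTF body, held in a plain object.
const udtf = {
  processRow: function (row, rowWriter, context) {
    let start = row.RANGE_START;
    const end = row.RANGE_END;
    while (start <= end) {
      rowWriter.writeRow({
        IP: start,
        GEONAME_ID: row.GEONAME_ID,
        LATITUDE: row.LATITUDE,
        LONGITUDE: row.LONGITUDE,
      });
      start++;
    }
  },
};

// Hypothetical helper: collects emitted rows in an array
// instead of writing them to a Snowflake result set.
function expand(rows) {
  const out = [];
  const rowWriter = { writeRow: (r) => out.push(r) };
  for (const row of rows) {
    udtf.processRow(row, rowWriter, {});
  }
  return out;
}

// The three rows inserted into table x above.
const sample = [
  { RANGE_START: 214690946, RANGE_END: 214690946, GEONAME_ID: 4556793, LATITUDE: 39.84980011, LONGITUDE: -75.37470245 },
  { RANGE_START: 214690947, RANGE_END: 214690947, GEONAME_ID: 6252001, LATITUDE: 37.75099945, LONGITUDE: -97.8219986 },
  { RANGE_START: 214690948, RANGE_END: 214690951, GEONAME_ID: 6252001, LATITUDE: 37.75099945, LONGITUDE: -97.8219986 },
];

const expanded = expand(sample);
console.log(expanded.length);           // 6
console.log(expanded.map((r) => r.IP)); // 214690946 through 214690951
```

The six rows match the query output above, one per IP, with the geoname ID and coordinates copied from the originating range.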

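The only gotcha here is that JavaScript represents every number as a double, but for this data that's OK: IPv4 addresses are at most 2^32 - 1, well within the range of integers a double holds exactly, so you will not see any precision loss.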
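
The precision claim can be checked with a few lines of plain JavaScript. This is a small sketch rather than anything Snowflake-specific; MAX_IPV4 and stepsExactly are names made up for the illustration.

```javascript
// Largest IPv4 address as an integer: 2^32 - 1.
const MAX_IPV4 = 2 ** 32 - 1; // 4294967295

// Number.isSafeInteger is true for integers a double represents exactly
// (magnitude at most 2^53 - 1).
console.log(Number.isSafeInteger(MAX_IPV4)); // true

// Hypothetical check: does adding 1 to n produce exactly n + 1?
function stepsExactly(n) {
  return n + 1 - n === 1;
}

// Incrementing, as the UDTF loop does, stays exact at the top of the IPv4 range.
console.log(stepsExactly(MAX_IPV4 - 1)); // true

// For contrast, at 2^53 an increment of 1 is no longer representable.
console.log(stepsExactly(2 ** 53)); // false
```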

I tested it on 1M ranges producing 10M IPs, and it finished in seconds.
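
Since the output size drives the cost, it may be worth estimating it before expanding a full table: each inclusive range contributes range_end - range_start + 1 rows. A minimal sketch, using the five sample ranges from the question (expandedRowCount is a hypothetical helper name):

```javascript
// Each inclusive range [start, end] expands to end - start + 1 rows.
function expandedRowCount(ranges) {
  return ranges.reduce((total, [start, end]) => total + (end - start + 1), 0);
}

// The five sample ranges from the question, as [RANGE_START, RANGE_END] pairs.
const ranges = [
  [214690946, 214690946],
  [214690947, 214690947],
  [214690948, 214690951],
  [214690952, 214690959],
  [214690960, 214690975],
];

console.log(expandedRowCount(ranges)); // 30
```

The same sum could presumably be computed in Snowflake itself by aggregating range_end - range_start + 1 over the intermediary table. If the ranges cover most of the IPv4 space, the expanded table could run to billions of rows.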

Source: https://stackoverflow.com/questions/58418039/how-can-i-efficiently-transform-a-two-column-range-into-an-expanded-table

Tags

join

snowflake-data-warehouse