Find all two-word phrases that appear in more than one row in a dataset

旧巷少年郎 2020-12-18 13:56

We would like to run a query that returns two-word phrases that appear in more than one row. For example, take the string "Data Ninja". Since it appears in more than one ro

2 answers
  • 2020-12-18 14:13

    Good news: BigQuery now supports SPLIT(). Check https://stackoverflow.com/a/24172995/132438.
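
    To show the split-based idea in isolation, here is a minimal sketch in Python rather than BigQuery SQL. It is not taken from the linked answer: the function names and the sample rows are made up for illustration. It splits each row into words, pairs each word with its successor, and counts how many rows contain each pair.

```python
from collections import Counter


def two_word_phrases(text):
    """Return the set of adjacent word pairs in one row of text."""
    words = text.split()  # split on any run of whitespace
    return {f"{a} {b}" for a, b in zip(words, words[1:])}


def phrases_in_multiple_rows(rows):
    """Count the rows containing each two-word phrase; keep counts above 1."""
    counts = Counter()
    for row in rows:
        # Using a set per row means a phrase repeated inside a single row
        # is counted once for that row.
        counts.update(two_word_phrases(row))
    return {phrase: n for phrase, n in counts.items() if n > 1}


# Hypothetical sample rows, made up for illustration.
rows = [
    "Data Ninja wanted",
    "I am a Data Ninja",
    "Ninja wanted urgently",
]
print(phrases_in_multiple_rows(rows))
```

    With these rows, "Data Ninja" and "Ninja wanted" are each found in two rows, and every other pair is dropped.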


    This is a hack, but a hack I happen to like :).

    In its current form, it only works for sentences with more than 2 words, and it only extracts the first 6 pairs (one subquery per pair). You can extend and test from here.

    Try it on your data, and please report back.

    SELECT pairs, COUNT(*) c FROM
    (
    SELECT REGEXP_REPLACE(title, '([^\\s]+ ){0}([^\\s]* [^\\s]+).*', '\\2') pairs, title
    FROM [bigquery-samples:reddit.full]
    ),
    (
    SELECT REGEXP_REPLACE(title, '([^\\s]+ ){1}([^\\s]* [^\\s]+).*', '\\2') pairs, title
    FROM [bigquery-samples:reddit.full]
    ),
    (
    SELECT REGEXP_REPLACE(title, '([^\\s]+ ){2}([^\\s]* [^\\s]+).*', '\\2') pairs, title
    FROM [bigquery-samples:reddit.full]
    ),
    (
    SELECT REGEXP_REPLACE(title, '([^\\s]+ ){3}([^\\s]* [^\\s]+).*', '\\2') pairs, title
    FROM [bigquery-samples:reddit.full]
    ),
    (
    SELECT REGEXP_REPLACE(title, '([^\\s]+ ){4}([^\\s]* [^\\s]+).*', '\\2') pairs, title
    FROM [bigquery-samples:reddit.full]
    ),
    (
    SELECT REGEXP_REPLACE(title, '([^\\s]+ ){5}([^\\s]* [^\\s]+).*', '\\2') pairs, title
    FROM [bigquery-samples:reddit.full]
    )
    WHERE pairs != title
    GROUP EACH BY pairs
    HAVING c > 1
    LIMIT 1000
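
    To see what each subquery extracts, the regex can be tried locally. This is a sketch in Python, assuming Python's `re` module handles this pattern the same way BigQuery does; `pair_at` and `extract_pairs` are hypothetical helper names, not part of the query.

```python
import re

# Same pattern as in the query, with the {n} repeat count filled in per subquery.
# Assumption: Python's re module stands in for BigQuery's regex engine here.
PATTERN = r"([^\s]+ ){%d}([^\s]* [^\s]+).*"


def pair_at(title, n):
    """Mimic one subquery: skip n words, keep the next two, drop the rest.

    If the title has too few words the pattern does not match and the
    title comes back unchanged.
    """
    return re.sub(PATTERN % n, r"\2", title)


def extract_pairs(title, max_pairs=6):
    """Mimic the six subqueries together with the `pairs != title` filter."""
    candidates = (pair_at(title, n) for n in range(max_pairs))
    return [pair for pair in candidates if pair != title]


print(extract_pairs("the quick brown fox"))  # ['the quick', 'quick brown', 'brown fox']
print(extract_pairs("Data Ninja"))           # []
```

    A two-word title yields nothing because its only pair equals the whole title and is removed by the filter. A title with more than seven words yields only its first six pairs.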
    

    Results might contain NSFW words. The sample dataset comes from an online community and has not been "cleaned up". Do not run the query if you are sensitive to such words.

  • 2020-12-18 14:14

    A very useful hack which inspired me to solve my problem, thanks.

    My data is a combination of passengers and their ages, where the ages are stored as a comma-separated string of numbers:

    adults ages
    ------ -------------
    4      "53,67,65,68"
    4      "44,45,69,65"
    3      "20,21,20"
    3      "30,32,62"
    

    I wanted to add a column to each row containing the difference between the highest and lowest ages:

    adults ages          agediff
    ------ ------------- -------
    4      "53,67,65,68" 15
    4      "44,45,69,65" 25
    3      "20,21,20"    1
    3      "30,32,62"    32
    

    This was done by the following, heavily inspired by the hack:

    SELECT adults, ages, SUBTRACT(INTEGER(maxage),INTEGER(minage)) agediff FROM 
     (SELECT adults, ages, max(age) maxage, min(age) minage FROM
      (SELECT adults, ages, age FROM 
       (SELECT adults, ages, REGEXP_EXTRACT(ages, r'([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="3")),
       (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="3")),
       (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="3"))
      ),
      (SELECT adults, ages, age FROM 
       (SELECT adults, ages, REGEXP_EXTRACT(ages, r'([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4")),
       (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4")),
       (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4")),
       (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,\d\d\,\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4"))
      )
      GROUP BY adults, ages
     )
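
    For comparison, the same calculation outside BigQuery is short. This is a minimal Python sketch, with `age_diff` as a hypothetical helper name. Note that it converts the ages to integers before taking the maximum and minimum, whereas the query above takes the maximum and minimum of the extracted strings and converts afterwards.

```python
def age_diff(ages):
    """Difference between the highest and lowest age in a string like "53,67,65,68"."""
    values = [int(a) for a in ages.split(",")]
    return max(values) - min(values)


# The four rows from the example table, with adults kept as a string
# because the query compares it against "3" and "4".
table = [
    ("4", "53,67,65,68"),
    ("4", "44,45,69,65"),
    ("3", "20,21,20"),
    ("3", "30,32,62"),
]
for adults, ages in table:
    print(adults, ages, age_diff(ages))
```

    This prints 15, 25, 1 and 32 as the differences, matching the agediff column in the table.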
