find all two word phrases that appear in more than one row in a dataset

问题

We would like to run a query that returns two word phrases that appear in more than one row. So for e.g. take the string "Data Ninja". Since it appears in more than one row in our dataset, the query should return that. The query should find all such phrases from all the rows in our dataset, by querying for two adjacent word combination (forming a phrase) in the rows that are in the dataset. These two adjacent word combinations should come from the dataset we loaded into BigQuery

How can we write this query in Google BigQuery?

The dataset is simply a long list of English sentences.

回答1:

Good news: BigQuery now supports SPLIT(). Check https://stackoverflow.com/a/24172995/132438.

This is a hack, but a hack I happen to like :).

In its current form, it only works for sentences with more than 2 words, and it only extracts the 6 first pairs. You can extend and test from here.

Try it on your data, and please report back.

SELECT pairs, COUNT(*) c FROM
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){0}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){1}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){2}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){3}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){4}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){5}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
)
WHERE pairs != title
GROUP EACH BY pairs
HAVING c > 1
LIMIT 1000

Results might contain NSFW words. The sample dataset comes from an online community that has not been "cleaned up". Abstain from running query if you are sensitive to some words.

回答2:

A very useful hack which inspired me to solve my problem, thanks.

My data is a combination of passengers and their age where age is a string of numbers:

adults ages
------ -------------
  4    "53,67,65,68"       
  4    "44,45,69,65" 
  3    "20,21,20"
  3    "30,32,62"

I wanted to add a column on each row containing the difference in age between the highest and lowest value

adults ages          agediff
------ ------------- -------
   4   "53,67,65,68" 15       
   4   "44,45,69,65" 25
   3   "20,21,20"    1
   3   "30,32,62"    32

This was done by the following, heavily inspired by the hack:

SELECT adults, ages, SUBTRACT(INTEGER(maxage),INTEGER(minage)) agediff FROM 
 (SELECT adults, ages, max(age) maxage, min(age) minage FROM
  (SELECT adults, ages, age FROM 
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="3")),
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="3")),
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="3"))
  ),
  (SELECT adults, ages, age FROM 
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4")),
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4")),
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4")),
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,\d\d\,\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4"))
  )

)

来源：https://stackoverflow.com/questions/18709547/find-all-two-word-phrases-that-appear-in-more-than-one-row-in-a-dataset

标签

data-mining

bigdata

google-bigquery

data-analysis

n-gram