find all two word phrases that appear in more than one row in a dataset

笑着哭i 提交于 2019-11-28 01:54:19

问题


We would like to run a query that returns two word phrases that appear in more than one row. So for e.g. take the string "Data Ninja". Since it appears in more than one row in our dataset, the query should return that. The query should find all such phrases from all the rows in our dataset, by querying for two adjacent word combination (forming a phrase) in the rows that are in the dataset. These two adjacent word combinations should come from the dataset we loaded into BigQuery

How can we write this query in Google BigQuery?

The dataset is simply a long list of English sentences.


回答1:


Good news: BigQuery now supports SPLIT(). Check https://stackoverflow.com/a/24172995/132438.


This is a hack, but a hack I happen to like :).

In its current form, it only works for sentences with more than 2 words, and it only extracts the 6 first pairs. You can extend and test from here.

Try it on your data, and please report back.

SELECT pairs, COUNT(*) c FROM
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){0}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){1}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){2}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){3}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){4}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){5}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
)
WHERE pairs != title
GROUP EACH BY pairs
HAVING c > 1
LIMIT 1000

Results might contain NSFW words. The sample dataset comes from an online community that has not been "cleaned up". Abstain from running query if you are sensitive to some words.




回答2:


A very useful hack which inspired me to solve my problem, thanks.

My data is a combination of passengers and their age where age is a string of numbers:

adults ages
------ -------------
  4    "53,67,65,68"       
  4    "44,45,69,65" 
  3    "20,21,20"
  3    "30,32,62"

I wanted to add a column on each row containing the difference in age between the highest and lowest value

adults ages          agediff
------ ------------- -------
   4   "53,67,65,68" 15       
   4   "44,45,69,65" 25
   3   "20,21,20"    1
   3   "30,32,62"    32

This was done by the following, heavily inspired by the hack:

SELECT adults, ages, SUBTRACT(INTEGER(maxage),INTEGER(minage)) agediff FROM 
 (SELECT adults, ages, max(age) maxage, min(age) minage FROM
  (SELECT adults, ages, age FROM 
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="3")),
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="3")),
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="3"))
  ),
  (SELECT adults, ages, age FROM 
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4")),
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4")),
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4")),
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,\d\d\,\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4"))
  )

)



来源:https://stackoverflow.com/questions/18709547/find-all-two-word-phrases-that-appear-in-more-than-one-row-in-a-dataset

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!