How to identify stopwords with BigQuery?

Submitted by 删除回忆录丶 on 2019-12-11 15:13:52

Question


I'm looking at reddit comments. I'm using some common stopword lists, but I want to create a custom one for this dataset. How can I do this with SQL?


Answer 1:


One approach is to treat the words that show up in the largest number of documents as stopwords.

Steps in this query:

  1. Filter posts for relevancy, quality (choose your subreddits, choose a minimum score, choose a minimum length).
  2. Unescape reddit HTML encoded values.
  3. Decide what counts as a word (in this case r'[a-z]{1,20}\'?[a-z]+').
  4. Each word counts only once per doc (comment), regardless of how many times it's repeated in each comment.
  5. Get the top x words by counting how many documents each one appears in.
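
Steps 2 and 3 can be tried locally before running anything in BigQuery. The following is a minimal Python sketch, not part of the original answer: the function name `tokenize` and the sample sentence are made up for illustration, and Python's `re` module stands in for BigQuery's regex functions, so edge cases may differ.

```python
import re

# Same pattern as step 3: 1-20 letters, an optional apostrophe, then 1+ letters.
WORD_RE = re.compile(r"[a-z]{1,20}'?[a-z]+")
# Any short HTML entity left over after '&amp;' has been decoded (e.g. '&gt;').
ENTITY_RE = re.compile(r"&[a-z]{2,4};")

def tokenize(body: str) -> list:
    """Lowercase a comment, unescape HTML entities, and extract its words."""
    text = body.lower().replace("&amp;", "&")
    text = ENTITY_RE.sub("*", text)  # mask remaining entities so they are not read as words
    return WORD_RE.findall(text)

print(tokenize("I don't know &gt; it's A test &amp; more"))
# → ["don't", 'know', "it's", 'test', 'more']
```

Note that the pattern needs at least two letters, so single-letter words such as "i" and "a" never become candidates.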

Query:

#standardSQL
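WITH words_by_post AS (
  -- One row per comment: a unique id plus the array of words found in its body.
  SELECT CONCAT(link_id, '/', id) id, REGEXP_EXTRACT_ALL(
    -- Decode '&amp;' first, then mask any remaining HTML entity (e.g. '&gt;') with '*'.
    REGEXP_REPLACE(REGEXP_REPLACE(LOWER(body), '&amp;', '&'), r'&[a-z]{2,4};', '*')
      , r'[a-z]{1,20}\'?[a-z]+') words
  FROM `fh-bigquery.reddit_comments.2017_07`
  WHERE body NOT IN ('[deleted]', '[removed]')
  AND subreddit IN ('AskReddit', 'funny', 'movies')
  AND score > 100
), words_per_doc AS (
  -- Keep comments with more than 20 words; GROUP BY makes each word count once per comment.
  SELECT id, word
  FROM words_by_post, UNNEST(words) word
  WHERE ARRAY_LENGTH(words) > 20
  GROUP BY id, word
)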

SELECT word, COUNT(*) docs_with_word
FROM words_per_doc
GROUP BY 1
ORDER BY docs_with_word DESC
LIMIT 100
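
To sanity-check the counting logic on a handful of comments, here is a rough Python sketch of what `words_per_doc` and the final `SELECT` do. It is an illustration only: `document_frequency` and the sample comments are hypothetical, and the sketch leaves out the subreddit, score, and minimum-length filters as well as the HTML unescaping.

```python
import re
from collections import Counter

# Same word pattern as in the query.
WORD_RE = re.compile(r"[a-z]{1,20}'?[a-z]+")

def document_frequency(comments, top_n=100):
    """Return the top_n (word, docs_with_word) pairs, most frequent first."""
    counts = Counter()
    for body in comments:
        # set() makes each word count once per comment, like GROUP BY id, word.
        counts.update(set(WORD_RE.findall(body.lower())))
    return counts.most_common(top_n)

sample = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird flew over the house",
]
for word, docs_with_word in document_frequency(sample, top_n=2):
    print(word, docs_with_word)
# the 3
# cat 2
```

"the" appears five times in total but is counted three times, once per comment, which is what step 4 asks for.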



Source: https://stackoverflow.com/questions/47058864/how-to-identify-stopwords-with-bigquery
