Get ranking of words over date based on frequency in PostgreSQL

Submitted by 。_饼干妹妹 on 2021-02-11 12:59:06

Question


I have a database that stores twitter data:

        CREATE TABLE tweet (
            ID BIGINT UNIQUE,
            user_ID BIGINT,
            created_at TIMESTAMPTZ,
            tweet TEXT
        );

I'm trying to write a query that goes through the words in tweet for all rows, gets the frequency of each word, and returns the ten most frequent words along with each word's ranking on each date.

Example:

("word1": [1,20,22,23,24,25,26,27,28,29,30,29,28,27,26,25,26,27,28,29,30,29,28,29,28,27,28,29,30,30,...],
 "word2": [...])

My current query gets the top ten words, but I am having some trouble getting the rankings of those words for each day.

Current query:

    SELECT word, count(*)
    FROM (
        SELECT regexp_split_to_table(
            regexp_replace(tweet_clean, '\y(rt|co|https|amp|f)\y', '', 'g'), '\s+')
        AS word
        FROM tweet
    ) t
    GROUP BY word
    ORDER BY count(*) DESC
    LIMIT 10;

Which returns:

[('vaccine', 286669),
 ('covid', 213857),
 ('yum', 141345),
 ('pfizer', 39532),
 ('people', 28960),
 ('beer', 27117),
 ('say', 24569),
 ('virus', 23682),
 ('want', 21988),
 ('foo', 19823)]
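
To check the word-splitting step on a handful of rows without a database, the same logic can be sketched in Python. This is only an illustration under assumptions: `split_words`, `top_words`, and the sample tweets are made up here, and Python writes the word boundary as `\b` where the query uses `\y`. The sketch also shows that removing a stop word at the start of a tweet leaves leading whitespace, so the split yields an empty string. `regexp_split_to_table` should behave the same way, though that is worth confirming on real data.

```python
import re
from collections import Counter

# Same stop-word list as the query; \b is Python's spelling of PostgreSQL's \y.
STOP_WORDS = re.compile(r"\b(rt|co|https|amp|f)\b")


def split_words(text):
    """Remove the stop words, then split on runs of whitespace."""
    cleaned = STOP_WORDS.sub("", text)
    return re.split(r"\s+", cleaned)


def top_words(texts, n=10):
    """Count every word across all texts and return the n most frequent."""
    counts = Counter()
    for text in texts:
        # Skip the empty strings produced when the cleaned text
        # starts or ends with whitespace.
        counts.update(w for w in split_words(text) if w)
    return counts.most_common(n)


# Hypothetical sample data, not taken from the real table.
sample = ["rt covid vaccine news", "vaccine rollout amp covid vaccine"]
print(split_words(sample[0]))  # ['', 'covid', 'vaccine', 'news']
print(top_words(sample, n=2))  # [('vaccine', 3), ('covid', 2)]
```

If empty strings do show up among the top words, filtering on `word <> ''` before grouping would be one way to drop them.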

Answer 1:


If you want the top 10 per day, you can do:

    select *
    from (
        select date_trunc('day', created_at) as created_day, word, count(*) as cnt,
            rank() over(partition by date_trunc('day', created_at) order by count(*) desc) rn
        from tweet t
        cross join lateral regexp_split_to_table(
            regexp_replace(tweet_clean, '\y(rt|co|https|amp|f)\y', '', 'g'),
            '\s+'
        ) w(word)
        group by created_day, word
    ) t
    where rn <= 10
    order by created_day, rn desc;
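
One detail of `rank()` worth keeping in mind: words with the same count share a rank and the next rank is skipped, so the filter `rn <= 10` can return more than ten rows for a day when there is a tie at the cutoff. The Python sketch below mimics `rank() over (partition by day order by count(*) desc)` on a tiny made-up sample; `daily_ranks` and the sample rows are hypothetical names introduced only for illustration.

```python
from collections import Counter, defaultdict


def daily_ranks(rows):
    """rows: iterable of (day, word) pairs.

    Returns {day: {word: rank}} with SQL rank() semantics:
    tied counts share a rank and the following rank is skipped.
    """
    counts = defaultdict(Counter)
    for day, word in rows:
        counts[day][word] += 1

    ranks = {}
    for day, counter in counts.items():
        day_ranks = {}
        prev_count, prev_rank = None, 0
        # most_common() yields (word, count) sorted by count, descending.
        for position, (word, cnt) in enumerate(counter.most_common(), start=1):
            rank = prev_rank if cnt == prev_count else position
            day_ranks[word] = rank
            prev_count, prev_rank = cnt, rank
        ranks[day] = day_ranks
    return ranks


# Hypothetical sample: one day, with 'covid' and 'pfizer' tied.
rows = [("2020-12-18", w)
        for w in ["vaccine"] * 3 + ["covid"] * 2 + ["pfizer"] * 2 + ["beer"]]
print(daily_ranks(rows))
# {'2020-12-18': {'vaccine': 1, 'covid': 2, 'pfizer': 2, 'beer': 4}}
```

If exactly ten rows per day are required, `row_number()` in place of `rank()` would break ties arbitrarily instead of keeping them.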



Answer 2:


If I understand correctly, you want 10 rows, one for each of the most common words, and for each word an array of its daily rankings. Assuming that each word is used on each day, this should do that:

    select wd.word,
           -- one array per word, with the ranks ordered by day
           array_agg(wd.day_rank order by wd.created_day) as ranks
    from (select date_trunc('day', t.created_at) as created_day, w.word,
                 -- total occurrences of the word across all days
                 sum(count(*)) over (partition by w.word) as total_cnt,
                 -- rank of the word within its day
                 rank() over (partition by date_trunc('day', t.created_at) order by count(*) desc) as day_rank
          from tweet t cross join lateral
               regexp_split_to_table(regexp_replace(t.tweet_clean, '\y(rt|co|https|amp|f)\y', '', 'g'),
                                     '\s+'
                                    ) w(word)
          group by created_day, w.word
         ) wd
    group by wd.word, wd.total_cnt
    order by wd.total_cnt desc
    limit 10;

The challenge here is that the arrays could be of different lengths, because a word may not appear on every day. In Postgres, you can pad the arrays with additional values, but it is not clear what value should stand in for the ranking on a day when the word is absent.
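
To make the alignment problem concrete, here is a small Python sketch that builds one list per word with a slot for every day and uses `None` where the word does not occur. The function name `rank_series`, the sample ranks, and the choice of `None` as placeholder are all assumptions for illustration. Which placeholder is appropriate remains an open question.

```python
def rank_series(ranks_by_day, words):
    """ranks_by_day: {day: {word: rank}}.

    Returns {word: [rank or None, one entry per day in day order]},
    so every list has the same length.
    """
    days = sorted(ranks_by_day)  # ISO date strings sort chronologically
    return {w: [ranks_by_day[d].get(w) for d in days] for w in words}


# Hypothetical per-day ranks; 'covid' is missing on the second day.
ranks_by_day = {
    "2020-12-18": {"vaccine": 1, "covid": 2},
    "2020-12-19": {"vaccine": 1},
    "2020-12-20": {"covid": 1, "vaccine": 2},
}
print(rank_series(ranks_by_day, ["vaccine", "covid"]))
# {'vaccine': [1, 1, 2], 'covid': [2, None, 1]}
```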

The issue is that the ranking is per day. So, consider two days, one that has 100 words and one that has 10 words. In the first, a ranking of "10" is a very high ranking. A ranking of 10 in the second is very low.
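
One possible way to make ranks comparable across days, offered here only as a sketch, is to scale each rank by the number of distinct words seen that day. The formula below is the one SQL's `percent_rank()` uses, `(rank - 1) / (rows - 1)`; the helper name `relative_rank` is hypothetical.

```python
def relative_rank(rank, words_that_day):
    """Scale a 1-based rank to the range [0, 1].

    0.0 is the top word of the day and 1.0 is the last one.
    A day with a single word gets 0.0.
    """
    if words_that_day <= 1:
        return 0.0
    return (rank - 1) / (words_that_day - 1)


# Rank 10 on a day with 100 distinct words is near the top (about 0.09) ...
print(relative_rank(10, 100))
# ... while rank 10 on a day with only 10 words is the very bottom.
print(relative_rank(10, 10))  # 1.0
```

Whether a scaled value like this is more useful than the raw rank depends on how the result will be read or plotted.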

I might suggest that you think about this issue and ask a new question if you need help resolving it.



Source: https://stackoverflow.com/questions/65354100/get-ranking-of-words-over-date-based-on-frequency-in-postgresql
