Finding the most common three-item sequence in a very large file

前端未结

关注

 5  1767

耶瑟儿～ 2021-02-02 17:09

I have many log files of webpage visits, where each visit is associated with a user ID and a timestamp. I need to identify the most popular (i.e. most often visited) three-page

5条回答

执笔经年 (楼主)

2021-02-02 17:56

I think you only have to store the most recently seen triple for each userid right? So you have two hash tables. The first containing key of userid, value of most recently seen triple has size equal to number of userids.

EDIT: assumes file sorted by timestamp already.

The second hash table has a key of userid:page-triple, and a value of count of times seen.

I know you said c++ but here's some awk which does this in a single pass (should be pretty straight-forward to convert to c++):

#  $1 is userid, $2 is pageid

{
    old = ids[$1];          # map with id, most-recently-seen triple
    split(old,oldarr,"-"); 
    oldarr[1]=oldarr[2]; 
    oldarr[2]=oldarr[3]; 
    oldarr[3] = $2; 
    ids[$1]=oldarr[1]"-"oldarr[2]"-"oldarr[3]; # save new most-recently-seen
    tripleid = $1":"ids[$1];  # build a triple-id of userid:triple
    if (oldarr[1] != "") { # don't accumulate incomplete triples
        triples[tripleid]++; }   # count this triple-id
}
END {
    MAX = 0;
    for (tid in  triples) {
        print tid" "triples[tid];
        if (triples[tid] > MAX) MAX = tid;
    }
    print "MAX is->" MAX" seen "triples[tid]" times";
}

0 讨论(0)

查看其它5个回答