Finding the most common three-item sequence in a very large file

耶瑟儿~ 2021-02-02 17:09

I have many log files of webpage visits, where each visit is associated with a user ID and a timestamp. I need to identify the most popular (i.e. most often visited) three-page

5 answers
  •  执笔经年
    2021-02-02 17:56

    I think you only need to store the most recently seen triple for each userid, right? That gives you two hash tables. The first maps each userid to its most recently seen triple, so its size equals the number of userids.

    EDIT: this assumes the file is already sorted by timestamp.

    The second hash table has a key of userid:page-triple and a value counting how many times that key has been seen.

    I know you said C++, but here's some awk that does this in a single pass (it should be pretty straightforward to convert to C++):

    #  $1 is userid, $2 is pageid
    
    {
        old = ids[$1];          # map with id, most-recently-seen triple
        split(old,oldarr,"-"); 
        oldarr[1]=oldarr[2]; 
        oldarr[2]=oldarr[3]; 
        oldarr[3] = $2; 
        ids[$1]=oldarr[1]"-"oldarr[2]"-"oldarr[3]; # save new most-recently-seen
        tripleid = $1":"ids[$1];  # build a triple-id of userid:triple
        if (oldarr[1] != "") { # don't accumulate incomplete triples
            triples[tripleid]++; }   # count this triple-id
    }
    END {
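        max_count = 0;   # highest count seen so far
        max_tid = "";    # triple-id that has that count
        for (tid in triples) {
            print tid " " triples[tid];
            if (triples[tid] > max_count) { max_count = triples[tid]; max_tid = tid; }
        }
        print "MAX is->" max_tid " seen " max_count " times";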
    }
    
