Hadoop - composite key

盖世英雄少女心 2020-12-14 22:39

Suppose I have a tab delimited file containing user activity data formatted like this:

timestamp  user_id  page_id  action_id

I want to write a MapReduce job that uses a composite key built from user_id and page_id. What is the best way to implement such a key in Hadoop?

2 Answers
  • 2020-12-14 22:56

    You could write your own key class that implements WritableComparable (which already extends Writable) and compares your two fields.

    Pierre-Luc Bertrand

  • 2020-12-14 23:09

    Just compose your own Writable. In your example a solution could look like this:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    
    import org.apache.hadoop.io.WritableComparable;
    
    import com.google.common.collect.ComparisonChain;
    
    public class UserPageWritable implements WritableComparable<UserPageWritable> {
    
      private String userId;
      private String pageId;
    
      // Hadoop instantiates key classes reflectively, so the
      // no-arg constructor is required
      public UserPageWritable() {
      }
    
      public UserPageWritable(String userId, String pageId) {
        this.userId = userId;
        this.pageId = pageId;
      }
    
      @Override
      public void readFields(DataInput in) throws IOException {
        // must read the fields in the same order write() emits them
        userId = in.readUTF();
        pageId = in.readUTF();
      }
    
      @Override
      public void write(DataOutput out) throws IOException {
        out.writeUTF(userId);
        out.writeUTF(pageId);
      }
    
      @Override
      public int compareTo(UserPageWritable o) {
        // sort by userId first, then by pageId
        return ComparisonChain.start().compare(userId, o.userId)
            .compare(pageId, o.pageId).result();
      }
    }
    

    Although your IDs could probably be stored as a long, this is the String version. It is just normal serialization over the Writable interface: write() serializes the fields and readFields() must read them back in the same order. Note that Hadoop needs the default constructor to instantiate the key, so you should always provide one.
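    The write/readFields round trip can be sketched without a cluster, using plain java.io streams in place of Hadoop's serialization machinery (the sample values are made up):

    ```java
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class WritableRoundTrip {
        public static void main(String[] args) throws IOException {
            // write(): serialize the two key fields in a fixed order
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF("user-42");
            out.writeUTF("page-7");

            // readFields(): read them back in exactly the same order
            DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray()));
            String userId = in.readUTF();
            String pageId = in.readUTF();

            System.out.println(userId + "\t" + pageId); // user-42	page-7
        }
    }
    ```

    If write() and readFields() ever disagree on field order or types, keys deserialize as garbage on the reduce side, so keeping the two methods symmetric is the whole contract.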

    The compareTo logic defines how the dataset is sorted, and it also tells the reducer which keys are equal so their values can be grouped together.
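    The resulting two-level ordering can be sketched with the JDK's Comparator, which chains comparisons the same way Guava's ComparisonChain does (the key values here are made up):

    ```java
    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    public class CompositeKeyOrder {
        public static void main(String[] args) {
            // compare by userId first, then by pageId -- the same
            // ordering the compareTo above produces
            Comparator<String[]> byUserThenPage =
                Comparator.<String[], String>comparing(k -> k[0])
                          .thenComparing(k -> k[1]);

            List<String[]> keys = Arrays.asList(
                new String[] {"u2", "p1"},
                new String[] {"u1", "p2"},
                new String[] {"u1", "p1"});

            String sorted = keys.stream()
                .sorted(byUserThenPage)
                .map(k -> k[0] + "/" + k[1])
                .collect(Collectors.joining(", "));
            System.out.println(sorted); // u1/p1, u1/p2, u2/p1
        }
    }
    ```

    All pages for one user sort next to each other, which is exactly what makes a composite key useful for secondary-sort patterns.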

    ComparisonChain is a nice utility from Guava.

    Don't forget to override equals and hashCode! The default partitioner (HashPartitioner) chooses the reducer by the hashCode of the key.
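    A minimal sketch of those overrides, shown here on a plain-Java stand-in class (the name UserPageKey is made up; in practice the methods go on UserPageWritable itself):

    ```java
    import java.util.Objects;

    // equals/hashCode over the same two fields the key's compareTo uses
    public class UserPageKey {
        private final String userId;
        private final String pageId;

        public UserPageKey(String userId, String pageId) {
            this.userId = userId;
            this.pageId = pageId;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof UserPageKey)) return false;
            UserPageKey other = (UserPageKey) o;
            return Objects.equals(userId, other.userId)
                && Objects.equals(pageId, other.pageId);
        }

        @Override
        public int hashCode() {
            // equal keys must hash equally, or the partitioner
            // could send them to different reducers
            return Objects.hash(userId, pageId);
        }
    }
    ```

    Keeping equals, hashCode, and compareTo consistent over the same fields is what guarantees that sorting, grouping, and partitioning all agree on which records belong together.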
