I am building a dictionary on a hadoop cluster and need to generate a numeric id for each token. How should I do it?
To avoid synchronization, sorting, and grouping that are not part of your business logic, you can use a few tricks, which will be faster.
The simplest is to generate a UUID in the Reducer, one per key, via UUID.randomUUID(), but these are not numeric.
If you want a continuous sequence of numeric ids and your output is small enough for one reducer to process, then force a single reducer for the job via org.apache.hadoop.mapreduce.Job.setNumReduceTasks(int tasks); all of the keys will then be directed to one Reducer.
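The single-Reducer idea can be sketched outside Hadoop; here `assignIds` is a hypothetical helper that plays the role of the lone Reducer, handing out a dense sequence with a plain local counter (no synchronization needed, since only one Reducer runs):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SequentialIds {
    // With setNumReduceTasks(1) every key reaches a single Reducer,
    // which can assign a continuous sequence 1, 2, 3, ... from a local counter.
    static Map<String, Long> assignIds(Iterable<String> keys) {
        Map<String, Long> ids = new LinkedHashMap<>();
        long next = 1;
        for (String k : keys) {
            if (!ids.containsKey(k)) ids.put(k, next++);
        }
        return ids;
    }
}
```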
If the output of the Mapper is still too big for a single Reducer, and you either do not care about the continuity of the id sequence or your dictionary can be partitioned, then you can use some tricks with a Partitioner. The idea is that you can logically partition the keys into N ranges of known length (e.g. range 1 can hold 1 million keys whose ids start at 1, range 4 can hold 500 ids starting at 3500000, etc.). Logic:
If you do not have any business knowledge of the ranges, you can spend some time doing a distinct on the keys and calculating the ranges and their lengths. With this approach you get a continuous sequence of ids in the result set.
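The range bookkeeping from the distinct pass can be sketched as plain Java; `rangeOffsets` is a hypothetical helper (not part of any Hadoop API) that turns per-range key counts into starting ids, so each range r hands out ids [offset[r], offset[r] + counts[r]) and the overall sequence stays continuous:

```java
public class RangeIds {
    // Given the key count of each range (from a distinct pass over the keys),
    // compute the starting id of each range so that ids 1, 2, 3, ... form
    // one continuous sequence across all ranges.
    static long[] rangeOffsets(long[] counts) {
        long[] offsets = new long[counts.length];
        long next = 1;
        for (int r = 0; r < counts.length; r++) {
            offsets[r] = next;        // range r gets ids [next, next + counts[r])
            next += counts[r];
        }
        return offsets;
    }
}
```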
If you do not want to spend time doing a distinct on the keys, the goal is to make each Reducer start its ids with a different digit (Reducer 1 generates ids that start only with 1 (1, 10, 124523, 1341243), Reducer 2 only with 2 (2, 23, 234234532), etc.). To do this, compute the first byte of the key mod 10, force 10 Reducers, and direct zeros to the same partition as 1 (the main reason is that no multi-digit integer starts with 0, which could cause collisions with ids from other partitions); the output for partition 0 is therefore empty. Then on the reducer side, append (concatenate two strings!) the counter to (first byte of the key mod 10), where 0 is changed to 1. Each reducer keeps a counter running from 1 upward, which could be initialized from a file that contains the last id used for that partition.
Ex.
key = "abc", ascii of 'a' is 97, 97 % 10 = 7 and id for that key is '7' + '1' = '71',
for "asd" it will be '7' + '244' = '7244',
for "bbv" is '8' + '1' = '81',
for "bgh" is '8' + '2' = '82',
for "ddv", 'd' is ascii 100, 100 % 10 = 0, converted to 1, '1' + '1' = '11',
for "edv", 101 % 10 = 1, '1' + '2234' = '12234'.
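The partition-and-concatenate scheme above can be sketched in plain Java (the Hadoop wiring, a Partitioner returning this partition number and a Reducer calling `nextId`, is omitted; `PrefixIdGenerator` is a hypothetical name, and counters here start from 1 rather than from a persisted file):

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class PrefixIdGenerator {
    // One independent counter per partition digit 1..9.
    private final Map<Integer, Long> counters = new HashMap<>();

    // Partition = first byte of the key mod 10, with 0 remapped to 1
    // (no multi-digit integer starts with 0, so a 0 prefix could collide).
    static int partitionOf(String key) {
        int p = (key.getBytes(StandardCharsets.UTF_8)[0] & 0xFF) % 10;
        return p == 0 ? 1 : p;
    }

    // Id = STRING CONCATENATION of the partition digit and the counter.
    String nextId(String key) {
        int p = partitionOf(key);
        long c = counters.merge(p, 1L, Long::sum);
        return "" + p + c;   // concatenation, not addition
    }
}
```

Because the first character of every id is the partition digit and the counters never touch it, ids from different partitions cannot collide.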
Because ids from different partitions start with different digits, they never overlap, so there is no need to synchronize between multiple Reducers. This is guaranteed by the string concatenation of the mod result and the counter, so there is no overflow into the next prefix/leading digit. Another advantage is that you do not need any presorting, which is not part of your business logic. When a Reducer is closed, it can write to a file the last counter it used in id generation, which can be read on the next run to provide continuity of ids within a partition of the dictionary.
To increase the number of partitions from 10 to 100, use mod 100 and merge single-digit results with 2-digit ones (once again we waste the outputs of 10 reducers). E.g. 10 % 100 = 10 stays 10; 1 % 100 = 1 is converted to 10; 102 % 100 = 2 is converted to 20, by appending '0' as a string or multiplying by 10. The goal is for all prefixes to have the same number of digits, in this case 2.
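The mod-100 padding rule can be written as a small helper; `prefix100` is a hypothetical name, and it merges 0 with 1 (both map to prefix 10), mirroring how 0 was merged with 1 in the mod-10 case:

```java
public class PrefixPad {
    // Pad single-digit mod-100 results to two digits so every prefix has
    // the same length: 0 and 1 -> 10, 2 -> 20, ..., 9 -> 90; 10..99 unchanged.
    static int prefix100(int firstByte) {
        int p = (firstByte & 0xFF) % 100;
        return p < 10 ? Math.max(p, 1) * 10 : p;
    }
}
```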
With clever logic we can avoid wasting the skipped partitions (partition 0 in the mod 10 case, or partitions 0 through 9 in the mod 100 case).
Warning: this logic is vulnerable to data skews.
I hope it helped, Cheers.