Intersecting texts to find common words

天命终不由人 2021-02-09 07:40

I'm trying to find the most efficient way to intersect a set of texts and find the words they have in common. Given this scenario:

var t1 = 'My

1 answer
  •  难免孤独
    2021-02-09 08:14

    Here's a tested solution:

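    function intersect() {
       var set = {};  // object used as a set: its keys are the words common so far
       [].forEach.call(arguments, function(a, i){
         var tokens = a.match(/\w+/g) || [];  // match returns null when no word is found
         if (!i) {
           // first text: every token starts out as a candidate
           tokens.forEach(function(t){ set[t] = 1; });
         } else {
           // later texts: drop the candidates that do not appear in this text
           for (var k in set){
             if (tokens.indexOf(k) < 0) delete set[k];
           }
         }
       });
       return Object.keys(set);
    }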
    

    This function is variadic: you can call it with any number of texts.

    console.log(intersect(t1, t2, t3)) // -> ["Mary"] 
    
    console.log(intersect(t1, t2)) // -> ["Mary", "from"] 
    
    console.log(intersect()) // -> [] 
    

    If you need to support non-English languages, this regex won't be enough, because \w matches only ASCII letters, digits, and the underscore. Either use a regex library or define your regex by explicitly excluding separator characters, as in a.match(/[^\s\-.,!?]+/g) (this will probably be enough for you).
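    As an alternative to listing the excluded characters, engines that implement ES2018 Unicode property escapes can match letters of any script directly. The sketch below assumes such an engine; tokenize is a hypothetical helper name, not part of the original answer.

```javascript
// Sketch: tokenize with Unicode property escapes (ES2018, requires the `u` flag).
// \p{L} = letters, \p{M} = combining marks, \p{N} = numbers, in any script.
function tokenize(text) {
  return text.match(/[\p{L}\p{M}\p{N}_]+/gu) || [];
}

var sample = 'caf\u00e9 cr\u00e8me'; // "café crème", written with escapes to avoid encoding issues

console.log(sample.match(/\w+/g)); // -> ["caf", "cr", "me"]  (\w stops at accented letters)
console.log(tokenize(sample));     // -> ["café", "crème"]
```

    Replacing a.match(/\w+/g) with a call to such a helper would leave the rest of the function unchanged.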


    Detailed explanation:

    The idea is to fill a set with the tokens of the first text and then remove from the set the tokens missing in the other texts.
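    To make the idea concrete, here is a small trace that prints the candidate set after each text. The three sample texts are hypothetical (the ones from the question are not reproduced here), and the loop is a simplified restatement of the same fill-then-remove logic.

```javascript
// Hypothetical sample texts, chosen only for illustration.
var texts = [
  'Mary had a little lamb',
  'a lamb for Mary',
  'Mary likes lamb stew'
];

var set = {};
texts.forEach(function (text, i) {
  var tokens = text.match(/\w+/g) || [];
  if (i === 0) {
    // fill the set with the tokens of the first text
    tokens.forEach(function (t) { set[t] = 1; });
  } else {
    // remove the tokens that are missing from this text
    Object.keys(set).forEach(function (k) {
      if (tokens.indexOf(k) < 0) delete set[k];
    });
  }
  console.log('after text ' + (i + 1) + ':', Object.keys(set));
});
// after text 1: ["Mary", "had", "a", "little", "lamb"]
// after text 2: ["Mary", "a", "lamb"]
// after text 3: ["Mary", "lamb"]
```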

    1. The set is a JavaScript object used as a map. Some purists would have used Object.create(null) to avoid a prototype, but I like the simplicity of {}.
    2. As I want my function to be variadic, I use arguments instead of declaring the passed texts as explicit parameters.
    3. arguments isn't a real array, so to iterate over it you need either a for loop or a trick like [].forEach.call. It works because arguments is "array-like".
    4. To tokenize, I simply use match to match words; nothing special here (see the note above regarding better support of other languages, though).
    5. I use !i to check whether it's the first text. In that case, I simply copy the tokens as properties into the set. A value must be used, so I use 1. An ES6 Set makes the intent more obvious here.
    6. For the following texts, I iterate over the elements of the set (the keys) and remove the ones that are not in the array of tokens (tokens.indexOf(k)<0).
    7. Finally, I return the elements of the set as an array. The simplest solution is to use Object.keys.
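    For comparison, the same steps can be sketched with an ES6 Set and rest parameters, assuming an ES2015-capable engine. intersectWithSet is a hypothetical name; this is one possible rewrite, not the original author's code.

```javascript
// Sketch: the same fill-then-remove algorithm using ES6 Set and rest parameters.
function intersectWithSet(...texts) {
  let common = null; // stays null until the first text has been tokenized
  for (const text of texts) {
    const tokens = new Set(text.match(/\w+/g) || []);
    if (common === null) {
      common = tokens; // first text: all of its tokens are candidates
    } else {
      // deleting from a Set while iterating over it is safe in JavaScript
      for (const word of common) {
        if (!tokens.has(word)) common.delete(word);
      }
    }
  }
  return common === null ? [] : Array.from(common);
}

// Hypothetical sample texts:
console.log(intersectWithSet('Mary had a little lamb', 'a lamb for Mary', 'Mary likes lamb stew'));
// -> ["Mary", "lamb"]
console.log(intersectWithSet()); // -> []
```

    A Set sidesteps the prototype question from point 1, and tokens.has(word) avoids the linear scan that tokens.indexOf(k) performs for every candidate.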
