Delete repeating list elements preserving order of appearance

我在风中等你 2020-12-03 06:14

I am producing flat lists with 10^6 to 10^7 Real numbers, and some of them are repeating.

I need to delete the repeating instances, keeping the first occurrence o

3 answers
  •  执念已碎
    2020-12-03 06:27

    Not to compete with the other answers, but I could not help sharing a Compile-based solution. It builds a binary search tree, then checks, for every number in the list, whether its index in the list is the one that was used when building the tree. If it is, the number is an original; if not, it is a duplicate.

    What makes this solution interesting to me is that it shows a way to emulate "pass-by-reference" with Compile. If we inline compiled functions into other compiled functions (which can be achieved with the "InlineCompiledFunctions" option), the inner functions can refer to variables defined in the outer function's scope, because of the way inlining works. This is not true pass-by-reference, but it still lets us compose functions from smaller blocks without an efficiency penalty (it is more in the spirit of macro expansion). I don't think this is documented, and I have no idea whether it will keep working in future versions. Anyway, here is the code:

    (* A function to build a binary tree *)
    Block[{leftchildren , rightchildren},
    makeBSearchTree = 
    Compile[{{lst, _Real, 1}},
    Module[{len = Length[lst], ctr = 1, currentRoot = 1},
     leftchildren = rightchildren =  Table[0, {Length[lst]}];
     For[ctr = 1, ctr <= len, ctr++,
      For[currentRoot = 1, lst[[ctr]] != lst[[currentRoot]], (* no increment *),
       If[lst[[ctr]] < lst[[currentRoot]],
        If[leftchildren[[currentRoot]] == 0,
         leftchildren[[currentRoot]] = ctr;
         Break[],
         (* else *)
         currentRoot = leftchildren[[currentRoot]] ],
        (* else *)
        If[rightchildren[[currentRoot]] == 0,
         rightchildren[[currentRoot]] = ctr;
         Break[],
         (* else *)
         currentRoot = rightchildren[[currentRoot]]]]]];
     ], {{leftchildren, _Integer, 1}, {rightchildren, _Integer, 1}},
    CompilationTarget -> "C", "RuntimeOptions" -> "Speed",
    CompilationOptions -> {"ExpressionOptimization" -> True}]];
    
    
    (* A function to query the binary tree and check for a duplicate *)
    Block[{leftchildren , rightchildren, lst},
    isDuplicate = 
    Compile[{{index, _Integer}},
    Module[{currentRoot = 1, result = True},
     While[True,
      Which[
       lst[[index]] == lst[[currentRoot]],
        result = index != currentRoot;
        Break[],
       lst[[index]] < lst[[currentRoot]],
        currentRoot = leftchildren[[currentRoot]],
       True,
        currentRoot = rightchildren[[currentRoot]]
       ]];
     result
     ],
    {{leftchildren, _Integer, 1}, {rightchildren, _Integer, 
      1}, {lst, _Real, 1}},
    CompilationTarget -> "C", "RuntimeOptions" -> "Speed",
    CompilationOptions -> {"ExpressionOptimization" -> True}
    ]];
    
    
    (* The main function *)
    Clear[deldup];
    deldup = 
    Compile[{{lst, _Real, 1}},
      Module[{len = Length[lst], leftchildren , rightchildren , 
         nodup = Table[0., {Length[lst]}], ndctr = 0, ctr = 1},
    makeBSearchTree[lst]; 
    For[ctr = 1, ctr <= len, ctr++,
     If[! isDuplicate [ctr],
      ++ndctr;
       nodup[[ndctr]] =  lst[[ctr]]
      ]];
    Take[nodup, ndctr]], CompilationTarget -> "C", 
    "RuntimeOptions" -> "Speed",
    CompilationOptions -> {"ExpressionOptimization" -> True,
     "InlineCompiledFunctions" -> True, 
     "InlineExternalDefinitions" -> True}];
    

    Here are some tests:

    In[61]:= intTst = N@RandomInteger[{0,500000},1000000];
    
    In[62]:= (res1 = deldup[intTst ])//Short//Timing
    Out[62]= {1.141,{260172.,421188.,487754.,259397.,<<432546>>,154340.,295707.,197588.,119996.}}
    
    In[63]:= (res2 = Tally[intTst,Equal][[All,1]])//Short//Timing
    Out[63]= {0.64,{260172.,421188.,487754.,259397.,<<432546>>,154340.,295707.,197588.,119996.}}
    
    In[64]:= res1==res2
    Out[64]= True
    

    Not as fast as the Tally version, but it is also Equal-based, and as I said, my point was to illustrate an interesting (IMO) technique.
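
    For readers who want to follow the tree walk without the Compile machinery, here is a minimal uncompiled sketch of the same idea. It is not the code from this answer: the name deldupSketch and its local variables are made up for illustration, and it folds tree building and duplicate detection into one pass instead of the two passes used above. Like any unbalanced binary search tree, it can be expected to degrade on already-sorted input.

    ```mathematica
    (* Illustrative top-level version; deldupSketch is a hypothetical name *)
    deldupSketch[lst_List] :=
     Module[{len = Length[lst], left, right, root, keep},
      (* left/right hold child indices into lst; 0 means "no child" *)
      left = right = ConstantArray[0, len];
      keep = ConstantArray[True, len];
      Do[
       root = 1;
       (* walk down from the root until an equal value or a free slot is found *)
       While[lst[[i]] != lst[[root]],
        If[lst[[i]] < lst[[root]],
         If[left[[root]] == 0, left[[root]] = i; Break[], root = left[[root]]],
         If[right[[root]] == 0, right[[root]] = i; Break[], root = right[[root]]]]];
       (* the walk ended on an equal value stored at an earlier index: duplicate *)
       If[root != i && lst[[i]] == lst[[root]], keep[[i]] = False],
       {i, len}];
      Pick[lst, keep]]

    small = {3., 1., 3., 2., 1.};
    deldupSketch[small]                              (* {3., 1., 2.} *)
    deldupSketch[small] === DeleteDuplicates[small]  (* True *)
    ```

    On the small list, index 3 finds its value already stored at index 1 and index 5 finds its value at index 2, so only the first occurrences survive, in their original order.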
