Count frequency of item in a list of tuples

梦谈多话 2020-12-17 15:30

I have a list of tuples as shown below. I have to count how many items have a number greater than 1. The code that I have written so far is very slow. Even if there are arou

4 answers
  •  攒了一身酷
    2020-12-17 16:06

    In the time it took me to write this, ayodhyankit-paul posted the same approach. I am leaving my answer in nonetheless, for the test-data generator and the timing code:

    Creating 100001 items took roughly 5 seconds, counting took about 0.3 s, and filtering on the counts was too fast to measure with datetime.now() (I did not bother with perf_counter). All in all it took less than 5.1 s from start to finish, for about 10 times the data you operate on.
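Where a step is too fast for datetime.now() to resolve, time.perf_counter() is one option. The following is only a minimal sketch: the helper names timed and count_first and the generated data are assumptions for illustration, not part of the original benchmark.

```python
import time


def timed(func, *args):
    """Call func(*args) once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def count_first(items):
    """Count how often each first tuple element occurs."""
    counts = {}
    for key, _ in items:
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical test data: 100000 tuples spread over 10 distinct keys
data = [("key" + str(i % 10), i) for i in range(100_000)]

counts, elapsed = timed(count_first, data)
print(f"counting took {elapsed:.6f} s")
```

perf_counter() returns fractional seconds from a high-resolution clock, so very short steps are less likely to be reported as 0.0.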

    I think this is similar to what Counter in COLDSPEED's answer does:

    For each item in the list of tuples:

    • if item[0] is not in the dict, put it into the dict with a count of 1
    • else increment its count in the dict by 1
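The steps above can be sketched on a tiny input as follows. The sample tuples are hypothetical, and the Counter line is shown only for comparison with the manual loop.

```python
from collections import Counter

# Hypothetical sample data: (name, number) tuples, not taken from the question
pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("a", 6)]

# Manual version: the first occurrence stores 1, later ones increment
manual_counts = {}
for name, _ in pairs:
    manual_counts[name] = manual_counts.get(name, 0) + 1

# The same counts using collections.Counter
counts = Counter(name for name, _ in pairs)

# Keep only the names that occur more than once
repeated = {name: n for name, n in counts.items() if n > 1}
print(repeated)
```

Both variants make a single pass over the list, and each dict lookup takes constant time on average, so the work grows linearly with the number of tuples.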

    Code:

    from collections import Counter
    import random
    from datetime import datetime # good enough for a long-running op
    
    
    dt_datagen = datetime.now()
    numberOfKeys = 100000 
    
    
    # basis for testdata
    textData = ["example", "pose", "text","someone"]
    numData = [random.randint(100,1000) for _ in range(1,10)] # irrelevant
    
    # create random testdata from above lists
    tData = [(random.choice(textData)+str(a%10),random.choice(numData)) for a in range(numberOfKeys)] 
    
    tData.append(("aaa",99))
    
    dt_dictioning = datetime.now()
    
    # create a dict
    countEm = {}
    
    # put all your data into dict, counting them
    for p in tData:
        if p[0] in countEm:
            countEm[p[0]] += 1
        else:
            countEm[p[0]] = 1
    
    dt_filtering = datetime.now()
    #comparison result-wise (commented out)        
    #counts = Counter(x[0] for x in tData)
    #for c in sorted(counts):
    #    print(c, " = ", counts[c])
    #print()  
    # output dict if count > 1
    subList = [x for x in countEm if countEm[x] > 1] # without "aaa"
    
    dt_printing = datetime.now()
    
    for c in sorted(subList):
        if (countEm[c] > 1):
            print(c, " = ", countEm[c])
    
    dt_end = datetime.now()
    
    print( "\n\nCreating ", len(tData) , " testdataitems took:\t", (dt_dictioning-dt_datagen).total_seconds(), " seconds")
    print( "Putting them into dictionary took \t", (dt_filtering-dt_dictioning).total_seconds(), " seconds")
    print( "Filtering donw to those > 1 hits took \t", (dt_printing-dt_filtering).total_seconds(), " seconds")
    print( "Printing all the items left took    \t", (dt_end-dt_printing).total_seconds(), " seconds")
    
    print( "\nTotal time: \t", (dt_end- dt_datagen).total_seconds(), " seconds" )
    

    Output:

    # reformatted for brevity
    example0  =  2520       example1  =  2535       example2  =  2415
    example3  =  2511       example4  =  2511       example5  =  2444
    example6  =  2517       example7  =  2467       example8  =  2482
    example9  =  2501
    
    pose0  =  2528          pose1  =  2449          pose2  =  2520      
    pose3  =  2503          pose4  =  2531          pose5  =  2546          
    pose6  =  2511          pose7  =  2452          pose8  =  2538          
    pose9  =  2554
    
    someone0  =  2498       someone1  =  2521       someone2  =  2527
    someone3  =  2456       someone4  =  2399       someone5  =  2487
    someone6  =  2463       someone7  =  2589       someone8  =  2404
    someone9  =  2543
    
    text0  =  2454          text1  =  2495          text2  =  2538
    text3  =  2530          text4  =  2559          text5  =  2523      
    text6  =  2509          text7  =  2492          text8  =  2576      
    text9  =  2402
    
    
    Creating  100001  testdataitems took:    4.728604  seconds
    Putting them into dictionary took        0.273245  seconds
    Filtering donw to those > 1 hits took    0.0  seconds
    Printing all the items left took         0.031234  seconds
    
    Total time:      5.033083  seconds 
    
