Count frequency of item in a list of tuples

梦谈多话 2020-12-17 15:30

I have a list of tuples as shown below. I have to count how many items have a number greater than 1. The code that I have written so far is very slow. Even if there are arou

4 answers
  • 2020-12-17 16:04

    Let me start with an example. Although it differs quite a bit from yours, I have found it very helpful for solving this type of question.

    from collections import Counter
    
    a = [
    (0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"),
    (1, "Postgres"), (2, "Python"), (2, "scikit-learn"), (2, "scipy"),
    (2, "numpy"), (2, "statsmodels"), (2, "pandas"), (3, "R"), (3, "Python"),
    (3, "statistics"), (3, "regression"), (3, "probability"),
    (4, "machine learning"), (4, "regression"), (4, "decision trees"),
    (4, "libsvm"), (5, "Python"), (5, "R"), (5, "Java"), (5, "C++"),
    (5, "Haskell"), (5, "programming languages"), (6, "statistics"),
    (6, "probability"), (6, "mathematics"), (6, "theory"),
    (7, "machine learning"), (7, "scikit-learn"), (7, "Mahout"),
    (7, "neural networks"), (8, "neural networks"), (8, "deep learning"),
    (8, "Big Data"), (8, "artificial intelligence"), (9, "Hadoop"),
    (9, "Java"), (9, "MapReduce"), (9, "Big Data")
    ]
    # Steps:
    # 1. Lowercase everything.
    # 2. Split it into words.
    # 3. Count the results.
    
    dictionary = Counter(word for i, j in a for word in j.lower().split())
    
    print(dictionary)
    
    # Print every word whose count is greater than 1
    for word, count in dictionary.most_common():
        if count > 1:
            print(word, count)
    

    Now here is your example, solved in the same manner:

    from collections import Counter
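    a = [('example', 123), ('example-one', 456), ('example', 987), ('example2', 987), ('example3', 987)]
    
    # Named `counts` rather than `dict` to avoid shadowing the built-in type
    counts = Counter(word for i, j in a for word in i.lower().split())
    
    print(counts)
    
    # Print every key whose count is greater than 1
    for word, count in counts.most_common():
        if count > 1:
            print(word, count)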
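
If what is ultimately needed is the number of distinct first elements that occur more than once, rather than a printout, the same Counter can be filtered and measured. This is only a minimal sketch on the sample list used above; the names `repeated` and `num_repeated` are illustrative and do not come from the question.

```python
from collections import Counter

a = [('example', 123), ('example-one', 456), ('example', 987),
     ('example2', 987), ('example3', 987)]

# Count how often each first element occurs.
counts = Counter(name for name, _ in a)

# Keep only the names that occur more than once.
repeated = {name: n for name, n in counts.items() if n > 1}

# Number of distinct names with a count above 1.
num_repeated = len(repeated)

print(repeated)      # {'example': 2}
print(num_repeated)  # 1
```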
    
  • 2020-12-17 16:05

    You've got the right idea extracting the first item from each tuple. You can make your code more concise using a list/generator comprehension, as I show you below.

    From that point on, the most idiomatic manner to find frequency counts of elements is using a collections.Counter object.

    1. Extract the first elements from your list of tuples (using a comprehension).
    2. Pass this to Counter.
    3. Query the count of 'example'.

    from collections import Counter
    
    counts = Counter(x[0] for x in b_data)
    print(counts['example'])
    

    Sure, you can use list.count if it’s only one item you want to find frequency counts for, but in the general case, a Counter is the way to go.
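
For the single-item case, the `list.count` route might look like the sketch below. `b_data` is assumed sample data in the shape described in the question, and `first_items` is an illustrative name.

```python
b_data = [('example', 123), ('example-one', 456), ('example', 987)]

# Build a list of the first elements, then count one specific value.
first_items = [x[0] for x in b_data]
example_count = first_items.count('example')

print(example_count)  # 2
```

Each call to `list.count` scans the whole list again, which is why this only makes sense when a single value is of interest.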


    The advantage of a Counter is that it computes the frequency counts of all elements (not just 'example') in a single linear (O(N)) pass. Say you also wanted the count of another element, foo. That would be done with:

    print(counts['foo'])
    

    If 'foo' doesn’t exist in the list, 0 is returned.
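
This is where a Counter differs from a plain dict, which would raise a KeyError for a missing key. A small sketch on assumed sample data:

```python
from collections import Counter

counts = Counter(['example', 'example-one', 'example'])

# A missing key yields 0 instead of raising KeyError ...
missing = counts['foo']

# ... and the lookup does not add the key to the Counter.
still_absent = 'foo' not in counts

# The same lookup on a plain dict copy raises KeyError.
plain = dict(counts)
try:
    plain['foo']
    raised = False
except KeyError:
    raised = True

print(missing, still_absent, raised)  # 0 True True
```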

    If you want to find the most common elements, call counts.most_common:

    print(counts.most_common(n))
    

    Where n is the number of elements you want to display. If you want to see everything, don't pass n.
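
As a quick illustration of both forms, here is a sketch in which the letters of a string stand in for the tuple keys (an assumption made purely to keep the example short):

```python
from collections import Counter

counts = Counter('aaabbc')

# The two most frequent elements, highest count first.
top_two = counts.most_common(2)

# Every element, still ordered by descending count.
everything = counts.most_common()

print(top_two)     # [('a', 3), ('b', 2)]
print(everything)  # [('a', 3), ('b', 2), ('c', 1)]
```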


    To retrieve only the elements that occur more than once, one efficient way is to query most_common and keep taking entries while the count stays above 1, using itertools.takewhile.

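    from collections import Counter
    from itertools import takewhile
    
    l = [1, 1, 2, 2, 3, 3, 1, 1, 5, 4, 6, 7, 7, 8, 3, 3, 2, 1]
    c = Counter(l)
    
    # most_common() is ordered by descending count, so stop at the first count <= 1
    print(list(takewhile(lambda x: x[-1] > 1, c.most_common())))
    # [(1, 5), (3, 4), (2, 3), (7, 2)]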
    

    (OP edit) Alternatively, use a list comprehension to get a list of the items having count > 1:

    [item[0] for item in counts.most_common() if item[-1] > 1]
    

    Keep in mind that this isn’t as efficient as the itertools.takewhile solution. For example, if you have one item with count > 1 and a million items with count equal to 1, the comprehension checks all million and one entries when it doesn’t have to (because most_common returns frequency counts in descending order). With takewhile that isn’t the case, because iteration stops as soon as the condition count > 1 becomes false.
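
To make the difference concrete, the sketch below counts how often the condition is evaluated in each variant. The data set (one repeated key plus 1000 keys that occur once) and the helper `over_one` are assumptions made for illustration only.

```python
from collections import Counter
from itertools import takewhile

# Assumed test data: one repeated key plus 1000 keys that occur once.
counts = Counter(['dup', 'dup'] + [f'single{i}' for i in range(1000)])
ranked = counts.most_common()

# Tally of predicate evaluations per variant.
calls = {'takewhile': 0, 'comprehension': 0}

def over_one(pair, label):
    """Return True if the count in (key, count) exceeds 1, recording the call."""
    calls[label] += 1
    return pair[1] > 1

early = list(takewhile(lambda p: over_one(p, 'takewhile'), ranked))
full = [p for p in ranked if over_one(p, 'comprehension')]

print(early == full)  # True
print(calls)          # {'takewhile': 2, 'comprehension': 1001}
```

Note that `most_common()` without an argument still sorts every entry first, so the saving applies to the filtering step only.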

  • 2020-12-17 16:06

    In the time it took me to write this, ayodhyankit-paul posted the same approach. I am leaving it in nonetheless for the test-data generator and the timing code:

    Creating 100001 items took roughly 5 seconds, counting took about 0.3 s, and filtering on counts was too fast to measure (with datetime.now(); I did not bother with perf_counter). All in all it took less than 5.1 s from start to finish for about 10 times the data you operate on.
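
For finer-grained measurements than datetime.now() gives, `time.perf_counter` could be used instead. The following is a sketch under assumed test data modelled on the generator in this answer, not a re-run of the timings reported here; the fixed seed and the names `keys` and `data` are illustrative.

```python
import random
from collections import Counter
from time import perf_counter

random.seed(0)  # fixed seed so repeated runs use the same data
keys = ['example', 'pose', 'text', 'someone']
data = [(random.choice(keys) + str(i % 10), i) for i in range(100_000)]

# Time only the counting step with a high-resolution clock.
start = perf_counter()
counts = Counter(name for name, _ in data)
elapsed = perf_counter() - start

print(f"Counted {len(data)} tuples in {elapsed:.4f} s")
```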

    I think this is similar to what Counter in COLDSPEED's answer does:

    for each item in the list of tuples:

    • if item[0] is not in the dict yet, put it into the dict with a count of 1
    • else increment its count in the dict by 1
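
The two steps above can be condensed with `dict.get`, and the result compared against a Counter. This is a minimal sketch on assumed sample data; whether Counter works exactly this way internally is not claimed here, only that both produce the same counts.

```python
from collections import Counter

pairs = [('example', 123), ('example-one', 456), ('example', 987)]

manual = {}
for name, _ in pairs:
    # dict.get supplies 0 for a name that has not been seen yet.
    manual[name] = manual.get(name, 0) + 1

# The hand-rolled dict holds the same counts as a Counter.
same = manual == dict(Counter(name for name, _ in pairs))

print(manual)  # {'example': 2, 'example-one': 1}
print(same)    # True
```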

    Code:

    from collections import Counter
    import random
    from datetime import datetime  # good enough for a long-running operation
    
    
    dt_datagen = datetime.now()
    numberOfKeys = 100000 
    
    
    # basis for testdata
    textData = ["example", "pose", "text","someone"]
    numData = [random.randint(100,1000) for _ in range(1,10)] # irrelevant
    
    # create random testdata from above lists
    tData = [(random.choice(textData)+str(a%10),random.choice(numData)) for a in range(numberOfKeys)] 
    
    tData.append(("aaa",99))
    
    dt_dictioning = datetime.now()
    
    # create a dict
    countEm = {}
    
    # put all your data into dict, counting them
    for p in tData:
        if p[0] in countEm:
            countEm[p[0]] += 1
        else:
            countEm[p[0]] = 1
    
    dt_filtering = datetime.now()
    #comparison result-wise (commented out)        
    #counts = Counter(x[0] for x in tData)
    #for c in sorted(counts):
    #    print(c, " = ", counts[c])
    #print()  
    # output dict if count > 1
    subList = [x for x in countEm if countEm[x] > 1] # without "aaa"
    
    dt_printing = datetime.now()
    
    for c in sorted(subList):
        if (countEm[c] > 1):
            print(c, " = ", countEm[c])
    
    dt_end = datetime.now()
    
    print( "\n\nCreating ", len(tData) , " testdataitems took:\t", (dt_dictioning-dt_datagen).total_seconds(), " seconds")
    print( "Putting them into dictionary took \t", (dt_filtering-dt_dictioning).total_seconds(), " seconds")
    print( "Filtering down to those > 1 hits took \t", (dt_printing-dt_filtering).total_seconds(), " seconds")
    print( "Printing all the items left took    \t", (dt_end-dt_printing).total_seconds(), " seconds")
    
    print( "\nTotal time: \t", (dt_end- dt_datagen).total_seconds(), " seconds" )
    

    Output:

    # reformatted for brevity
    example0  =  2520       example1  =  2535       example2  =  2415
    example3  =  2511       example4  =  2511       example5  =  2444
    example6  =  2517       example7  =  2467       example8  =  2482
    example9  =  2501
    
    pose0  =  2528          pose1  =  2449          pose2  =  2520      
    pose3  =  2503          pose4  =  2531          pose5  =  2546          
    pose6  =  2511          pose7  =  2452          pose8  =  2538          
    pose9  =  2554
    
    someone0  =  2498       someone1  =  2521       someone2  =  2527
    someone3  =  2456       someone4  =  2399       someone5  =  2487
    someone6  =  2463       someone7  =  2589       someone8  =  2404
    someone9  =  2543
    
    text0  =  2454          text1  =  2495          text2  =  2538
    text3  =  2530          text4  =  2559          text5  =  2523      
    text6  =  2509          text7  =  2492          text8  =  2576      
    text9  =  2402
    
    
    Creating  100001  testdataitems took:    4.728604  seconds
    Putting them into dictionary took        0.273245  seconds
    Filtering down to those > 1 hits took    0.0  seconds
    Printing all the items left took         0.031234  seconds
    
    Total time:      5.033083  seconds 
    
  • 2020-12-17 16:18

    First method:

    How about doing it without an explicit loop?

    print(list(map(lambda x:x[0],b_data)).count('example'))
    

    output:

    2
    

    Second method:

    You can also count with a plain dict, without importing any module and without making it too complex:

    b_data = [('example', 123), ('example-one', 456), ('example', 987)]
    
    dict_1={}
    for i in b_data:
        if i[0] not in dict_1:
            dict_1[i[0]]=1
        else:
            dict_1[i[0]]+=1
    
    print(dict_1)
    
    
    
    # Keep only the keys with a count above 1, highest count first
    print(sorted(((k, v) for k, v in dict_1.items() if v > 1), key=lambda kv: kv[1], reverse=True))
    

    output:

    [('example', 2)]
    

    Test case:

    b_data = [('example', 123), ('example-one', 456), ('example', 987),('example-one', 456),('example-one', 456),('example-two', 456),('example-two', 456),('example-two', 456),('example-two', 456)]
    

    output:

    [('example-two', 4), ('example-one', 3), ('example', 2)]
    