I have a list of tuples as shown below. I have to count how many items have a number greater than 1. The code that I have written so far is very slow. Even if there are arou
Let me give you an example to illustrate. Although it is quite different from yours, I found it very helpful for solving this type of question.
from collections import Counter
a = [
(0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"),
(1, "Postgres"), (2, "Python"), (2, "scikit-learn"), (2, "scipy"),
(2, "numpy"), (2, "statsmodels"), (2, "pandas"), (3, "R"), (3, "Python"),
(3, "statistics"), (3, "regression"), (3, "probability"),
(4, "machine learning"), (4, "regression"), (4, "decision trees"),
(4, "libsvm"), (5, "Python"), (5, "R"), (5, "Java"), (5, "C++"),
(5, "Haskell"), (5, "programming languages"), (6, "statistics"),
(6, "probability"), (6, "mathematics"), (6, "theory"),
(7, "machine learning"), (7, "scikit-learn"), (7, "Mahout"),
(7, "neural networks"), (8, "neural networks"), (8, "deep learning"),
(8, "Big Data"), (8, "artificial intelligence"), (9, "Hadoop"),
(9, "Java"), (9, "MapReduce"), (9, "Big Data")
]
#
# 1. Lowercase everything
# 2. Split it into words.
# 3. Count the results.
dictionary = Counter(word for i, j in a for word in j.lower().split())
print(dictionary)
# print every word whose count is > 1
for word, count in dictionary.most_common():
    if count > 1:
        print(word, count)
Here is your example solved in the same manner:
from collections import Counter
a=[('example',123),('example-one',456),('example',987),('example2',987),('example3',987)]
# named counts rather than dict, so the built-in dict is not shadowed
counts = Counter(word for i, j in a for word in i.lower().split())
print(counts)
for word, count in counts.most_common():
    if count > 1:
        print(word, count)
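Note that .lower().split() breaks each key into whitespace-separated words. If each key should instead be counted as a whole string, a minimal sketch could look like the following (the function name keys_seen_more_than_once is made up for illustration and is not part of the answer above):

```python
from collections import Counter

def keys_seen_more_than_once(pairs):
    """Return {key: count} for first elements that occur more than once."""
    counts = Counter(key for key, _ in pairs)
    return {key: n for key, n in counts.items() if n > 1}

a = [('example', 123), ('example-one', 456), ('example', 987),
     ('example2', 987), ('example3', 987)]
print(keys_seen_more_than_once(a))  # {'example': 2}
```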
You've got the right idea extracting the first item from each tuple. You can make your code more concise using a list/generator comprehension, as I show you below.
From that point on, the most idiomatic way to find frequency counts of elements is a collections.Counter object: pass the first elements to Counter, then query the count of example.
from collections import Counter
counts = Counter(x[0] for x in b_data)
print(counts['example'])
Sure, you can use list.count if it's only one item you want to find frequency counts for, but in the general case, a Counter is the way to go.
The advantage of a Counter is that it performs frequency counts of all elements (not just example) in linear (O(N)) time. Say you also wanted to query the count of another element, say foo. That would be done with:
print(counts['foo'])
If 'foo' doesn't exist in the list, 0 is returned.
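A small sketch of that behaviour, using made-up sample data: looking up a missing key returns 0 instead of raising KeyError, and the lookup does not add the key to the Counter.

```python
from collections import Counter

counts = Counter(['example', 'example-one', 'example'])
print(counts['example'])  # 2
print(counts['foo'])      # 0, no KeyError
print('foo' in counts)    # False: the lookup did not insert the key
```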
If you want to find the most common elements, call counts.most_common:
print(counts.most_common(n))
where n is the number of elements you want to display. If you want to see everything, don't pass n.
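For instance, with some made-up tuples in the shape of the question's data, most_common with and without n might be used like this:

```python
from collections import Counter

data = [('example', 1), ('example-one', 2), ('example', 3),
        ('example-two', 4), ('example-one', 5), ('example', 6)]
counts = Counter(name for name, _ in data)

print(counts.most_common(2))  # [('example', 3), ('example-one', 2)]
print(counts.most_common())   # all three names, most frequent first
```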
To retrieve the elements that occur more than once, one efficient way is to query most_common and then extract all elements with counts over 1 using itertools.takewhile.
from itertools import takewhile
l = [1, 1, 2, 2, 3, 3, 1, 1, 5, 4, 6, 7, 7, 8, 3, 3, 2, 1]
c = Counter(l)
list(takewhile(lambda x: x[-1] > 1, c.most_common()))
[(1, 5), (3, 4), (2, 3), (7, 2)]
(OP edit) Alternatively, use a list comprehension to get a list of items having count > 1:
[item[0] for item in counts.most_common() if item[-1] > 1]
Keep in mind that this isn't as efficient as the itertools.takewhile solution. For example, if you have one item with count > 1 and a million items with count equal to 1, the comprehension checks all million and one items when it doesn't have to (because most_common returns frequency counts in descending order). With takewhile that isn't the case, because you stop iterating as soon as the condition count > 1 becomes false.
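To make the difference concrete, here is a rough sketch that records how many items each approach inspects. The data and the helper more_than_once are invented for the illustration. Note that most_common() itself still sorts every entry, so the saving applies only to the filtering pass.

```python
from collections import Counter
from itertools import takewhile

# One repeated key plus 1000 keys that occur once (made-up data).
data = ['dup', 'dup'] + [f'unique{i}' for i in range(1000)]
counts = Counter(data)

calls = []  # records every item the predicate is asked about

def more_than_once(item):
    calls.append(item)
    return item[1] > 1

early = list(takewhile(more_than_once, counts.most_common()))
takewhile_checks = len(calls)
print(early, takewhile_checks)  # [('dup', 2)] 2

calls.clear()
full = [item for item in counts.most_common() if more_than_once(item)]
comprehension_checks = len(calls)
print(full, comprehension_checks)  # [('dup', 2)] 1001
```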
In the time it took me to write this, ayodhyankit-paul posted the same approach. I'm leaving this in nonetheless for the test-data generator code and the timing:
Creating 100001 items took roughly 5 seconds, counting took about 0.3s, and filtering on counts was too fast to measure (with datetime.now(); I did not bother with perf_counter). All in all it took less than 5.1s from start to finish for about 10 times the data you operate on.
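If finer timing resolution is wanted, time.perf_counter is the usual tool. The following is only a sketch with made-up test data and a fixed seed. It times the counting step alone, and the numbers it prints will differ from machine to machine.

```python
import random
from collections import Counter
from time import perf_counter

random.seed(0)  # fixed seed so the generated data is repeatable
names = ["example", "pose", "text", "someone"]
data = [(random.choice(names) + str(i % 10), i) for i in range(100_000)]

start = perf_counter()
counts = Counter(name for name, _ in data)
elapsed = perf_counter() - start

print(f"counted {len(data)} items in {elapsed:.4f} s")
```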
I think this is similar to what Counter in COLDSPEED's answer does:
for each item in the list of tuples:
    if item[0] is not in the dict yet, put it into the dict with a count of 1
    otherwise, increment its count in the dict by 1
Code:
from collections import Counter
import random
from datetime import datetime # good enough for a long-running op
dt_datagen = datetime.now()
numberOfKeys = 100000
# basis for testdata
textData = ["example", "pose", "text","someone"]
numData = [random.randint(100,1000) for _ in range(1,10)] # irrelevant
# create random testdata from above lists
tData = [(random.choice(textData)+str(a%10),random.choice(numData)) for a in range(numberOfKeys)]
tData.append(("aaa",99))
dt_dictioning = datetime.now()
# create a dict
countEm = {}
# put all your data into dict, counting them
for p in tData:
    if p[0] in countEm:
        countEm[p[0]] += 1
    else:
        countEm[p[0]] = 1
dt_filtering = datetime.now()
#comparison result-wise (commented out)
#counts = Counter(x[0] for x in tData)
#for c in sorted(counts):
#    print(c, " = ", counts[c])
#print()
# output dict if count > 1
subList = [x for x in countEm if countEm[x] > 1] # without "aaa"
dt_printing = datetime.now()
for c in sorted(subList):
    if (countEm[c] > 1):
        print(c, " = ", countEm[c])
dt_end = datetime.now()
print( "\n\nCreating ", len(tData) , " testdataitems took:\t", (dt_dictioning-dt_datagen).total_seconds(), " seconds")
print( "Putting them into dictionary took \t", (dt_filtering-dt_dictioning).total_seconds(), " seconds")
print( "Filtering donw to those > 1 hits took \t", (dt_printing-dt_filtering).total_seconds(), " seconds")
print( "Printing all the items left took \t", (dt_end-dt_printing).total_seconds(), " seconds")
print( "\nTotal time: \t", (dt_end- dt_datagen).total_seconds(), " seconds" )
Output:
# reformatted for brevity
example0 = 2520 example1 = 2535 example2 = 2415
example3 = 2511 example4 = 2511 example5 = 2444
example6 = 2517 example7 = 2467 example8 = 2482
example9 = 2501
pose0 = 2528 pose1 = 2449 pose2 = 2520
pose3 = 2503 pose4 = 2531 pose5 = 2546
pose6 = 2511 pose7 = 2452 pose8 = 2538
pose9 = 2554
someone0 = 2498 someone1 = 2521 someone2 = 2527
someone3 = 2456 someone4 = 2399 someone5 = 2487
someone6 = 2463 someone7 = 2589 someone8 = 2404
someone9 = 2543
text0 = 2454 text1 = 2495 text2 = 2538
text3 = 2530 text4 = 2559 text5 = 2523
text6 = 2509 text7 = 2492 text8 = 2576
text9 = 2402
Creating 100001 testdataitems took: 4.728604 seconds
Putting them into dictionary took 0.273245 seconds
Filtering donw to those > 1 hits took 0.0 seconds
Printing all the items left took 0.031234 seconds
Total time: 5.033083 seconds
First method:
What about doing it without an explicit loop?
print(list(map(lambda x:x[0],b_data)).count('example'))
output:
2
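As a side note, map still walks the list internally, and list(...) builds an intermediate list before .count scans it. A possible alternative, shown here only as a sketch, counts the matches directly with a generator expression:

```python
b_data = [('example', 123), ('example-one', 456), ('example', 987)]

# Count matching first elements without building an intermediate list.
n = sum(1 for name, _ in b_data if name == 'example')
print(n)  # 2
```

Either way, each query scans the whole list once, so this suits a single lookup rather than counting every key.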
Second method:
You can also count with a plain dict, without importing any module and without making it complex:
b_data = [('example', 123), ('example-one', 456), ('example', 987)]
dict_1={}
for i in b_data:
    if i[0] not in dict_1:
        dict_1[i[0]] = 1
    else:
        dict_1[i[0]] += 1
print(dict_1)
# keep only the (key, count) pairs whose count is greater than 1
print([(key, count) for key, count in dict_1.items() if count > 1])
output:
[('example', 2)]
Test case:
b_data = [('example', 123), ('example-one', 456), ('example', 987),('example-one', 456),('example-one', 456),('example-two', 456),('example-two', 456),('example-two', 456),('example-two', 456)]
output:
[('example-two', 4), ('example-one', 3), ('example', 2)]
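A plain dict does not order its entries by count (since Python 3.7 it keeps insertion order), so the order of the pairs can vary with the Python version. If the result should run from most to least frequent, as in the output above, an explicit sort is one option. This is a sketch under that assumption, using dict.get as a shorthand for the if/else:

```python
b_data = [('example', 123), ('example-one', 456), ('example', 987),
          ('example-one', 456), ('example-one', 456), ('example-two', 456),
          ('example-two', 456), ('example-two', 456), ('example-two', 456)]

dict_1 = {}
for name, _ in b_data:
    dict_1[name] = dict_1.get(name, 0) + 1  # start at 0 for unseen names

repeated = [(name, n) for name, n in dict_1.items() if n > 1]
repeated.sort(key=lambda pair: pair[1], reverse=True)  # highest count first
print(repeated)  # [('example-two', 4), ('example-one', 3), ('example', 2)]
```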