I have a list of tuples as shown below. I have to count how many items have a number greater than 1. The code that I have written so far is very slow. Even if there are arou
In the time it took me to write this, ayodhyankit-paul posted the same approach. I am leaving it in nonetheless for the test-data generator and the timing code:
Creating 100001 items took roughly 5 seconds, counting took about 0.3 s, and filtering on the counts was too fast to measure (with datetime.now(); I did not bother with perf_counter). All in all it took less than 5.1 s from start to finish for about 10 times the data you operate on.
I think this is similar to what Counter in COLDSPEED's answer does:
for each item in the list of tuples:
    if item[0] is not in the dict, put it into the dict with a count of 1
    otherwise, increment its count in the dict by 1
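A minimal sketch of the steps above, checked against collections.Counter on a tiny hand-made list. The function name count_first_items and the sample data are made up for illustration, and this only shows that the results agree, not how Counter is implemented internally.

```python
from collections import Counter


def count_first_items(pairs):
    """Count how often each first element occurs in a list of tuples."""
    counts = {}
    for item in pairs:
        key = item[0]
        if key not in counts:
            counts[key] = 1      # first time this key is seen
        else:
            counts[key] += 1     # seen before: increment its count
    return counts


# hypothetical sample data, not taken from the question
sample = [("text1", 5), ("pose2", 7), ("text1", 9), ("aaa", 99)]

manual = count_first_items(sample)
via_counter = Counter(key for key, _ in sample)

# keys that occur more than once
repeated = sorted(key for key, n in manual.items() if n > 1)
print(manual == dict(via_counter), repeated)
```

Both approaches produce the same counts here; the hand-written loop just makes the bookkeeping visible.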
Code:
from collections import Counter
import random
from datetime import datetime # good enough for a long-running op
dt_datagen = datetime.now()
numberOfKeys = 100000
# basis for testdata
textData = ["example", "pose", "text","someone"]
numData = [random.randint(100,1000) for _ in range(1,10)] # irrelevant
# create random testdata from above lists
tData = [(random.choice(textData)+str(a%10),random.choice(numData)) for a in range(numberOfKeys)]
tData.append(("aaa",99))
dt_dictioning = datetime.now()
# create a dict
countEm = {}
# put all your data into dict, counting them
for p in tData:
    if p[0] in countEm:
        countEm[p[0]] += 1
    else:
        countEm[p[0]] = 1
dt_filtering = datetime.now()
#comparison result-wise (commented out)
#counts = Counter(x[0] for x in tData)
#for c in sorted(counts):
#    print(c, " = ", counts[c])
#print()
# output dict if count > 1
subList = [x for x in countEm if countEm[x] > 1] # without "aaa"
dt_printing = datetime.now()
for c in sorted(subList):
    if (countEm[c] > 1):
        print(c, " = ", countEm[c])
dt_end = datetime.now()
print( "\n\nCreating ", len(tData) , " testdataitems took:\t", (dt_dictioning-dt_datagen).total_seconds(), " seconds")
print( "Putting them into dictionary took \t", (dt_filtering-dt_dictioning).total_seconds(), " seconds")
print( "Filtering donw to those > 1 hits took \t", (dt_printing-dt_filtering).total_seconds(), " seconds")
print( "Printing all the items left took \t", (dt_end-dt_printing).total_seconds(), " seconds")
print( "\nTotal time: \t", (dt_end- dt_datagen).total_seconds(), " seconds" )
Output:
# reformatted for brevity
example0 = 2520 example1 = 2535 example2 = 2415
example3 = 2511 example4 = 2511 example5 = 2444
example6 = 2517 example7 = 2467 example8 = 2482
example9 = 2501
pose0 = 2528 pose1 = 2449 pose2 = 2520
pose3 = 2503 pose4 = 2531 pose5 = 2546
pose6 = 2511 pose7 = 2452 pose8 = 2538
pose9 = 2554
someone0 = 2498 someone1 = 2521 someone2 = 2527
someone3 = 2456 someone4 = 2399 someone5 = 2487
someone6 = 2463 someone7 = 2589 someone8 = 2404
someone9 = 2543
text0 = 2454 text1 = 2495 text2 = 2538
text3 = 2530 text4 = 2559 text5 = 2523
text6 = 2509 text7 = 2492 text8 = 2576
text9 = 2402
Creating 100001 testdataitems took: 4.728604 seconds
Putting them into dictionary took 0.273245 seconds
Filtering donw to those > 1 hits took 0.0 seconds
Printing all the items left took 0.031234 seconds
Total time: 5.033083 seconds
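The 0.0 seconds reported for the filtering step most likely reflects the limited resolution of datetime.now() rather than a true zero. As a hedged sketch, the same three phases can be timed with time.perf_counter, which is intended for short intervals. The helper names timed, make_data and repeated_keys, as well as the smaller data size, are assumptions for illustration; the numbers will differ from machine to machine.

```python
import random
from collections import Counter
from time import perf_counter


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = perf_counter()
    result = func(*args)
    return result, perf_counter() - start


def make_data(n):
    # same shape as the test data above: (word + digit, number)
    words = ["example", "pose", "text", "someone"]
    return [(random.choice(words) + str(i % 10), random.randint(100, 1000))
            for i in range(n)]


def repeated_keys(counts):
    # keep only the keys that were seen more than once
    return [key for key, n in counts.items() if n > 1]


data, t_create = timed(make_data, 10_000)
data.append(("aaa", 99))  # occurs exactly once, so it must be filtered out
counts, t_count = timed(Counter, [key for key, _ in data])
repeated, t_filter = timed(repeated_keys, counts)

print(f"create: {t_create:.6f}s  count: {t_count:.6f}s  filter: {t_filter:.6f}s")
```

With perf_counter the filtering step should report a small but non-zero duration instead of being rounded away.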