Question
I have a list of tuples as shown below, and I need to find the strings (the first element of each tuple) that occur more than once. The code I have written so far is very slow, even with only around 10K tuples. In the example below, the string 'example' appears twice, so it is the kind of string I want to collect. What is the best way to count these strings, ideally by iterating over a generator?
List:
b_data=[('example',123),('example-one',456),('example',987),.....]
My code so far:
blockslst=[]
for line in b_data:
    blockslst.append(line[0])
blocklstgtone=[]
for item in blockslst:
    if(blockslst.count(item)>1):
        blocklstgtone.append(item)
Answer 1:
You've got the right idea extracting the first item from each tuple. You can make your code more concise using a list/generator comprehension, as I show you below.
From that point on, the most idiomatic way to find frequency counts of elements is a collections.Counter object.
- Extract the first elements from your list of tuples (using a comprehension)
- Pass this to Counter
- Query the count of example
from collections import Counter
counts = Counter(x[0] for x in b_data)
print(counts['example'])
Sure, you can use list.count if it's only one item you want to find frequency counts for, but in the general case, a Counter is the way to go.
The advantage of a Counter is that it performs frequency counts of all elements (not just example) in linear (O(N)) time. Say you also wanted to query the count of another element, say foo. That would be done with -
print(counts['foo'])
If 'foo' doesn't exist in the list, 0 is returned.
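As a minimal sketch of that behaviour, assuming only the three tuples shown in the question as sample data: looking up a missing key on a Counter returns 0 and does not insert the key.

```python
from collections import Counter

# Sample data taken from the question (only the three tuples it shows).
b_data = [('example', 123), ('example-one', 456), ('example', 987)]

counts = Counter(name for name, _ in b_data)

print(counts['example'])   # key present
print(counts['foo'])       # key absent: Counter returns 0 instead of raising KeyError
print('foo' in counts)     # the lookup above did not add 'foo' to the Counter
```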
If you want to find the most common elements, call counts.most_common -
print(counts.most_common(n))
Where n is the number of elements you want to display. If you want to see everything, don't pass n.
To retrieve only the elements that occur more than once, one efficient way is to query most_common and then take elements for as long as their count is over 1, using itertools.takewhile.
from itertools import takewhile
l = [1, 1, 2, 2, 3, 3, 1, 1, 5, 4, 6, 7, 7, 8, 3, 3, 2, 1]
c = Counter(l)
list(takewhile(lambda x: x[-1] > 1, c.most_common()))
[(1, 5), (3, 4), (2, 3), (7, 2)]
(OP edit) Alternatively, use a list comprehension to get a list of items having count > 1 -
[item[0] for item in counts.most_common() if item[-1] > 1]
Keep in mind that this isn't as efficient as the itertools.takewhile solution. For example, if you have one item with count > 1 and a million items with count equal to 1, the comprehension checks all million and one entries when it doesn't have to (because most_common returns frequency counts in descending order). With takewhile that isn't the case, because you stop iterating as soon as the condition count > 1 becomes false.
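The two variants can be compared side by side. This is only a sketch: the data is the question's three tuples plus two extra tuples added here as an assumption, and `repeated_takewhile` and `repeated_filter` are illustrative names. Note that most_common() itself sorts every entry, so takewhile only shortens the pass over the already-sorted result.

```python
from collections import Counter
from itertools import takewhile

# The question's three tuples plus two extra ones (assumed for illustration).
b_data = [('example', 123), ('example-one', 456), ('example', 987),
          ('example-two', 1), ('example-two', 2)]

counts = Counter(name for name, _ in b_data)
ranked = counts.most_common()  # (name, count) pairs, highest count first

# Stops at the first pair whose count is not > 1.
repeated_takewhile = [name for name, _ in takewhile(lambda pair: pair[1] > 1, ranked)]

# Checks every pair, even after the counts have dropped to 1.
repeated_filter = [name for name, count in ranked if count > 1]

print(repeated_takewhile)
print(repeated_filter)
```

Both lists contain the same names; the difference is only in how much of `ranked` gets inspected.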
Answer 2:
First method:
What about doing it without an explicit loop?
print(list(map(lambda x:x[0],b_data)).count('example'))
output:
2
Second method :
You can count with a plain dict, without importing any module and without making it complex:
b_data = [('example', 123), ('example-one', 456), ('example', 987)]
dict_1={}
for i in b_data:
    if i[0] not in dict_1:
        dict_1[i[0]]=1
    else:
        dict_1[i[0]]+=1
print(dict_1)
print(list(filter(lambda y:y!=None,(map(lambda x:(x,dict_1.get(x)) if dict_1.get(x)>1 else None,dict_1.keys())))))
output:
[('example', 2)]
Test_case :
b_data = [('example', 123), ('example-one', 456), ('example', 987),('example-one', 456),('example-one', 456),('example-two', 456),('example-two', 456),('example-two', 456),('example-two', 456)]
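output (entry order depends on dict ordering; on Python 3.7+ the entries appear in insertion order, starting with 'example'):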
[('example-two', 4), ('example-one', 3), ('example', 2)]
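The same counting loop can be written a little more compactly with dict.get, still without any imports. This is a sketch on the test case above; the names `counts` and `repeated` are illustrative.

```python
b_data = [('example', 123), ('example-one', 456), ('example', 987),
          ('example-one', 456), ('example-one', 456), ('example-two', 456),
          ('example-two', 456), ('example-two', 456), ('example-two', 456)]

counts = {}
for name, _ in b_data:
    # get() supplies 0 for names not seen yet, so no if/else is needed.
    counts[name] = counts.get(name, 0) + 1

# Keep only the names that occur more than once.
repeated = {name: n for name, n in counts.items() if n > 1}
print(repeated)
```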
Answer 3:
In the time it took me to write this, ayodhyankit-paul posted the same approach. I am leaving it in nonetheless for the test-data generator and the timing code:
Creating 100001 items took roughly 5 seconds, counting took about 0.3 s, and filtering on counts was too fast to measure (with datetime.now(); I did not bother with perf_counter). All in all, it took less than 5.1 s from start to finish for about 10 times the data you operate on.
I think this is similar to what Counter in COLDSPEED's answer does:
for each item in the list of tuples:
- if item[0] is not in the dict, put it into the dict with a count of 1
- else increment its count in the dict by 1
Code:
from collections import Counter
import random
from datetime import datetime # good enough for a long-running op
dt_datagen = datetime.now()
numberOfKeys = 100000
# basis for testdata
textData = ["example", "pose", "text","someone"]
numData = [random.randint(100,1000) for _ in range(1,10)] # irrelevant
# create random testdata from above lists
tData = [(random.choice(textData)+str(a%10),random.choice(numData)) for a in range(numberOfKeys)]
tData.append(("aaa",99))
dt_dictioning = datetime.now()
# create a dict
countEm = {}
# put all your data into dict, counting them
for p in tData:
    if p[0] in countEm:
        countEm[p[0]] += 1
    else:
        countEm[p[0]] = 1
dt_filtering = datetime.now()
#comparison result-wise (commented out)
#counts = Counter(x[0] for x in tData)
#for c in sorted(counts):
# print(c, " = ", counts[c])
#print()
# output dict if count > 1
subList = [x for x in countEm if countEm[x] > 1] # without "aaa"
dt_printing = datetime.now()
for c in sorted(subList):
    if (countEm[c] > 1):
        print(c, " = ", countEm[c])
dt_end = datetime.now()
print( "\n\nCreating ", len(tData) , " testdataitems took:\t", (dt_dictioning-dt_datagen).total_seconds(), " seconds")
print( "Putting them into dictionary took \t", (dt_filtering-dt_dictioning).total_seconds(), " seconds")
print( "Filtering donw to those > 1 hits took \t", (dt_printing-dt_filtering).total_seconds(), " seconds")
print( "Printing all the items left took \t", (dt_end-dt_printing).total_seconds(), " seconds")
print( "\nTotal time: \t", (dt_end- dt_datagen).total_seconds(), " seconds" )
Output:
# reformatted for brevity
example0 = 2520 example1 = 2535 example2 = 2415
example3 = 2511 example4 = 2511 example5 = 2444
example6 = 2517 example7 = 2467 example8 = 2482
example9 = 2501
pose0 = 2528 pose1 = 2449 pose2 = 2520
pose3 = 2503 pose4 = 2531 pose5 = 2546
pose6 = 2511 pose7 = 2452 pose8 = 2538
pose9 = 2554
someone0 = 2498 someone1 = 2521 someone2 = 2527
someone3 = 2456 someone4 = 2399 someone5 = 2487
someone6 = 2463 someone7 = 2589 someone8 = 2404
someone9 = 2543
text0 = 2454 text1 = 2495 text2 = 2538
text3 = 2530 text4 = 2559 text5 = 2523
text6 = 2509 text7 = 2492 text8 = 2576
text9 = 2402
Creating 100001 testdataitems took: 4.728604 seconds
Putting them into dictionary took 0.273245 seconds
Filtering donw to those > 1 hits took 0.0 seconds
Printing all the items left took 0.031234 seconds
Total time: 5.033083 seconds
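For finer-grained timings than datetime.now() gives, time.perf_counter can be used instead. The following is a sketch under assumptions: the test data is a simplified version of the generator above, the variable names are illustrative, and the measured times will differ from machine to machine.

```python
import random
import time
from collections import Counter

# Simplified test data in the spirit of the generator above (assumed shape).
text_data = ["example", "pose", "text", "someone"]
t_data = [(random.choice(text_data) + str(i % 10), random.randint(100, 1000))
          for i in range(100000)]
t_data.append(("aaa", 99))  # a key that occurs exactly once

start = time.perf_counter()
counts = Counter(name for name, _ in t_data)
repeated = [name for name, n in counts.items() if n > 1]
elapsed = time.perf_counter() - start

print(len(repeated), "keys occur more than once")
print("Counting and filtering took", elapsed, "seconds")
```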
Answer 4:
Let me give you an example to help you understand. Although this example is quite different from yours, I found it very helpful when solving this type of question.
from collections import Counter
a = [
(0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"),
(1, "Postgres"), (2, "Python"), (2, "scikit-learn"), (2, "scipy"),
(2, "numpy"), (2, "statsmodels"), (2, "pandas"), (3, "R"), (3, "Python"),
(3, "statistics"), (3, "regression"), (3, "probability"),
(4, "machine learning"), (4, "regression"), (4, "decision trees"),
(4, "libsvm"), (5, "Python"), (5, "R"), (5, "Java"), (5, "C++"),
(5, "Haskell"), (5, "programming languages"), (6, "statistics"),
(6, "probability"), (6, "mathematics"), (6, "theory"),
(7, "machine learning"), (7, "scikit-learn"), (7, "Mahout"),
(7, "neural networks"), (8, "neural networks"), (8, "deep learning"),
(8, "Big Data"), (8, "artificial intelligence"), (9, "Hadoop"),
(9, "Java"), (9, "MapReduce"), (9, "Big Data")
]
# 1. Lowercase everything.
# 2. Split it into words.
# 3. Count the results.
dictionary = Counter(word for i, j in a for word in j.lower().split())
print(dictionary)
# print every word whose count is > 1
for word, count in dictionary.most_common():
    if count > 1:
        print(word, count)
Now here is your example solved in the same manner:
from collections import Counter
a=[('example',123),('example-one',456),('example',987),('example2',987),('example3',987)]
# named counts rather than dict, to avoid shadowing the built-in dict type
counts = Counter(word for i,j in a for word in i.lower().split() )
print(counts)
for word, count in counts.most_common():
    if count > 1:
        print(word, count)
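Since the names in the question contain no whitespace, the lower().split() step leaves each one as a single token, so for that data counting the first tuple elements directly should give the same result. A small check, assuming the names are already lowercase as in the sample; `by_word` and `by_name` are illustrative names.

```python
from collections import Counter

a = [('example', 123), ('example-one', 456), ('example', 987),
     ('example2', 987), ('example3', 987)]

# Word-based counting, as in the answer above.
by_word = Counter(word for name, _ in a for word in name.lower().split())

# Direct counting of the first tuple element.
by_name = Counter(name for name, _ in a)

print(by_word == by_name)
print([name for name, n in by_name.items() if n > 1])
```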
Source: https://stackoverflow.com/questions/47843707/count-frequency-of-item-in-a-list-of-tuples