Best way to sort 1M records in Python

慢半拍i 2020-12-05 08:41

I have a service that takes a list of about 1,000,000 dictionaries and does the following:

myHashTable = {}
myLists = { 'hits':{}, 'misses':{},


        
11 answers
  •  挽巷
    挽巷 (original poster)
    2020-12-05 09:27

    I've done some quick profiling of both the original approach and SLott's proposal. In neither case does it take 5-10 minutes per field. The actual sorting is not the problem: most of the time seems to go into moving the data around and transforming it. My memory usage is also skyrocketing; my Python process is using over 350 MB of RAM. Are you sure you're not exhausting your RAM and paging to disk? Even on my slow, three-year-old laptop with a power-saving processor, sorting a million items takes far less than 5-10 minutes per key.

    What I can't explain is the variability in the actual sort() calls. Python's sort is particularly fast on partially sorted lists, so perhaps the list ends up partially sorted during the transform from the raw data to the list to be sorted.
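To make the point about partially sorted input concrete, here is a minimal sketch that counts comparisons instead of measuring wall-clock time, so the result does not depend on the machine. The `Counted` wrapper, the `count_comparisons` helper, and the synthetic data are illustrative assumptions, not part of the original service.

```python
import random


class Counted(object):
    """Wraps a value and counts every '<' comparison made during sorting."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value


def count_comparisons(values):
    """Sort a copy of values and return (sorted_values, comparisons_used)."""
    Counted.comparisons = 0
    wrapped = [Counted(v) for v in values]
    wrapped.sort()
    return [w.value for w in wrapped], Counted.comparisons


random.seed(0)  # fixed seed so the run is repeatable
n = 10000

# Fully shuffled input.
shuffled = random.sample(range(n), n)

# Mostly sorted input: ascending data followed by a small shuffled tail.
mostly_sorted = list(range(n - 100)) + random.sample(range(n - 100, n), 100)

_, random_cost = count_comparisons(shuffled)
_, partial_cost = count_comparisons(mostly_sorted)
print("shuffled:      %d comparisons" % random_cost)
print("mostly sorted: %d comparisons" % partial_cost)
```

If the transform step happens to leave long ascending runs in the data, the sort has much less work to do, which could account for some of the variability between runs.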

    Here are the results for SLott's method:

    done creating data
    done transform.  elapsed: 16.5160000324
    sorting one key slott's way takes 1.29699993134
    

    Here's the code that produced those results:

    import time

    # myList is the list of record dicts built earlier (not shown here)
    starttransform = time.time()
    # Build (hits, id) tuples so a plain sort() orders by hits
    hits = [(r['hits'], r['id']) for r in myList]
    endtransform = time.time()
    print("done transform.  elapsed: " + str(endtransform - starttransform))
    hits.sort()
    endslottsort = time.time()
    print("sorting one key slott's way takes " + str(endslottsort - endtransform))
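For comparison, the tuple-building transform can also be expressed with a key function. The sketch below puts the two side by side on a tiny hand-made data set; the function names and sample records are hypothetical, and only the `id`, `hits`, and `misses` field names come from the thread.

```python
from operator import itemgetter


def ids_by_field_decorated(records, field):
    """Build (value, id) tuples, sort them, then strip the values off again."""
    decorated = [(r[field], r['id']) for r in records]
    decorated.sort()
    return [rec_id for _, rec_id in decorated]


def ids_by_field_keyed(records, field):
    """Sort the records directly, using (value, id) as the sort key."""
    ordered = sorted(records, key=itemgetter(field, 'id'))
    return [r['id'] for r in ordered]


# Hypothetical sample records standing in for the real million-item list.
records = [
    {'id': 'a', 'hits': 30, 'misses': 2},
    {'id': 'b', 'hits': 10, 'misses': 7},
    {'id': 'c', 'hits': 20, 'misses': 5},
]

print(ids_by_field_decorated(records, 'hits'))
print(ids_by_field_keyed(records, 'hits'))
```

Both functions return the same ordering. Which one is faster on a million records is something to measure rather than assume, since the keyed version still computes one key tuple per record internally.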
    

    Now the results for the original method, or at least a close version with some instrumentation added:

    done creating data
    done transform.  elapsed: 8.125
    about to get stuff to be sorted 
    done getting data. elapsed time: 37.5939998627
    about to sort key hits
    done  sorting on key <hits> elapsed time: 5.54699993134
    

    Here's the code:

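    import time
    from operator import itemgetter

    # myLists maps each field name to a dict; mysorted collects the results
    for k in myLists:
        time1 = time.time()
        print("about to get stuff to be sorted ")
        tobesorted = list(myLists[k].items())
        time2 = time.time()
        print("done getting data. elapsed time: " + str(time2-time1))
        print("about to sort key " + str(k))
        # list.sort() sorts in place and returns None, so sort first, then store the list
        tobesorted.sort(key=itemgetter(1))
        mysorted[k] = tobesorted
        time3 = time.time()
        print("done  sorting on key <" + str(k) + "> elapsed time: " + str(time3-time2))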
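The instrumentation above repeats the same start/stop bookkeeping for every phase. One way to tidy that up is a small timing context manager, sketched below. The `timed` helper, the `timings` dict, and the synthetic `my_lists` data are assumptions for illustration only; the real data would come from the service.

```python
import time
from contextlib import contextmanager
from operator import itemgetter

timings = {}  # label -> elapsed seconds


@contextmanager
def timed(label):
    """Record how long the enclosed block takes under the given label."""
    start = time.time()
    try:
        yield
    finally:
        timings[label] = time.time() - start


# Hypothetical stand-in for the real data: one field with 1000 entries.
my_lists = {'hits': {i: (i * 7919) % 1000 for i in range(1000)}}
my_sorted = {}

for key, table in my_lists.items():
    with timed('get ' + key):
        to_be_sorted = list(table.items())  # copy the (id, value) pairs out
    with timed('sort ' + key):
        to_be_sorted.sort(key=itemgetter(1))  # in-place sort by value
    my_sorted[key] = to_be_sorted

for label in sorted(timings):
    print("%-10s %.6f s" % (label, timings[label]))
```

Keeping the "get" and "sort" phases under separate labels makes it easy to see whether the time goes into copying data out of the dict or into the sort itself.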
    
