Insert item into case-insensitive sorted list in Python

不问归期 提交于 2019-12-10 13:18:33

问题


I have a list of strings that is already sorted in case-insensitive order. I would like to insert a new string into the list. One way to do this is to append the item and then sort the list, like so:

myList.append('Something')
myList.sort(key=lambda s: s.lower())

But I was wondering if there is a way to just insert the item in the correct position without sorting the whole thing again.

I found this question: Insert an item into a sorted list in Python. It points towards Python's bisect module. But that module doesn't look like it can support case-insensitivity.


Edit: I tested out several of the answers listed here.

  • Appending the item to the end and sorting the whole list (as proposed in the original question) was the slowest.
  • Moinuddin Quadri's answer was faster than sorting the whole list, but it was still pretty slow, due to running lower() on every item on the list.
  • Stefan Pochmann's answer was an order of magnitude faster than sorting the whole list.
  • Jared Goguen's answer was the fastest for repeated insertions. The first time, though, it runs lower() on every element.

It was a close call to accept an answer. In the end, I went with Stefan Pochmann's answer because it was the best for a one-time insertion, and accessing the resulting list does not require accessing a member variable. However, use cases vary, so be sure to examine all the answers.


回答1:


It's a good opportunity to practice binary search again (or just copy&paste&modify bisect.insort, which is what I did):

def insort_case_insensitive(a, x):
    key = x.lower()
    lo, hi = 0, len(myList)
    while lo < hi:
        mid = (lo + hi) // 2
        if key < a[mid].lower():
            hi = mid
        else:
            lo = mid + 1
    a.insert(lo, x)

Demo:

myList = ['a', 'b', 'c', 'd', 'e']
for x in 'A', 'B', 'C', 'D', 'E':
    insort_case_insensitive(myList, x)
print myList

Prints:

['a', 'A', 'b', 'B', 'c', 'C', 'd', 'D', 'e', 'E']

It's O(n), just like append+sort would be, but only because of the a.insert(lo, x) at the end. Which is dead-simple and done in C, so it's super fast. The binary search of course only takes O(log n) steps, so that's super fast as well. The append+sort way would call .lower() on all elements and compare them, both of which is much slower. @MoinuddinQuadri's first solution is also much slower because of calling .lower() on all elements.

See my other answer for a benchmarking comparison.




回答2:


You may use bisect.bisect on the lower-cased sorted list as:

from bisect import bisect
my_list = ["aa", "bb", "Dd", "ee"]
insert_string = "CC"

#                 convert all the items in list to lower case for
#               v finding the correct location via. bisect
index = bisect([i.lower() for i in my_list], insert_string.lower())
#                bisect based on lower-cased string for  ^
#                case-insensitive behavior

my_list.insert(index, insert_string)

where updated content of my_list will be:

['aa', 'bb', 'CC', 'Dd', 'ee']



回答3:


You could create your own type to encapsulate this behavior (combined with bisect as suggested in another answer).

from bisect import bisect

class CaseInsensitiveSortedList:
    def __init__(self, iterable):
        self.with_case = list(sorted(iterable, key=lambda s: s.lower()))
        self.without_case = [s.lower() for s in self.with_case]

    def insert_in_order(self, s):
        s_lower = s.lower()
        index = bisect(self.without_case, s_lower)
        self.without_case.insert(index, s_lower)
        self.with_case.insert(index, s)


test_list = CaseInsensitiveSortedList(['a', 'B', 'cd', 'E', 'fff'])

test_list.insert_in_order('D')
print(test_list.with_case) # ['a', 'B', 'cd', 'D', 'E', 'fff']

test_list.insert_in_order('ee')
print(test_list.with_case) # ['a', 'B', 'cd', 'D', 'E', 'ee', 'fff']

You could extend list directly and make this a little "more natural" or do whatever else you want with it. This is just an idea to avoid calling str.lower on every element for every insert.




回答4:


I have never worked with bisect but here is my stab at it. The first function I take directly from the bisect page that you linked:

def index(a, x):
    'Locate the leftmost value exactly equal to x'
    i = bisect_left(a, x)
    if i != len(a) and a[i] == x:
        return i
    raise ValueError

def insert_into_sorted(char, my_list):
    marker = chr(ord(char) + 1)
    spot = index(my_list, marker)
    my_list[spot:spot] = char
    return my_list

x = ['a', 'b', 'd', 'e']

insert_into_sorted('c', x)

>>['a', 'b', 'c', 'd', 'e']



回答5:


Based on comment from OP, since it is OK to maintain the intermediate list to hold lower-cased strings (one time operation); it could be achieved as:

from bisect import bisect
my_list = ["aa", "bb", "Dd", "ee"]

lower_list = [i.lower() for i in my_list]  # list of lower-cased strings.
                                           # one time operation
insert_string = "CC"  # word to insert

# find index based on lower-cased list
index = bisect(lower_list, insert_string.lower())

my_list.insert(index, insert_string)  # insert word in original list
lower_list.insert(index, insert_string.lower())   # insert lower-cased word
                                                  # in lower-cased list

where final value of my_list will be lower_list will be:

>>> my_list   # original list
['aa', 'bb', 'CC', 'Dd', 'ee']

>>> lower_list
['aa', 'bb', 'cc', 'dd', 'ee']

Here we are bisecting the list of lower-cased words to find the index and based on the index, inserting the string into original list.




回答6:


I see you added test results to you question. I did some benchmarks as well now and get a similar picture:

Insorting 20000 words:
 80.224 seconds with insort_sorting
  0.166 seconds with insort_own_binary_search
 70.294 seconds with insort_lower_all
  0.094 seconds with insort_keep_lower

You're somewhat wrong about the fast two, though. With larger numbers of insertions, mine becomes faster. About twice as fast:

Insorting 1000000 words:
 92.712 seconds with insort_own_binary_search
173.577 seconds with insort_keep_lower

That's because the O(log n) time for searching the index becomes negligible and the time is dominated by the O(n) time for the insert calls. And my solution has only one of those while the other solution has two.

Another difference is space complexity, keeping an extra list of lowered versions of all strings isn't as good.

Here's my benchmarking code:

import random, string, time

#--------------------------------------------------------------
def insort_sorting(a, x):
    a.append(x)
    a.sort(key=str.lower)
#--------------------------------------------------------------
def insort_own_binary_search(a, x):
    key = x.lower()
    lo, hi = 0, len(myList)
    while lo < hi:
        mid = (lo + hi) // 2
        if key < a[mid].lower():
            hi = mid
        else:
            lo = mid + 1
    a.insert(lo, x)
#--------------------------------------------------------------
from bisect import bisect
def insort_lower_all(a, x):
    index = bisect([i.lower() for i in a], x.lower())
    a.insert(index, x)
#--------------------------------------------------------------
from bisect import bisect
def insort_keep_lower(a, x, lower=[]):
    x_lower = x.lower()
    index = bisect(lower, x_lower)
    a.insert(index, x)
    lower.insert(index, x_lower)
#--------------------------------------------------------------

# Generate random words    
words = [''.join(random.choice(string.ascii_letters) for _ in range(10))
         for _ in range(20000)]
#         for _ in range(1000000)]

# Compare the solutions
print 'Insorting', len(words), 'words:'
reference = None
for insort in insort_sorting, insort_own_binary_search, insort_lower_all, insort_keep_lower:
#for insort in insort_own_binary_search, insort_keep_lower:
    t0 = time.time()
    myList = []
    for word in words:
        insort(myList, word)
    print '%7.3f seconds with %s' % (time.time() - t0, insort.__name__)
    if reference is None:
        reference = myList
    else:
        assert myList == reference


来源:https://stackoverflow.com/questions/41902958/insert-item-into-case-insensitive-sorted-list-in-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!