Python equivalent of MATLAB's “ismember” function

孤城傲影 2020-11-27 07:05

After many attempts to optimize the code, it seems that a last resort would be to run the code below on multiple cores. I don't know exactly how to conver

5 answers
  • 2020-11-27 07:39

    sfstewman's excellent answer most likely solved the issue for you.

    I'd just like to add how you can achieve the same exclusively in numpy.

    I make use of numpy's unique and in1d functions.

    B_unique_sorted, B_idx = np.unique(B, return_index=True)
    B_in_A_bool = np.in1d(B_unique_sorted, A, assume_unique=True)
    
    • B_unique_sorted contains the unique values in B sorted.
    • B_idx holds for these values the indices into the original B.
    • B_in_A_bool is a boolean array the size of B_unique_sorted that stores whether a value in B_unique_sorted is in A.
      Note: I need to look for (unique vals from B) in A because I need the output to be returned with respect to B_idx
      Note: I assume that A is already unique.

    Now you can use B_in_A_bool to either get the common vals

    B_unique_sorted[B_in_A_bool]
    

    and their respective indices in the original B

    B_idx[B_in_A_bool]
    

    Finally, I assume that this is significantly faster than the pure Python for-loop although I didn't test it.
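As a small worked example of the steps above, here is a minimal sketch. The sample arrays are made up for illustration, and np.isin is used in place of np.in1d because newer NumPy releases recommend it; for 1-D inputs the two are expected to behave the same way.

```python
import numpy as np

# Illustrative data (not from the original question); A is assumed unique
A = np.array([3, 6, 7])
B = np.array([2, 5, 2, 6, 3])

# Unique values of B in sorted order, plus the index of each first occurrence in B
B_unique_sorted, B_idx = np.unique(B, return_index=True)

# For each unique value of B, is it present in A?
B_in_A_bool = np.isin(B_unique_sorted, A, assume_unique=True)

common_vals = B_unique_sorted[B_in_A_bool]  # values present in both arrays
common_idx = B_idx[B_in_A_bool]             # where those values first occur in B

print(common_vals)  # [3 6]
print(common_idx)   # [4 3]
```

Note that the indices come back in the sorted order of the values, not in the order the values appear in B.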

  • 2020-11-27 07:39

    Try using a list comprehension:

    In [1]: import numpy as np
    
    In [2]: A = np.array([3,4,4,3,6])
    
    In [3]: B = np.array([2,5,2,6,3])
    
    In [4]: [x for x in A if x not in B]
    Out[4]: [4, 4]
    

    List comprehensions are generally somewhat faster than the equivalent for-loop, although each membership test against B here is still a linear scan of B.

    To get a list of equal length (in Python 3, map returns an iterator, so it is wrapped in list):

    In [19]: list(map(lambda x: x if x not in B else False, A))
    Out[19]: [False, 4, 4, False, False]
    

    This is quite fast for small datasets:

    In [20]: C = np.arange(10000)
    
    In [21]: D = np.arange(15000, 25000)
    
    In [22]: %timeit list(map(lambda x: x if x not in D else False, C))
    1 loops, best of 3: 756 ms per loop
    

    For large datasets, you could try using a multiprocessing.Pool.map() to speed up the operation.
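Before reaching for multiple processes, it may be worth removing the per-element linear scan, since testing membership against an array walks the whole array each time. The following is only a sketch of that idea, using a set for constant-time lookups; the names B_set, result and mask are illustrative.

```python
import numpy as np

A = np.array([3, 4, 4, 3, 6])
B = np.array([2, 5, 2, 6, 3])

# Build the set once; each membership test is then O(1) on average
B_set = set(B.tolist())

# Same shape of output as the map/lambda version: keep the value, or False if it is in B
result = [x if x not in B_set else False for x in A.tolist()]
print(result)  # [False, 4, 4, False, False]

# Vectorised alternative: a boolean mask of the elements of A that are not in B
mask = ~np.isin(A, B)
print(A[mask])  # [4 4]
```

Whether this is fast enough depends on the data, so it is worth timing it on the real arrays before parallelising.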

  • 2020-11-27 07:42

    Here is the MATLAB equivalent that returns both output arguments [Lia, Locb]. It matches MATLAB except that in Python 0 is a valid index, so 0 cannot be used to mark elements that were not found. This function therefore leaves those entries out; it essentially returns Locb(Locb>0). The performance is also comparable to MATLAB's.

    import numpy as np

    def ismember(a_vec, b_vec):
        """ MATLAB equivalent ismember function """

        bool_ind = np.isin(a_vec, b_vec)   # True where an element of a_vec occurs in b_vec
        common = a_vec[bool_ind]           # fixed: the original referred to an undefined name `a`
        common_unique, common_inv = np.unique(common, return_inverse=True)     # common = common_unique[common_inv]
        b_unique, b_ind = np.unique(b_vec, return_index=True)  # b_unique = b_vec[b_ind]
        common_ind = b_ind[np.isin(b_unique, common_unique, assume_unique=True)]
        return bool_ind, common_ind[common_inv]
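To illustrate what "returns Locb(Locb>0)" means in practice, here is a short usage sketch. It repeats the function (with a_vec used consistently) so that it runs on its own, and then rebuilds a full-length index vector. The -1 sentinel and the names lia, locb_found and locb_full are assumptions made for this example, not part of the answer.

```python
import numpy as np

def ismember(a_vec, b_vec):
    """MATLAB-style ismember: membership mask plus indices into b_vec for the matches."""
    bool_ind = np.isin(a_vec, b_vec)
    common = a_vec[bool_ind]
    common_unique, common_inv = np.unique(common, return_inverse=True)
    b_unique, b_ind = np.unique(b_vec, return_index=True)
    common_ind = b_ind[np.isin(b_unique, common_unique, assume_unique=True)]
    return bool_ind, common_ind[common_inv]

a = np.array([3, 4, 4, 3, 6])
b = np.array([2, 5, 2, 6, 3])

lia, locb_found = ismember(a, b)
print(lia)         # [ True False False  True  True]
print(locb_found)  # [4 4 3]

# Optional: a full-length index vector, with -1 (an assumed sentinel) where nothing matched
locb_full = np.full(a.shape, -1, dtype=int)
locb_full[lia] = locb_found
print(locb_full)   # [ 4 -1 -1  4  3]
```

If -1 is used this way, it has to be filtered out before indexing, because b[-1] silently refers to the last element.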
    

    An alternate implementation that is a bit (~5x) slower but doesn't use the unique function is here:

    def ismember(a_vec, b_vec):
        ''' MATLAB equivalent ismember function. Slower than above implementation'''
        b_dict = {}
        for i, x in enumerate(b_vec):
            b_dict.setdefault(x, i)  # keep the first (lowest) index of each value, as MATLAB does
        indices = [b_dict[x] for x in a_vec if x in b_dict]
        booleans = np.isin(a_vec, b_vec)
        return booleans, np.array(indices, dtype=int)
    
  • 2020-11-27 07:52

    Before worrying about multiple cores, I would eliminate the linear scan in your ismember function by using a dictionary:

    def ismember(a, b):
        bind = {}
        for i, elt in enumerate(b):
            if elt not in bind:
                bind[elt] = i
        return [bind.get(itm, None) for itm in a]  # None can be replaced by any other "not in b" value
    

    Your original implementation requires a full scan of the elements in B for each element in A, making it O(len(A)*len(B)). The above code requires one full scan of B to build the dict bind. With a dict, the lookup into B takes constant time on average for each element of A, making the whole operation O(len(A)+len(B)). If this is still too slow, then worry about making the above function run on multiple cores.
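The difference between the two approaches can be sketched as follows. Since the original code is not shown here, ismember_scan is only a hypothetical stand-in for a linear-scan implementation; ismember_dict follows the dictionary idea described above.

```python
def ismember_scan(a, b):
    # Hypothetical linear-scan version: `x in b` and `b.index(x)` each walk the list,
    # so the total cost grows as len(a) * len(b)
    return [b.index(x) if x in b else None for x in a]

def ismember_dict(a, b):
    # One pass over b to record the first index of each value ...
    first_index = {}
    for i, elt in enumerate(b):
        first_index.setdefault(elt, i)
    # ... then one constant-time (on average) lookup per element of a
    return [first_index.get(x) for x in a]

a = [3, 4, 4, 3, 6]
b = [2, 5, 2, 6, 3]

print(ismember_scan(a, b))  # [4, None, None, 4, 3]
print(ismember_dict(a, b))  # [4, None, None, 4, 3]
```

Both return the same result; only the amount of work per lookup differs.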

    Edit: I've also modified your indexing slightly. MATLAB uses 0 because all of its arrays start at index 1. Python/numpy start arrays at 0, so if your data set looks like this

    A = [2378, 2378, 2378, 2378]
    B = [2378, 2379]
    

    and you return 0 for no element, then your results will exclude all elements of A. The above routine returns None for no index instead of 0. Returning -1 is an option, but Python will interpret that to be the last element in the array. None will raise an exception if it's used as an index into a Python list (a NumPy array would instead treat None as np.newaxis). If you'd like different behavior, change the second argument in the bind.get(itm, None) expression to the value you want returned.
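The point about sentinel values can be checked with a short sketch on plain Python lists. The value 1234 and the variable names are made up for illustration.

```python
def ismember(a, b):
    bind = {}
    for i, elt in enumerate(b):
        if elt not in bind:
            bind[elt] = i
    return [bind.get(itm, None) for itm in a]

A = [2378, 2378, 2378, 2378]
B = [2378, 2379]

found = ismember(A, B)
print(found)    # [0, 0, 0, 0]  -> 0 is a real index here, not "missing"

missing = ismember([1234], B)
print(missing)  # [None]

# -1 would be accepted silently as "last element"
print(B[-1])    # 2379

# None is rejected when used to index a list
try:
    B[missing[0]]
    raised = False
except TypeError:
    raised = True
print(raised)   # True
```

This is why None is a safer "not found" marker than 0 or -1 when the result is later used to index a list.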

  • 2020-11-27 08:06

    Try the ismember library.

    pip install ismember
    

    Simple example:

    # Import library
    from ismember import ismember
    import numpy as np
    
    # data
    A = np.array([3,4,4,3,6])
    B = np.array([2,5,2,6,3])
    
    # Lookup
    Iloc,idx = ismember(A, B)
     
    # Iloc is a boolean array marking which elements of A exist in B
    print(Iloc)
    # [ True False False  True  True]
    
    # idx holds the indices into B of the elements of A that were found
    print(idx)
    # [4 4 3]
    
    print(B[idx])
    # [3 3 6]
    
    print(A[Iloc])
    # [3 3 6]
    
    # These vectors will match
    A[Iloc]==B[idx]
    

    Speed check:

    from datetime import datetime

    import numpy as np
    from ismember import ismember

    # Dictionary-based version from the answer above, defined before it is used
    def ismember_stack(a, b):
        bind = {}
        for i, elt in enumerate(b):
            if elt not in bind:
                bind[elt] = i
        return [bind.get(itm, None) for itm in a]  # None can be replaced by any other "not in b" value

    t1=[]
    t2=[]
    # Create some random vector lengths
    ns = np.random.randint(10,10000,1000)

    for n in ns:
        a_vec = np.random.randint(0,100,n)
        b_vec = np.random.randint(0,100,n)

        # Run stack version
        start = datetime.now()
        out1=ismember_stack(a_vec, b_vec)
        end = datetime.now()
        t1.append(end - start)

        # Run ismember
        start = datetime.now()
        out2=ismember(a_vec, b_vec)
        end = datetime.now()
        t2.append(end - start)


    print(np.sum(t1))
    # 0:00:07.778331

    print(np.sum(t2))
    # 0:00:04.609801
    

    In this test, the ismember function from PyPI is almost 2x faster.
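Timings taken with datetime.now() include wall-clock noise, so a repeat-and-take-the-best measurement may give steadier numbers. The following is only a sketch of that approach using time.perf_counter; best_time is a hypothetical helper, and it times only the dictionary-based function so that it runs without any third-party package besides NumPy.

```python
import time

import numpy as np

def ismember_stack(a, b):
    bind = {}
    for i, elt in enumerate(b):
        if elt not in bind:
            bind[elt] = i
    return [bind.get(itm, None) for itm in a]

def best_time(func, *args, repeats=5):
    """Hypothetical helper: run func(*args) several times, return the fastest run in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Fixed seed so the same vectors are used on every run
rng = np.random.default_rng(0)
a_vec = rng.integers(0, 100, 10_000)
b_vec = rng.integers(0, 100, 10_000)

elapsed = best_time(ismember_stack, a_vec, b_vec)
print(f"ismember_stack: {elapsed:.6f} s")
```

The same helper could be pointed at any other implementation to compare them on identical inputs.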

    Large vectors, e.g. 700000 elements:

    from ismember import ismember
    from datetime import datetime
    import numpy as np
    
    A = np.random.randint(0,100,700000)
    B = np.random.randint(0,100,700000)
    
    # Lookup
    start = datetime.now()
    Iloc,idx = ismember(A, B)
    end = datetime.now()
    
    # Print time
    print(end-start)
    # 0:00:01.194801
    