I need the cross-mapped indices for NumPy union and intersection operations. The code I have below works fine, but I would like to vectorize it before I apply it to large data.
For cases like these, it helps to convert the strings to numeric IDs first, since NumPy works with numbers far more efficiently than with strings. The outputs are numeric arrays anyway, so it makes sense to have numeric IDs upfront. For this conversion I have seen people use a lambda, among other approaches,
but I would go with np.unique, which is quite efficient for cases like these. Here's the implementation, starting with the numeric ID conversion:
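import numpy as np
# ------------------------ Setup work -------------------------------
# A and B are assumed to be 1-D NumPy arrays of strings.
# Map every element of A and B to an integer ID shared by both arrays.
_, idx1 = np.unique(np.append(A, B), return_inverse=True)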
A_ID = idx1[:A.size]
B_ID = idx1[A.size:]
# ------------------------ Union work -------------------------------
# Number of rows in zc1: the largest ID plus one.
lenC = idx1.max() + 1
# Initialize output array zc1 and fill it with NaNs.
zc1 = np.full((lenC, 3), np.nan)
# Fill the first column with consecutive numbers starting at 0.
zc1[:, 0] = np.arange(lenC)
# Most important part of the code:
# at the rows given by the IDs of A and B, write each element's position
# in A (column 1) and in B (column 2). If a value repeats within A or B,
# the last position is the one that is kept.
zc1[A_ID,1] = np.arange(A_ID.size)
zc1[B_ID,2] = np.arange(B_ID.size)
# ------------------------ Intersection work -------------------------------
# Get intersecting indices between A and B
intersect_ID = np.argwhere(A_ID[:,None] == B_ID)
# Initialize output zd1 based on the number of intersections
lenD = intersect_ID.shape[0]
zd1 = np.full((lenD, 3), np.nan)
# Fill the first column with consecutive numbers starting at 0
zd1[:, 0] = np.arange(lenD)
zd1[:,1:] = intersect_ID
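
To check the result on something small, here is a minimal sketch that wraps the same steps in a function and runs them on two toy arrays. The function name `cross_map` and the sample arrays are made up for illustration; they are not from the question.

```python
import numpy as np

def cross_map(A, B):
    """Hypothetical helper: returns (union_map, intersection_map) for 1-D arrays A and B."""
    # Shared integer IDs for every element of A and B.
    _, idx = np.unique(np.append(A, B), return_inverse=True)
    A_ID, B_ID = idx[:A.size], idx[A.size:]

    # Union: one row per unique value, NaN where the value is absent.
    lenC = idx.max() + 1
    zc = np.full((lenC, 3), np.nan)
    zc[:, 0] = np.arange(lenC)
    zc[A_ID, 1] = np.arange(A_ID.size)
    zc[B_ID, 2] = np.arange(B_ID.size)

    # Intersection: (index in A, index in B) for every matching pair.
    pairs = np.argwhere(A_ID[:, None] == B_ID)
    zd = np.column_stack((np.arange(len(pairs)), pairs)).astype(float)
    return zc, zd

# Sample data, assumed for illustration only.
A = np.array(['a', 'b', 'c'])
B = np.array(['b', 'c', 'd'])
zc, zd = cross_map(A, B)

print(zc)
# [[ 0.  0. nan]
#  [ 1.  1.  0.]
#  [ 2.  2.  1.]
#  [ 3. nan  2.]]
print(zd)
# [[0. 1. 0.]
#  [1. 2. 1.]]
```

Two things to keep in mind: the union rows follow the sorted order of the unique values, because that is how np.unique assigns the IDs, and the broadcasted comparison in the intersection step builds a len(A) x len(B) boolean array, so its memory use grows with the product of the two sizes.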