Simply put, how can I apply quantile normalization to a large Pandas dataframe (probably 2,000,000 rows) in Python?
PS. I know that there is a package named rpy2 w
The code below gives the same result as preprocessCore::normalize.quantiles.use.target, and I find it simpler and clearer than the solutions above. Performance should also be good even for very large arrays, since the only expensive steps are sorts.
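import numpy as np

def quantile_normalize_using_target(x, target):
    """
    Both `x` and `target` are numpy arrays of equal lengths.
    """
    # Sorted reference values: position i holds the i-th smallest target value.
    target_sorted = np.sort(target)
    # The double argsort gives the 0-based rank of each element of x.
    return target_sorted[x.argsort().argsort()]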
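To see why the double argsort works, here is a small worked example (the input values are made up for illustration). The first argsort gives the order that would sort x, and applying argsort again turns that into each element's 0-based rank, which is then used to index into the sorted target.

```python
import numpy as np

# Made-up illustration data, not from the question.
x = np.array([30.0, 10.0, 20.0])
target = np.array([5.0, 1.0, 3.0])

ranks = x.argsort().argsort()      # rank of each element of x: [2, 0, 1]
target_sorted = np.sort(target)    # [1.0, 3.0, 5.0]
normalized = target_sorted[ranks]  # [5.0, 1.0, 3.0]
```

Note that tied values in x receive distinct ranks here, so this sketch does not average over ties.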
Once you have a pandas.DataFrame, this is easy to do:
# .to_numpy() replaces .as_matrix(), which was removed in pandas 1.0
quantile_normalize_using_target(df[0].to_numpy(),
                                df[1].to_numpy())
(In the example above, the first column is normalized to the second one, which serves as the reference distribution.)
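If you want to normalize every column of the dataframe at once, the same rank-and-lookup idea can be extended along axis 0. The following is only a sketch under some assumptions of mine: the function name `quantile_normalize_df` is hypothetical, all columns are numeric with no NaNs, and the shared target is taken to be the mean of the column-wise sorted values (the usual choice in classic quantile normalization, but not something stated in the answer above). Ties are broken arbitrarily rather than averaged.

```python
import numpy as np
import pandas as pd

def quantile_normalize_df(df):
    """Hypothetical helper: normalize all columns to one shared target.

    Assumes numeric columns without NaNs. The target distribution is the
    mean of the column-wise sorted values (an assumption of this sketch).
    """
    values = df.to_numpy(dtype=float)
    # Sort each column independently, then average across columns per rank.
    target_sorted = np.sort(values, axis=0).mean(axis=1)
    # 0-based rank of each value within its own column.
    ranks = values.argsort(axis=0).argsort(axis=0)
    # Look up the target value for each rank; result has the same shape as df.
    return pd.DataFrame(target_sorted[ranks], index=df.index, columns=df.columns)

# Made-up example data without ties.
df = pd.DataFrame({0: [5.0, 2.0, 3.0, 4.0],
                   1: [4.0, 1.0, 4.5, 2.0],
                   2: [3.0, 4.0, 6.0, 8.0]})
normalized = quantile_normalize_df(df)
```

After this, every column of `normalized` contains the same set of values, arranged according to the ranks of the original column. Everything is vectorized, so the cost is dominated by the per-column sorts.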