remove part of an array when nan sequence > 20 in a row

Submitted by 自闭症网瘾萝莉.ら on 2019-12-08 13:14:49

Question


I can remove all NaNs from a NumPy array x, and the corresponding entries from a related array y, with a mask:

y = y[~np.isnan(x)]
x = x[~np.isnan(x)]

Now I need to remove only the parts where there are many NaNs in a row (let's say 20). Does anyone know how to handle this?


Answer 1:


There's a bit of ambiguity in the question, so it's worth covering both readings. Either you need to remove sections of 1D data that contain more than 20 consecutive NaNs, or you need to remove rows of 2D data that contain more than 20 NaNs anywhere in the row. The latter has already been answered by Tai, so I'll answer the former.

The idea here is to find out what indices the NaNs are at, and then group these indices into streaks where they occur consecutively, filter out the streaks that aren't long enough, and finally construct a mask with the remaining streaks/indices (whew).

import numpy as np

# Construct some test data
x = np.arange(150, dtype=float)  # np.float was removed in NumPy 1.24
x[20:50] = np.nan  # remove this streak (np.NaN was removed in NumPy 2.0)
x[70:80] = np.nan  # keep this streak
x[105:140] = np.nan  # remove this streak
x[149] = np.nan  # keep this lone soldier
print("Original (with long streaks): ", x)

# Calculate streaks, filter out streaks that are too short, apply global mask
nan_spots = np.where(np.isnan(x))[0]
# Split the NaN indices wherever two neighbouring indices are not adjacent
streaks = np.split(nan_spots, np.where(np.diff(nan_spots) != 1)[0] + 1)
long_streaks = [streak for streak in streaks if len(streak) > 20]
mask = np.ones(len(x), dtype=bool)
if long_streaks:  # np.concatenate raises on an empty list
    mask[np.concatenate(long_streaks)] = False
print("Filtered (without long streaks): ", x[mask])

assert len(x[mask]) == len(x) - (50 - 20) - (140-105)

Outputs:

Original (with long streaks):  [  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.
  14.  15.  16.  17.  18.  19.  nan  nan  nan  nan  nan  nan  nan  nan
  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan
  nan  nan  nan  nan  nan  nan  nan  nan  50.  51.  52.  53.  54.  55.
  56.  57.  58.  59.  60.  61.  62.  63.  64.  65.  66.  67.  68.  69.
  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  80.  81.  82.  83.
  84.  85.  86.  87.  88.  89.  90.  91.  92.  93.  94.  95.  96.  97.
  98.  99. 100. 101. 102. 103. 104.  nan  nan  nan  nan  nan  nan  nan
  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan
  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan
 140. 141. 142. 143. 144. 145. 146. 147. 148.  nan]

Filtered (without long streaks):  [  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.
  14.  15.  16.  17.  18.  19.  50.  51.  52.  53.  54.  55.  56.  57.
  58.  59.  60.  61.  62.  63.  64.  65.  66.  67.  68.  69.  nan  nan
  nan  nan  nan  nan  nan  nan  nan  nan  80.  81.  82.  83.  84.  85.
  86.  87.  88.  89.  90.  91.  92.  93.  94.  95.  96.  97.  98.  99.
 100. 101. 102. 103. 104. 140. 141. 142. 143. 144. 145. 146. 147. 148.
  nan]

If need be, apply the same mask to y (i.e. y = y[mask]). You can generalize this to multidimensional data, but you'll have to pick the axis along which to look for consecutive NaNs.
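As an alternative to splitting the index array, the same filtering can be sketched by locating the start and end of each NaN run directly. This is not the approach used in the answer above. The function name long_nan_run_mask, its max_run parameter, and the y test array are hypothetical, introduced only for this illustration.

```python
import numpy as np

def long_nan_run_mask(x, max_run=20):
    """Return a boolean mask that is False inside NaN runs longer than max_run."""
    is_nan = np.isnan(x)
    # Pad with False on both sides so every run has a start edge and an end edge.
    padded = np.concatenate(([False], is_nan, [False]))
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)   # index of the first NaN in each run
    ends = np.flatnonzero(edges == -1)    # index one past the last NaN in each run
    keep = np.ones(len(x), dtype=bool)
    for start, end in zip(starts, ends):
        if end - start > max_run:
            keep[start:end] = False
    return keep

# Same test data as above, plus a made-up companion array y
x = np.arange(150, dtype=float)
x[20:50] = np.nan
x[70:80] = np.nan
x[105:140] = np.nan
x[149] = np.nan
y = np.arange(150, dtype=float) * 2

keep = long_nan_run_mask(x)
x_filtered, y_filtered = x[keep], y[keep]
print(len(x_filtered), len(y_filtered))
```

The loop runs once per NaN run rather than once per element, which may matter for long arrays with few runs. Short runs (the 10-element streak and the lone NaN at the end) stay in x_filtered, and y_filtered stays aligned with it because both are indexed by the same mask.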



Source: https://stackoverflow.com/questions/51124540/remove-part-of-an-array-when-nan-sequence-20-in-a-row
