Question
I can remove all NaNs from a NumPy array x, and the corresponding entries from a related array y, with a mask:
y = y[~np.isnan(x)]
x = x[~np.isnan(x)]
Now I need to remove only the parts where many NaNs occur in a row (say, 20). Does anyone know how to handle this?
Answer 1:
There's a bit of ambiguity in the question, but regardless, it'll be nice to answer both versions. I'm not sure if you meant that you need to remove sections of 1D data where there are more than 20 consecutive NaNs, or that you need to remove rows of 2D data that contain more than 20 NaNs (anywhere in the row). The latter has already been answered by Tai, so I'll answer the former.
The idea here is to find out what indices the NaNs are at, and then group these indices into streaks where they occur consecutively, filter out the streaks that aren't long enough, and finally construct a mask with the remaining streaks/indices (whew).
import numpy as np
# Construct some test data
x = np.arange(150, dtype=float)  # the np.float alias was removed in NumPy 1.24
x[20:50] = np.nan  # remove this streak (np.NaN was removed in NumPy 2.0)
x[70:80] = np.nan  # keep this streak
x[105:140] = np.nan  # remove this streak
x[149] = np.nan  # keep this lone soldier
print("Original (with long streaks): ", x)
# Calculate streaks, filter out streaks that are too short, apply global mask
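nan_spots = np.flatnonzero(np.isnan(x))  # indices of all NaNs
# Split the indices wherever consecutive NaN positions are not adjacent
streaks = np.split(nan_spots, np.flatnonzero(np.diff(nan_spots) != 1) + 1)
long_streaks = [streak for streak in streaks if len(streak) > 20]
# Boolean mask: True = keep; also works when there are no long streaks
mask = np.ones(len(x), dtype=bool)
for streak in long_streaks:
    mask[streak] = False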
print("Filtered (without long streaks): ", x[mask])
assert len(x[mask]) == len(x) - (50 - 20) - (140-105)
Outputs:
Original (with long streaks): [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13.
14. 15. 16. 17. 18. 19. nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan 50. 51. 52. 53. 54. 55.
56. 57. 58. 59. 60. 61. 62. 63. 64. 65. 66. 67. 68. 69.
nan nan nan nan nan nan nan nan nan nan 80. 81. 82. 83.
84. 85. 86. 87. 88. 89. 90. 91. 92. 93. 94. 95. 96. 97.
98. 99. 100. 101. 102. 103. 104. nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan
140. 141. 142. 143. 144. 145. 146. 147. 148. nan]
Filtered (without long streaks): [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13.
14. 15. 16. 17. 18. 19. 50. 51. 52. 53. 54. 55. 56. 57.
58. 59. 60. 61. 62. 63. 64. 65. 66. 67. 68. 69. nan nan
nan nan nan nan nan nan nan nan 80. 81. 82. 83. 84. 85.
86. 87. 88. 89. 90. 91. 92. 93. 94. 95. 96. 97. 98. 99.
100. 101. 102. 103. 104. 140. 141. 142. 143. 144. 145. 146. 147. 148.
nan]
And if need be, just apply the same mask to y (i.e. y = y[mask]). You can generalize this to multidimensional data, but you'll have to pick the axis along which to look for consecutive NaNs.
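The steps above could also be wrapped in a reusable helper that filters x and y together. The following is only a minimal sketch and not part of the original answer: the function name `drop_long_nan_runs` and its `max_run` parameter are hypothetical, and it finds run boundaries with `np.diff` on a padded boolean mask rather than with `np.split`. It handles 1D input only.

```python
import numpy as np

def drop_long_nan_runs(x, y=None, max_run=20):
    """Drop entries of 1D x (and the matching entries of y) that lie in
    runs of more than `max_run` consecutive NaNs. Hypothetical helper."""
    x = np.asarray(x, dtype=float)
    is_nan = np.isnan(x)
    # Pad with False so that runs touching either end get a start and a stop.
    padded = np.concatenate(([False], is_nan, [False]))
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)    # first index of each NaN run
    stops = np.flatnonzero(edges == -1)    # one past the last index of each run
    keep = np.ones(x.size, dtype=bool)
    for start, stop in zip(starts, stops):
        if stop - start > max_run:
            keep[start:stop] = False
    if y is None:
        return x[keep]
    return x[keep], np.asarray(y)[keep]

# Usage with the same test data as above
x = np.arange(150, dtype=float)
x[20:50] = np.nan
x[70:80] = np.nan
x[105:140] = np.nan
x[149] = np.nan
y = np.arange(150)
x_filtered, y_filtered = drop_long_nan_runs(x, y, max_run=20)
```

Short runs (here the 10 NaNs at 70:80 and the single NaN at 149) are left in place, just as in the answer's version; only the runs longer than `max_run` are dropped from both arrays.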
Source: https://stackoverflow.com/questions/51124540/remove-part-of-an-array-when-nan-sequence-20-in-a-row