numpy genfromtxt - how to detect bad int input values

Posted by 跟風遠走 on 2021-01-29 06:49:30

Question


Here is a trivial example of a bad int value passed to numpy.genfromtxt. For some reason I can't detect this bad value, because it shows up as a valid int, -1.

>>> bad = '''a,b
0,BAD
1,2
3,4'''.splitlines()

My input here has 2 columns of ints, named a and b. b has a bad value, where we have a string "BAD" instead of an integer. However, when I call genfromtxt, I cannot detect this bad value.

>>> import numpy as np
>>> out = np.genfromtxt(bad, delimiter=',', dtype=(np.dtype('int64'), np.dtype('int64')), names=True, usemask=True, usecols=tuple('ab'))
>>> out

masked_array(data=[(0, -1), (1, 2), (3, 4)],
         mask=[(False, False), (False, False), (False, False)],
   fill_value=(999999, 999999),
        dtype=[('a', '<i8'), ('b', '<i8')])

>>> out['b'].data
array([-1,  2,  4])

I print out column 'b' from my output, and I'm shocked to see that it has a -1 where the string "BAD" is supposed to be. The mask is False everywhere, so the user has no idea that there was bad input. In fact, if you only look at the output, this is indistinguishable from the following input:

>>> bad2 = '''a,b
0,-1
1,2
3,4'''.splitlines()

I feel like I must be using genfromtxt wrong. How is it possible that it can't detect bad input?


Answer 1:


In np.lib._iotools I found this method of the StringConverter class:

def _loose_call(self, value):
    try:
        return self.func(value)
    except ValueError:
        return self.default

When genfromtxt converts the rows it has read, it does

if loose:
    rows = list(
        zip(*[[conv._loose_call(_r) for _r in map(itemgetter(i), rows)]
              for (i, conv) in enumerate(converters)]))

where loose is an input parameter of genfromtxt (True by default). So in the case of the int converter it tries

int(astring)

and if that produces a ValueError it returns the default value (e.g. -1) instead of raising an error. Similarly for float and np.nan.
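Since loose is a genfromtxt parameter, one way to surface the bad value is to turn it off. The following is a minimal sketch, assuming NumPy is installed and that the strict path raises ValueError for a string it cannot convert; load_strict is a hypothetical helper name, and the dtype is written as the string "i8,i8" rather than the tuple used in the question.

```python
import numpy as np

bad = """a,b
0,BAD
1,2
3,4""".splitlines()

bad2 = """a,b
0,-1
1,2
3,4""".splitlines()


def load_strict(lines):
    """Parse two int64 columns without silently substituting defaults."""
    return np.genfromtxt(
        lines,
        delimiter=",",
        dtype="i8,i8",
        names=True,
        loose=False,  # strict conversion: unconvertible strings raise ValueError
    )


# The input containing "BAD" should now be rejected instead of becoming -1.
try:
    load_strict(bad)
    error = None
except ValueError as exc:
    error = exc

# A genuine -1 in the input still parses normally.
good = load_strict(bad2)

print("bad input rejected:", error is not None)
print("column b of good input:", good["b"].tolist())
```

With this setting a real -1 and an unparseable string are no longer indistinguishable, at the cost of the whole load failing on the first bad field.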

The usemask parameter is applied in:

        if usemask:
            append_to_masks(tuple([v.strip() in m
                                   for (v, m) in zip(values,
                                                     missing_values)]))

Define 2 converters to give more information on what's processed:

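def myint(astr):
    # Report any field int() rejects, then substitute a sentinel.
    try:
        v = int(astr)
    except ValueError:
        print('err',astr)
        v = -999
    return v

def myfloat(astr):
    # Report any field float() rejects, then substitute a sentinel.
    try:
        v = float(astr)
    except ValueError:
        print('err',astr)
        v = float('-inf')
    return v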

A sample text:

txt='''1,2
3,nan
,foo
bar,
'''.splitlines()

And using the converters:

In [242]: np.genfromtxt(txt, delimiter=',', converters={0:myint, 1:myfloat})
err b''
err b'bar'
err b'foo'
err b''
Out[242]: 
array([(   1,   2.), (   3,  nan), (-999, -inf), (-999, -inf)],
      dtype=[('f0', '<i8'), ('f1', '<f8')])

And to see what usemask does:

In [243]: np.genfromtxt(txt, delimiter=',', converters={0:myint, 1:myfloat}, usemask=True)
err b''
err b'bar'
err b'foo'
err b''
Out[243]: 
masked_array(data=[(1, 2.0), (3, nan), (--, -inf), (-999, --)],
             mask=[(False, False), (False, False), ( True, False),
                   (False,  True)],
       fill_value=(999999, 1.e+20),
            dtype=[('f0', '<i8'), ('f1', '<f8')])

A missing value is a '' string, and int('') produces a ValueError just as int('bad') does. So for the converter, whether the default one or my custom ones, a missing value is the same as a bad one. Your converter could make a distinction. But only 'missing' values set the mask.
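To make that distinction explicit, the text can be checked before it is handed to genfromtxt. This is a minimal pure-Python sketch, not part of NumPy: classify_int_fields is a hypothetical helper that assumes a single header line, a plain delimiter with no quoting, and that every column should hold an int.

```python
def classify_int_fields(lines, delimiter=",", skip_header=1):
    """Return (line_number, column_index, text, kind) for fields int() rejects.

    kind is "missing" for empty fields and "bad" for non-empty, unparseable ones.
    """
    problems = []
    # Line numbers are 1-based and count the header line.
    for line_number, line in enumerate(lines[skip_header:], start=skip_header + 1):
        for col, text in enumerate(line.split(delimiter)):
            field = text.strip()
            if field == "":
                problems.append((line_number, col, field, "missing"))
                continue
            try:
                int(field)
            except ValueError:
                problems.append((line_number, col, field, "bad"))
    return problems


sample = """a,b
0,BAD
1,2
,4""".splitlines()

for problem in classify_int_fields(sample):
    print(problem)
# (2, 1, 'BAD', 'bad')
# (4, 0, '', 'missing')
```

An empty result means every field parsed as an int, so a -1 in the loaded array came from the file itself.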



Source: https://stackoverflow.com/questions/65317590/numpy-genfromtxt-how-to-detect-bad-int-input-values
