String formatting issue (parentheses vs underline)

南笙 2021-01-21 20:57

I have a text file containing all my data:

data = 'B:/tempfiles/bla.dat'

From the text file, I'm listing the column headers and their types with

2 Answers
  •  梦谈多话
    2021-01-21 21:29

    When you have problems with genfromtxt, the first thing you should do is print the shape and dtype of the array it returns.
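A minimal sketch of that first check. The in-memory sample and the field names `a` and `b` are assumptions made for illustration; they are not the asker's actual file or column headers.

```python
import io
import numpy as np

# Hypothetical two-column sample standing in for the real data file.
sample = io.StringIO("1 2\n3 4\n5 6\n")

# Hypothetical field names and types, chosen only for this sketch.
dt = [("a", "f8"), ("b", "f8")]

arr = np.genfromtxt(sample, dtype=dt)

print(arr.shape)        # a structured array is 1-D: one record per row
print(arr.dtype)        # the field names and types genfromtxt ended up with
print(arr.dtype.names)  # just the field names
print(arr["a"])         # a column is selected by field name
```

With a structured dtype the result has one record per row rather than a 2-D (rows, columns) layout, which is a common source of confusion.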

    Why do you have to use () in col_headers = [('VW_3_Avg','?

    Is it because the file has those names in the header?

    If you are giving your own dtype and using skip_header, it doesn't matter what's in the file. It's the field names in the dtype that count, not the ones in the file.
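To illustrate the point, here is a small sketch in which the header line in the data is skipped and the field names come only from the dtype. The sample text, the delimiter, and the names `x` and `y` are assumptions for the example.

```python
import io
import numpy as np

# Hypothetical file contents: one header line, then two data rows.
sample = io.StringIO("Some Header (1),Other Header (2)\n1,2\n3,4\n")

# Field names are supplied through the dtype, not read from the file.
dt = [("x", "f8"), ("y", "f8")]

# skip_header=1 discards the first line, so its contents never matter.
arr = np.genfromtxt(sample, dtype=dt, delimiter=",", skip_header=1)

print(arr.dtype.names)
print(arr)
```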

    We could dig into the dtype documentation and find just what characters are allowed. Field names that would work as Python variable names certainly will work. I'm not surprised the () would be disallowed or have problems, though I haven't tested that.


    Actually 'Lvl_Max(1)' is acceptable as a dtype field name:

    In [235]: col_headers = [('VW_3_Avg','
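The session line above is cut off in this copy, so the following is only a sketch of the same check. The `'f8'` type strings are an assumption, since the original types are not visible.

```python
import numpy as np

# Hypothetical reconstruction: the field names appear in the thread,
# the 'f8' types are assumed.
col_headers = [("VW_3_Avg", "f8"), ("Lvl_Max(1)", "f8")]

# np.dtype accepts the name with parentheses unchanged.
dt = np.dtype(col_headers)
print(dt.names)

# The field can be read and written by its full name.
arr = np.zeros(2, dtype=dt)
arr["Lvl_Max(1)"] = [1.5, 2.5]
print(arr)
```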

    What you should have done, right from the start, is show us datafile.shape and datafile.dtype. 90% of these genfromtxt problems stem from a misunderstanding of what the function returns.


    Let's try a simple file read with this dtype:

    In [239]: txt=b"""1 2
       .....: 3 4
       .....: 5 6
       .....: """
    In [240]: np.genfromtxt(txt.splitlines(),dtype=col_headers)
    Out[240]: 
    array([(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)], 
          dtype=[('VW_3_Avg', '

    Look at the dtype. genfromtxt has stripped off the '(1)'. It looks like genfromtxt 'sanitizes' the field names, no doubt because names in a text file could contain all kinds of funny stuff.

    From the genfromtxt docs:

    Numpy arrays with a structured dtype can also be viewed as recarray, where a field can be accessed as if it were an attribute. For that reason, we may need to make sure that the field name doesn’t contain any space or invalid character, or that it does not correspond to the name of a standard attribute (like size or shape), which would confuse the interpreter.
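A short sketch of why the docs care about this. The arrays and values are made up for illustration, and the `'f8'` types are assumed.

```python
import numpy as np

# Hypothetical data using the two field names from the question.
dt = np.dtype([("VW_3_Avg", "f8"), ("Lvl_Max(1)", "f8")])
arr = np.array([(1.0, 2.0), (3.0, 4.0)], dtype=dt)
rec = arr.view(np.recarray)

# A name that is a valid Python identifier works as an attribute.
print(rec.VW_3_Avg)

# 'Lvl_Max(1)' cannot be written as rec.Lvl_Max(1); Python would parse that
# as calling rec.Lvl_Max with the argument 1. Indexing by name still works.
print(rec["Lvl_Max(1)"])

# A field named like a standard attribute is shadowed by that attribute.
rec2 = np.array([(1.0,)], dtype=[("shape", "f8")]).view(np.recarray)
print(rec2.shape)     # the array's shape, not the field
print(rec2["shape"])  # the field itself
```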


    genfromtxt takes a deletechars parameter that should let you control which characters are deleted from the field names. But its application is inconsistent.

    In [282]: np.genfromtxt(txt.splitlines(),names=np.dtype(col_headers).names,deletechars=set(b' '),dtype=None)
    Out[282]: 
    array([(1, 2), (3, 4), (5, 6)], 
          dtype=[('VW_3_Avg', '

    dtype=None is required for this to work.
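A sketch of the same call written for Python 3, where the sample is text rather than bytes and deletechars is passed as a plain string. Whether the '(1)' survives depends on the NumPy version, so the example prints the resulting names instead of assuming them.

```python
import io
import numpy as np

# Same three-row sample as the session above, as text rather than bytes.
sample = io.StringIO("1 2\n3 4\n5 6\n")

# Field names from the question; the second one contains parentheses.
names = ["VW_3_Avg", "Lvl_Max(1)"]

# Ask genfromtxt to delete only spaces from the field names.
arr = np.genfromtxt(sample, names=names, deletechars=" ", dtype=None)

# Inspect what the installed version actually did with the names.
print(np.__version__)
print(arr.dtype.names)
```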

    The default set is large:

    defaultdeletechars = set("""~!@#$%^&*()-=+~\|]}[{';: /?.>,<""")
    

    The problem is that deletechars is passed to the validator:

    validate_names = NameValidator(...
                                   deletechars=deletechars,...)
    

    which is used to clean names from the header and the names parameter. But then the names (and dtype) are passed through

    dtype = easy_dtype(dtype, defaultfmt=defaultfmt, names=names)
    

    without the deletechars parameter. This issue was addressed about a year ago in https://github.com/numpy/numpy/pull/4649, so it may be fixed in newer versions.
