Question
I am working on importing CSV files with numpy.genfromtxt.
The data to be imported has a header of column names, and some of those column names contain characters that genfromtxt considers invalid. Specifically, some of the names contain "#" and " ". The input data cannot be changed, as it is generated by other sources that I do not control.
Using names=True and comments=None, I am unable to bring in all of the column names that I need.
I've tried overriding numpy.lib.NameValidator.deletechars=None, but this does not affect the NameValidator class instance that is actually in use.
I understand that deletechars exists because a recarray can expose a field as if it were an attribute. However, I simply must be able to read in column names that include invalid characters, even if those characters are stripped off when read in.
Is there a way to force the NameValidator not to check for invalid characters, or to modify the characters it checks for? I am unable to modify numpy/lib/_iotools.py, as I am not root and it would be bad to modify a shared installation.
Answer 1:
You do not explicitly state that numpy.genfromtxt is a hard requirement, so let me suggest that you try asciitable.
This module has a way to replace certain entries before parsing: http://cxc.harvard.edu/contrib/asciitable/#replace-bad-or-missing-values
And you can also define your own readers based on the existing ones: http://cxc.harvard.edu/contrib/asciitable/#advanced-table-reading
The asciitable readers return numpy arrays, so you should be able to replace the functions you currently use more or less directly with asciitable.
Answer 2:
NameValidator will use its default set for deletechars if constructed with deletechars=None, but if you pass in a non-None set then it will use that. And np.genfromtxt takes a deletechars parameter which it passes to NameValidator.
So, you should be able to write
np.genfromtxt(..., deletechars=set())
for an empty set, or use some subset of the default set("""~!@#$%^&*()-=+~\|]}[{';: /?.>,<""") instead:
deletechars = np.lib._iotools.NameValidator.defaultdeletechars - set("# ")
np.genfromtxt(..., deletechars=deletechars)
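As a minimal sketch of the approach above, assuming a small made-up CSV (the sample text and column names are illustrative, not taken from the question): comments=None keeps "#" in the header from being read as a comment marker, and replace_space=" " appears to be needed as well, because genfromtxt otherwise converts spaces in names to underscores. NameValidator lives in a private module, so its import path may differ between NumPy versions.

```python
import io

import numpy as np
# Private module: this import path is an assumption that may not hold
# for every NumPy release.
from numpy.lib._iotools import NameValidator

# Hypothetical sample data whose header contains "#" and a space.
csv_text = "id#,my value,temp\n1,2.5,3\n4,5.5,6\n"

# Remove "#" and " " from the set of characters that get deleted.
deletechars = NameValidator.defaultdeletechars - set("# ")

data = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    names=True,
    comments=None,            # do not treat "#" as a comment marker
    deletechars=deletechars,  # keep "#" and " " in field names
    replace_space=" ",        # the default would turn " " into "_"
)

print(data.dtype.names)
print(data["my value"])
```

Fields whose names contain "#" or a space can then only be reached by indexing, such as data["my value"], not by attribute access on a recarray.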
Answer 3:
IMHO, genfromtxt is often used in cases where a simpler solution would do.
So, unless you have some troublesome datasets (missing entries, multiple unknown column types), you're better off coding a quick and dirty parser (i.e., skip some rows, parse the header, read the rest and reorganize at the end).
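A quick and dirty parser along those lines might look like the sketch below. It is only an illustration: read_csv_keep_names is a hypothetical helper, the sample data is made up, and it assumes every column is numeric, no values are missing, and the header names are unique (a structured dtype rejects duplicate field names).

```python
import csv
import io

import numpy as np


def read_csv_keep_names(source, delimiter=","):
    """Read a CSV with one header row into a structured array.

    Header names are used as field names exactly as the csv module
    returns them, so characters such as "#" and " " are preserved.
    """
    reader = csv.reader(source, delimiter=delimiter)
    names = next(reader)  # parse the header row
    dtype = np.dtype([(name, float) for name in names])
    # Read the rest, skipping blank lines, one tuple per record.
    rows = [tuple(float(value) for value in row) for row in reader if row]
    return np.array(rows, dtype=dtype)


# Hypothetical sample data whose header contains "#" and a space.
csv_text = "id#,my value,temp\n1,2.5,3\n4,5.5,6\n"
table = read_csv_keep_names(io.StringIO(csv_text))

print(table.dtype.names)
print(table["my value"])
```

Because the header never passes through NameValidator, nothing is stripped or renamed; the trade-off is that type detection and missing-value handling have to be written by hand.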
Now, if you really need genfromtxt, @ecatmur rightly pointed out that the deletechars argument of genfromtxt is passed to _iotools.NameValidator to construct the set of characters to delete. Using deletechars=None tells NameValidator to use a default set. A first thing to try is to pass an empty set or '' instead of deletechars=None.
Note that no matter what, double quotes (") and trailing spaces will be deleted, and names that become identical will be differentiated with a numeric suffix:
>>> import numpy as np
>>> fields = ["blah", "'blah'", "\"blah\"", "#blah", "blah "]
>>> np.lib._iotools.NameValidator(deletechars='').validate(fields)
('blah', "'blah'", 'blah_1', '#blah', 'blah_2')
After cleaning, the first, third and last entries would all be named blah, which would give three columns with the same name, so the duplicates are renamed blah_1 and blah_2.
If this doesn't suit you, I'm afraid you're hitting a block: there's no current way to tell genfromtxt to accept a customized NameValidator. That could be a good idea, though, so you may want to raise the point on numpy's mailing list.
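One possible way around that limitation, offered only as a sketch and not as something the answers above proposed, is to let genfromtxt load the data under placeholder names and put the raw header names back afterwards with numpy.lib.recfunctions.rename_fields. The sample data and the f0, f1, ... placeholders are assumptions made for illustration, and the raw names still have to be unique.

```python
import csv
import io

import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical sample data whose header contains "#" and a space.
csv_text = "id#,my value,temp\n1,2.5,3\n4,5.5,6\n"

# Read the raw header separately so it never passes through NameValidator.
raw_names = next(csv.reader(io.StringIO(csv_text)))

# Placeholder names that name validation leaves untouched.
placeholders = ["f%d" % i for i in range(len(raw_names))]

loaded = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    skip_header=1,       # the header was handled above
    names=placeholders,
)

# Map each placeholder back to the original, unmodified column name.
renamed = rfn.rename_fields(loaded, dict(zip(placeholders, raw_names)))

print(renamed.dtype.names)
```

This keeps genfromtxt's handling of the data rows while bypassing its name validation entirely.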
Source: https://stackoverflow.com/questions/11840322/python-numpy-genfromtxt-need-column-names-that-contain-invalid-characters