I have a CSV file with a column of strings that I want to read with pandas. In this file the string null
occurs as an actual value and should not be regarded as a missing value.
The reason this happens is that the string 'null' is treated as NaN on parsing. You can turn this off by passing keep_default_na=False, in addition to @coldspeed's answer:
In[49]:
import io
import pandas as pd
data = u'strings,numbers\nfoo,1\nbar,2\nnull,3'
df = pd.read_csv(io.StringIO(data), keep_default_na=False)
df
Out[49]:
  strings  numbers
0     foo        1
1     bar        2
2    null        3
The full list is:
na_values : scalar, str, list-like, or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
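One side effect worth knowing about: keep_default_na=False switches off every entry in that list, not just 'null'. A minimal sketch, assuming a hypothetical sample in which an empty field has been added to the data above:

```python
import io
import pandas as pd

# Hypothetical sample: the second row has an empty 'strings' field.
data = 'strings,numbers\nfoo,1\n,2\nnull,3'

# Default parsing: both the empty field and 'null' become NaN.
df_default = pd.read_csv(io.StringIO(data))
print(df_default['strings'].isna().tolist())

# keep_default_na=False: 'null' survives, but so does the empty string.
df_keep = pd.read_csv(io.StringIO(data), keep_default_na=False)
print(df_keep['strings'].tolist())
```

So if empty fields should still count as missing, this flag alone is probably too blunt.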
UPDATE: 2020-03-23 for Pandas 1+:
Many thanks to @aiguofer for the adapted solution:
na_vals = pd.io.parsers.STR_NA_VALUES.difference({'NULL','null'})
df = pd.read_csv(io.StringIO(data), na_values=na_vals, keep_default_na=False)
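Since the location of this set has moved between pandas versions (compare the old answer below), another option is to spell out the NA strings you still want. A rough sketch, assuming a hypothetical sample and a hand-picked subset of the defaults; adjust the list to your data:

```python
import io
import pandas as pd

# Hypothetical sample: 'null' is a real value, 'NA' marks a missing one.
data = 'strings,numbers\nfoo,1\nnull,2\nNA,3'

# Hand-picked subset of the default NA strings, without 'NULL' and 'null'.
na_vals = ['', 'NA', 'N/A', 'NaN', 'nan', '#N/A']

# keep_default_na=False makes na_vals the complete list instead of an addition.
df = pd.read_csv(io.StringIO(data), na_values=na_vals, keep_default_na=False)
print(df)
```

This does not depend on any internal attribute, at the cost of maintaining the list yourself.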
Old answer:
We can dynamically exclude 'NULL' and 'null' from the set of default _NA_VALUES:
In [4]: na_vals = pd.io.common._NA_VALUES.difference({'NULL','null'})
In [5]: na_vals
Out[5]:
{'',
'#N/A',
'#N/A N/A',
'#NA',
'-1.#IND',
'-1.#QNAN',
'-NaN',
'-nan',
'1.#IND',
'1.#QNAN',
'N/A',
'NA',
'NaN',
'n/a',
'nan'}
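and use it in read_csv(), together with keep_default_na=False so that the defaults are not appended back to na_vals:
df = pd.read_csv(io.StringIO(data), na_values=na_vals, keep_default_na=False)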
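Other answers are better for reading in a CSV without "null" being interpreted as NaN, but if you already have a DataFrame that you want "fixed", this code will do so: df = df.fillna('null')
Note that this replaces every NaN in the frame, including values that were genuinely missing.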
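If only one column is affected, a narrower variant is to fill just that column, so that real gaps elsewhere stay NaN. A small sketch, assuming the column is called strings as in the question and a hypothetical missing number in the second row:

```python
import io
import pandas as pd

# Hypothetical sample: 'null' is a real string, the second number is missing.
data = 'strings,numbers\nfoo,1\nbar,\nnull,3'

# Default parsing: 'null' and the empty number are both NaN at this point.
df = pd.read_csv(io.StringIO(data))

# Restore the literal string only in the 'strings' column.
df['strings'] = df['strings'].fillna('null')
print(df)
```

This cannot tell an originally empty field from a literal null, since both arrive as NaN.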
You can specify a converters argument for the string column.
from io import StringIO
pd.read_csv(StringIO(data), converters={'strings': str})
  strings  numbers
0     foo        1
1     bar        2
2    null        3
This bypasses pandas' automatic parsing for that column.
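Note that dtype=str is not a substitute here: as far as I can tell, NA detection still runs with an explicit dtype, whereas a converter receives the raw text. A quick comparison, using the same sample data as above:

```python
import io
import pandas as pd

data = 'strings,numbers\nfoo,1\nbar,2\nnull,3'

# dtype=str: 'null' is still matched against the default NA strings.
df_dtype = pd.read_csv(io.StringIO(data), dtype={'strings': str})
print(df_dtype['strings'].isna().tolist())

# converters: the raw field text is passed to str() and kept as is.
df_conv = pd.read_csv(io.StringIO(data), converters={'strings': str})
print(df_conv['strings'].tolist())
```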
Another option is setting na_filter=False:
pd.read_csv(StringIO(data), na_filter=False)
  strings  numbers
0     foo        1
1     bar        2
2    null        3
This applies to the entire DataFrame, so use with caution. I recommend the first option if you want to surgically apply this to select columns instead.
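To make the trade-off concrete, here is a sketch with a hypothetical sample in which a number is genuinely missing:

```python
import io
import pandas as pd

# Hypothetical sample: 'null' is a real string, the second number is missing.
data = 'strings,numbers\nfoo,1\nnull,\nbar,3'

# na_filter=False: nothing is treated as missing, so the empty number stays
# an empty string and the column is no longer numeric.
df_nofilter = pd.read_csv(io.StringIO(data), na_filter=False)
print(df_nofilter['numbers'].tolist())

# converters on one column: 'null' is kept, the missing number is still NaN.
df_conv = pd.read_csv(io.StringIO(data), converters={'strings': str})
print(df_conv)
```

With na_filter=False the numbers column would need a separate cleanup step before any arithmetic, which is why the per-column converter tends to be the safer choice for mixed data.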