Understanding pandas.read_csv() float parsing

给你一囗甜甜゛ submitted on 2019-12-06 11:18:36

@MaxU already showed the source code for the parser and the relevant tokenizer xstrtod so I'll focus on the "why" part:

The code for xstrtod is roughly like this (translated to pure Python):

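def xstrtod(p):
    number = 0.
    idx = 0
    ndecimals = 0

    # integer part: accumulate digits in double precision
    while idx < len(p) and p[idx].isdigit():
        number = number * 10. + int(p[idx])
        idx += 1

    # skip the decimal point
    idx += 1

    # fractional part: keep accumulating and count the decimals
    while idx < len(p) and p[idx].isdigit():
        number = number * 10. + int(p[idx])
        idx += 1
        ndecimals += 1

    # scale back by the number of decimals that were read
    return number / 10**ndecimals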

Which reproduces the "problem" you saw:

print(xstrtod('0.99999999999999997'))  # 1.0
print(xstrtod('0.99999999999999998'))  # 1.0
print(xstrtod('0.99999999999999999'))  # 1.0000000000000002
print(xstrtod('1.00000000000000000'))  # 1.0
print(xstrtod('1.00000000000000001'))  # 1.0
print(xstrtod('1.00000000000000002'))  # 1.0
print(xstrtod('1.00000000000000003'))  # 1.0
print(xstrtod('1.00000000000000004'))  # 1.0
print(xstrtod('1.00000000000000005'))  # 1.0
print(xstrtod('1.00000000000000006'))  # 1.0
print(xstrtod('1.00000000000000007'))  # 1.0
print(xstrtod('1.00000000000000008'))  # 1.0
print(xstrtod('1.00000000000000009'))  # 1.0000000000000002
print(xstrtod('1.00000000000000019'))  # 1.0000000000000002

The 9 in the last place is what changes the result, which points to floating-point rounding while the digits are accumulated:

>>> float('100000000000000008')
1e+17
>>> float('100000000000000009')
1.0000000000000002e+17

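Near 1e17 adjacent doubles are 16 apart, so 100000000000000008 (an exact tie) rounds down to 1e17, while 100000000000000009 rounds up to 1e17 + 16. Dividing by 10**17 afterwards turns that into 1.0000000000000002, which is the skewed result.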
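The rounding can be checked with plain Python. The sketch below is only an illustration (the helper name accumulate is made up here, it is not part of pandas): it accumulates the digits step by step, the way the simplified xstrtod above does, and compares that with converting the whole digit string at once. For seventeen 9s the stepwise version already rounds up after the sixteenth digit, so it ends one step too high.

```python
def accumulate(digits):
    """Accumulate decimal digits left to right in double precision.

    Hypothetical helper that mirrors the loop of the simplified
    xstrtod; every multiply-add is rounded to the nearest double.
    """
    number = 0.0
    for ch in digits:
        number = number * 10.0 + int(ch)
    return number


for digits in ("99999999999999999", "100000000000000008", "100000000000000009"):
    stepwise = int(accumulate(digits))  # rounded after every digit
    direct = int(float(digits))         # rounded once, at the end
    print(digits, stepwise, direct)

# 99999999999999999 100000000000000016 100000000000000000
# 100000000000000008 100000000000000000 100000000000000000
# 100000000000000009 100000000000000016 100000000000000016
```

The first line is the 0.99999999999999999 case: converting in one go would give 1e17 (and therefore 1.0), while the digit-by-digit accumulation lands on 1e17 + 16.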


If you want high precision you can define your own converters or use ones that Python provides, for example decimal.Decimal if you want arbitrary precision:

>>> import io
>>> import decimal
>>> import pandas as pd
>>> converter = {0: decimal.Decimal}  # parse column 0 as decimals
>>> def parse(string):
...     return '{:.30f}'.format(pd.read_csv(io.StringIO(string), converters=converter)["column"][0])
...
>>> print(parse("column\n0.99999999999999998"))
>>> print(parse("column\n0.99999999999999999"))
>>> print(parse("column\n1.00000000000000000"))
>>> print(parse("column\n1.00000000000000001"))
>>> print(parse("column\n1.00000000000000008"))
>>> print(parse("column\n1.00000000000000009"))

which prints:

0.999999999999999980000000000000
0.999999999999999990000000000000
1.000000000000000000000000000000
1.000000000000000010000000000000
1.000000000000000080000000000000
1.000000000000000090000000000000

Exactly representing the input!
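The reason the converter helps is that decimal.Decimal built from a string keeps every digit, whereas a float has to round to the nearest double. A small standard-library sketch of that difference, without pandas (the variable names are just for illustration):

```python
from decimal import Decimal

text = "0.99999999999999999"

# Decimal stores the decimal digits exactly as written.
as_decimal = "{:.30f}".format(Decimal(text))

# float rounds to the nearest double, which here is exactly 1.0.
as_float = "{:.30f}".format(float(text))

print(as_decimal)  # 0.999999999999999990000000000000
print(as_float)    # 1.000000000000000000000000000000
```

The price is that the column then holds Python objects rather than float64 values, so arithmetic on it will be slower.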

If you want to understand how pandas picks the converter, look at the source code, file "_libs/parsers.pyx", lines 492-499 for Pandas 0.20.1:

    self.parser.double_converter_nogil = xstrtod  # <------- default converter 
    self.parser.double_converter_withgil = NULL
    if float_precision == 'high':
        self.parser.double_converter_nogil = precise_xstrtod # <------- 'high' converter
        self.parser.double_converter_withgil = NULL
    elif float_precision == 'round_trip':  # avoid gh-15140
        self.parser.double_converter_nogil = NULL
        self.parser.double_converter_withgil = round_trip

Source code for xstrtod

Source code for precise_xstrtod
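The float_precision switch in that snippet is exposed as a keyword argument of read_csv. A minimal sketch of how it could be used, assuming the default C parser engine (the CSV text and the helper read_column are made up for this example): with 'round_trip' the values should agree with what Python's own float() returns for the same strings.

```python
import io

import pandas as pd

# Two of the values from the question, as a one-column CSV.
CSV_TEXT = "column\n0.99999999999999999\n1.00000000000000009\n"


def read_column(float_precision=None):
    """Parse CSV_TEXT and return the parsed floats as a list.

    Hypothetical helper; float_precision is passed straight
    through to pandas.read_csv.
    """
    frame = pd.read_csv(io.StringIO(CSV_TEXT), float_precision=float_precision)
    return frame["column"].tolist()


round_trip = read_column("round_trip")
expected = [float("0.99999999999999999"), float("1.00000000000000009")]

print(round_trip)          # parsed with the round-trip converter
print(round_trip == expected)
print(read_column())       # default converter, for comparison
```

What the default call prints depends on the installed pandas version, so no output is shown for it here.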
