Pandas: Location of a row with error

逝去的感伤 2020-12-11 15:13

I am pretty new to Pandas and trying to find out where my code breaks. Say, I am doing a type conversion:

df['x'] = df['x'].astype('int')
3 Answers
  • To report every row that fails to map, whatever the exception:

    df.apply(my_function, axis=1)  # throws various exceptions at unknown rows

    # print the exception, index label, and row content for each failing row
    # (iterating over df directly would yield column names, not rows)
    for i, row in df.iterrows():
        try:
            my_function(row)
        except Exception as e:
            print('Error at index {}: {!r}'.format(i, row))
            print(e)
    
  • 2020-12-11 15:48

    The error you are seeing might be due to the value(s) in the x column being strings:

    In [15]: df = pd.DataFrame({'x':['1.0692e+06']})
    In [16]: df['x'].astype('int')
    ValueError: invalid literal for long() with base 10: '1.0692e+06'
    

    Ideally, the problem can be avoided by making sure the values stored in the DataFrame are already ints not strings when the DataFrame is built. How to do that depends of course on how you are building the DataFrame.

    After the fact, the DataFrame could be fixed using applymap:

    import ast
    # Note: DataFrame.applymap was renamed to DataFrame.map in pandas 2.1
    df = df.applymap(ast.literal_eval).astype('int')
    

    but calling ast.literal_eval on each value in the DataFrame could be slow, which is why fixing the problem from the beginning is the best alternative.
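
If the column holds numeric strings such as '1.0692e+06', a vectorized conversion may be a faster alternative to ast.literal_eval. The following is only a sketch: the two-row frame is hypothetical, it assumes that rounding to the nearest integer is acceptable, and it swaps in pd.to_numeric, which is not what the code above uses.

```python
import pandas as pd

# Hypothetical sample data: numeric values stored as strings
df = pd.DataFrame({'x': ['1.0692e+06', '42']})

# Parse the strings as numbers in one vectorized call,
# then round and cast to a 64-bit integer dtype
df['x'] = pd.to_numeric(df['x']).round().astype('int64')

print(df['x'].tolist())
```

Unlike ast.literal_eval, pd.to_numeric only handles numbers, so it would raise on a column that mixes in other literals.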


    Usually you could drop to a debugger when an exception is raised to inspect the problematic value of row.

    However, in this case the exception is happening inside the call to astype, which is a thin wrapper around C-compiled code. The C-compiled code is doing the looping through the values in df['x'], so the Python debugger is not helpful here -- it won't let you inspect which value triggered the exception inside the C-compiled code.

    There are many important parts of Pandas and NumPy written in C, C++, Cython or Fortran, and the Python debugger will not take you inside those non-Python pieces of code where the fast loops are handled.

    So instead I would fall back on a low-brow solution: iterate through the values in a Python loop and use try...except to report each value that fails:

    df = pd.DataFrame({'x': ['1.0692e+06']})
    for i, item in enumerate(df['x']):
        try:
            int(item)
        except ValueError:
            print('ERROR at index {}: {!r}'.format(i, item))
    

    yields

    ERROR at index 0: '1.0692e+06'
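
When a Python loop is too slow for a large column, one option is to let pandas flag unparseable values in bulk. This is a sketch under assumptions: the sample Series is hypothetical, and pd.to_numeric with errors='coerce' is a more lenient check than int(), so a string like '1.0692e+06' parses as a float and is not flagged. Only values that are not numbers at all are reported.

```python
import pandas as pd

# Hypothetical sample column with two values that are not numbers
s = pd.Series(['1', 'abc', '1.0692e+06', None, '7x'])

# Unparseable values become NaN instead of raising an exception
parsed = pd.to_numeric(s, errors='coerce')

# A row is offending if it held a value originally but failed to parse
bad = s[parsed.isna() & s.notna()]

for idx, value in bad.items():
    print('ERROR at index {}: {!r}'.format(idx, value))
```

The mask excludes rows that were already missing, so genuine gaps in the data are not reported as conversion errors.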
    
  • 2020-12-11 15:50

    I hit the same problem, and since my input file is big (3 million rows), enumerating every row would take a long time. So I wrote a binary search to locate an offending row.

    import pandas as pd
    import sys

    def binarySearch(df, l, r, func):
        # Search positions l..r (inclusive) for a row on which func reports an error
        while l <= r:
            mid = l + (r - l) // 2

            # Check whether the row at mid raises on its own
            result = func(df, mid, mid + 1)
            if result is not None:
                return mid, result

            result = func(df, l, mid)
            if result is None:
                # No exception in the left half, so ignore it
                l = mid + 1
            else:
                r = mid - 1

        # If we reach here, no single row raised an exception
        return -1, None

    def check(df, start, end):
        # Return the error message if converting rows start..end-1 fails, else None
        result = None

        try:
            # In my case, I want to find out which row causes this failure
            df.iloc[start:end].uid.astype(int)
        except Exception as e:
            result = str(e)

        return result

    df = pd.read_csv(sys.argv[1])

    index, result = binarySearch(df, 0, len(df) - 1, check)
    print("index: {}".format(index))
    print(result)
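
For readers who want to try the idea without a CSV file, here is a small self-contained variant. It is a sketch, not the code above: first_bad_row and to_int are hypothetical names, the sample Series is made up, and it assumes a slice fails to convert exactly when at least one of its rows is bad. Under that assumption it returns the position of the first offending row.

```python
import pandas as pd

def first_bad_row(series, convert):
    """Return the position of the first row on which convert fails, or None."""
    def fails(lo, hi):
        # True if converting rows lo..hi-1 raises any exception
        try:
            convert(series.iloc[lo:hi])
            return False
        except Exception:
            return True

    lo, hi = 0, len(series)
    if not fails(lo, hi):
        return None  # the whole column converts cleanly

    # Invariant: rows before lo are clean, and rows lo..hi-1 contain a bad row
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(lo, mid):
            hi = mid   # the first bad row is in the left half
        else:
            lo = mid   # left half is clean, so look in the right half
    return lo

def to_int(chunk):
    # Hypothetical conversion under test
    return chunk.astype(int)

s = pd.Series(['1', '2', 'x', '4', 'y'])
print(first_bad_row(s, to_int))
```

Each step converts one half of the remaining range, so the number of astype calls grows with the logarithm of the row count rather than with the row count itself.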
    