Python Pandas Error tokenizing data

不知归路 2020-11-22 04:49

I'm trying to use pandas to manipulate a .csv file but I get this error:

pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 field

30 answers
  •  無奈伤痛
    2020-11-22 05:25

    I've had a similar problem while trying to read a tab-delimited table with spaces, commas and quotes:

    1115794 4218    "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", ""
    1144102 3180    "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", "g__Bacillus", ""
    368444  2328    "k__Bacteria", "p__Bacteroidetes", "c__Bacteroidia", "o__Bacteroidales", "f__Bacteroidaceae", "g__Bacteroides", ""
    
    
    
    import pandas as pd
    # Same error for read_table
    counts = pd.read_csv(path_counts, sep='\t', index_col=2, header=None, engine = 'c')
    
    pandas.io.common.CParserError: Error tokenizing data. C error: out of memory
    

    The message suggests the problem lies in the C parsing engine (which is the default one). Maybe switching to the Python engine will change something:

    counts = pd.read_table(path_counts, sep='\t', index_col=2, header=None, engine='python')
    
    Segmentation fault (core dumped)
    

    Now that is a different error.
    If we go ahead and remove the spaces from the table, the error from the Python engine changes once again:

    1115794 4218    "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae",""
    1144102 3180    "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae","g__Bacillus",""
    368444  2328    "k__Bacteria","p__Bacteroidetes","c__Bacteroidia","o__Bacteroidales","f__Bacteroidaceae","g__Bacteroides",""
    
    
    _csv.Error: '   ' expected after '"'
    

    Now it is clear that pandas was having trouble parsing our rows. To parse the table with the Python engine I had to remove all spaces and quotes from it beforehand. Meanwhile, the C engine kept crashing even with commas in the rows.
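
    To see which rows the tokenizer objects to, one option is to feed each line to Python's own `csv` module in strict mode. This is only a sketch: the helper name `find_rejected_rows` and the sample rows are hypothetical, and it assumes the `_csv.Error` above comes from a strict `csv` reader, which the wording of the message suggests.

```python
import csv


def find_rejected_rows(lines, delimiter="\t"):
    """Return (line_number, message) for every row a strict csv reader rejects."""
    rejected = []
    for number, line in enumerate(lines, start=1):
        try:
            # strict=True makes csv.reader raise csv.Error on malformed quoting
            # instead of silently accepting the stray characters.
            next(csv.reader([line], delimiter=delimiter, strict=True))
        except csv.Error as exc:
            rejected.append((number, str(exc)))
    return rejected


# Hypothetical rows: the first closes a quote and then continues with a comma,
# the second contains no quotes at all.
sample = [
    '1115794\t4218\t"k__Bacteria","p__Firmicutes",""',
    "368444\t2328\tk__Bacteria,p__Bacteroidetes",
]
rejected = find_rejected_rows(sample)
for number, message in rejected:
    print(f"line {number}: {message}")
```

    Only the first sample row is reported, because a closing quote followed by a comma is not valid when the delimiter is a tab.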

    To avoid creating a new file with the replacements, I did this, since my tables are small:

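    from io import StringIO
    with open(path_counts) as f:
        # Drop the trailing empty field, quotes, spaces after commas and NUL bytes
        # in memory (fine for small tables); avoid shadowing the built-in input().
        cleaned = StringIO(f.read().replace('", ""', '').replace('"', '').replace(', ', ',').replace('\0', ''))
    counts = pd.read_table(cleaned, sep='\t', index_col=2, header=None, engine='python')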
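
    A possible alternative to editing the text before parsing is to ask pandas not to treat `"` as a quote character at all, via `quoting=csv.QUOTE_NONE`, and to clean the column afterwards. This is a different technique from the replacement approach, shown only as a sketch. The sample rows are hypothetical and shortened, and I have not checked it against the original file that crashed.

```python
import csv
from io import StringIO

import pandas as pd

# Hypothetical sample in the same shape as the table: tab-delimited, with a
# third column holding a quoted, comma-separated lineage.
raw = (
    '1115794\t4218\t"k__Bacteria", "p__Firmicutes", ""\n'
    '368444\t2328\t"k__Bacteria", "p__Bacteroidetes", ""\n'
)

# QUOTE_NONE makes the parser treat '"' as an ordinary character, so the
# third column is read verbatim, quotes included.
counts = pd.read_csv(StringIO(raw), sep="\t", header=None, quoting=csv.QUOTE_NONE)

# Clean the lineage column after parsing: drop quotes, drop spaces after
# commas, and strip the trailing comma left by the empty last entry.
counts[2] = (
    counts[2]
    .str.replace('"', "", regex=False)
    .str.replace(", ", ",", regex=False)
    .str.rstrip(",")
)
print(counts)
```

    The cleaned column can then be made the index with `counts.set_index(2)`, which mirrors `index_col=2` in the calls above.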
    

    tl;dr
    Change the parsing engine, and try to avoid any non-delimiting quotes, commas or spaces in your data.
