Pandas dataframe read_csv on bad data

庸人自扰 2020-12-02 22:13

I want to read in a very large CSV (it cannot easily be opened in Excel and edited), but somewhere around the 100,000th row there is a row with one extra column, causing the program to crash.
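For context, newer versions of pandas can skip malformed rows on their own; a minimal sketch, assuming pandas ≥ 1.3 (the file name big.csv is hypothetical):

    import pandas as pd

    # on_bad_lines="skip" drops rows with too many fields instead of
    # raising a ParserError (pandas >= 1.3; older versions used
    # error_bad_lines=False).
    df = pd.read_csv("big.csv", on_bad_lines="skip")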

3 Answers
  •  执念已碎
    2020-12-02 22:31

    Here is my way to solve this problem. It is slow but works well: simply read the CSV file as a text file and go through it line by line; if a line has fewer commas than it should, skip that row. Finally, save the correct lines.

    def bad_lines(path):
        import itertools

        # Sample the first few lines (skipping any that contain quotes,
        # since quoted fields can hide commas) to learn how many commas
        # a well-formed row should have.
        num_columns = []
        with open(path) as f:
            for line in itertools.islice(f, 10):
                if "'" not in line and '"' not in line:
                    num_columns.append(line.count(","))

        with open(path) as src, open("temp.txt", "w") as dst:
            # Only copy rows when every sampled line agreed on the count.
            if len(set(num_columns)) == 1:
                expected = num_columns[0]
                for line in src:
                    # Keep rows with at least the expected number of
                    # commas; rows with too few are skipped.
                    if line.count(",") >= expected:
                        dst.write(line)

        return "temp.txt"
    
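    A minimal usage sketch (the input name big.csv is hypothetical): the filtered temp file can be handed straight to pandas.

        import pandas as pd

        cleaned = bad_lines("big.csv")  # writes the good rows to temp.txt
        df = pd.read_csv(cleaned)       # parses cleanly now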
