Pandas dataframe read_csv on bad data

庸人自扰 2020-12-02 22:13

I want to read in a very large CSV (it cannot easily be opened in Excel and edited), but somewhere around the 100,000th row there is a row with one extra column, causing the program to crash.
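For context, newer versions of pandas can skip malformed rows on their own; a minimal sketch, assuming pandas ≥ 1.3 (the file name big.csv is hypothetical):

    import pandas as pd

    # on_bad_lines="skip" drops rows with too many fields instead of
    # raising a ParserError (pandas >= 1.3; older versions used
    # error_bad_lines=False).
    df = pd.read_csv("big.csv", on_bad_lines="skip")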

3 Answers
  •  执念已碎
    2020-12-02 22:31

    Here is my way to solve this problem. It is slow but works well: simply read the CSV file as a text file and go through it line by line; if a line has fewer commas than it should, skip that row. Finally, save the correct lines.

    def bad_lines(path):
        import itertools

        # Sample the first few lines (skipping any that contain quotes,
        # since quoted fields can hide commas) to learn how many commas
        # a well-formed row should have.
        num_columns = []
        with open(path) as f:
            for line in itertools.islice(f, 10):
                if "'" not in line and '"' not in line:
                    num_columns.append(line.count(","))

        with open(path) as src, open("temp.txt", "w") as dst:
            # Only copy rows when every sampled line agreed on the count.
            if len(set(num_columns)) == 1:
                expected = num_columns[0]
                for line in src:
                    # Keep rows with at least the expected number of
                    # commas; rows with too few are skipped.
                    if line.count(",") >= expected:
                        dst.write(line)

        return "temp.txt"
    
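    A minimal usage sketch (the input name big.csv is hypothetical): the filtered temp file can be handed straight to pandas.

        import pandas as pd

        cleaned = bad_lines("big.csv")  # writes the good rows to temp.txt
        df = pd.read_csv(cleaned)       # parses cleanly now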
