I want to read in a very large CSV (too big to open and edit easily in Excel), but somewhere around the 100,000th row there is a row with one extra column, which breaks the parse.
Here is my way to solve this problem. It is slow, but it works reliably. Simply put: read the CSV file as a plain text file and go through it line by line; if a line has fewer commas than it should, skip it, and finally save the correct lines to a new file.
import itertools

def bad_lines(path):
    # Sample every 5th line among the first 50 to learn the expected
    # comma count, skipping lines that contain quotes (a quoted field
    # could hide commas and throw the count off).
    num_columns = []
    with open(path) as f:
        for line in itertools.islice(f, 0, 50, 5):
            if line.count("'") == 0 and line.count('"') == 0:
                num_columns.append(line.count(","))
    # Only proceed if every sampled line agrees on the comma count.
    if len(set(num_columns)) == 1:
        # Stream line by line so the whole file is never held in memory.
        with open(path) as src, open("temp.txt", "w") as dst:
            for line in src:
                # Skip lines with fewer commas than expected.
                if line.count(",") >= num_columns[0]:
                    dst.write(line)
        return "temp.txt"
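For completeness, here is a self-contained sketch of the same idea using the standard-library csv module instead of raw comma counting. Unlike counting "," characters, csv.reader handles commas inside quoted fields correctly. The function name filter_bad_rows and the demo file names are made up for illustration:

```python
import csv
import os
import tempfile

def filter_bad_rows(src_path, dst_path, expected_cols):
    # Copy only rows whose parsed field count matches the expected
    # column count; return how many rows were kept and dropped.
    kept = dropped = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if len(row) == expected_cols:
                writer.writerow(row)
                kept += 1
            else:
                dropped += 1
    return kept, dropped

# Demo with a tiny CSV containing one row with an extra column.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "big.csv")
    dst = os.path.join(d, "clean.csv")
    with open(src, "w") as f:
        f.write("a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n")
    kept, dropped = filter_bad_rows(src, dst, 3)
    print(kept, dropped)  # -> 3 1
```

Because it streams row by row, this stays memory-friendly even for files far too large for Excel.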