I have a CSV file that has a few hundred rows and 26 columns, but the last few columns only have a value in a few rows, and those rows are towards the middle or end of the file. What is the best way to read this file into a pandas DataFrame with read_csv?
You can use the names parameter. For example, if you have a CSV file like this:
1,2,1
2,3,4,2,3
1,2,3,3
1,2,3,4,5,6
If you try to read it, you'll get an error:
>>> pd.read_csv(r'D:/Temp/tt.csv')
Traceback (most recent call last):
...
Expected 5 fields in line 4, saw 6
But if you pass the names parameter, you'll get this result:
>>> pd.read_csv(r'D:/Temp/tt.csv', names=list('abcdef'))
a b c d e f
0 1 2 1 NaN NaN NaN
1 2 3 4 2 3 NaN
2 1 2 3 3 NaN NaN
3 1 2 3 4 5 6
Hope it helps.
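For the 26-column file described in the question, the same idea would look roughly like this (a minimal sketch; the file name, the lack of a header row, and the generated column names are assumptions, not from the question):

import pandas as pd

# Supplying 26 column names up front makes read_csv pad shorter rows with NaN
# instead of raising a tokenizing error.
cols = ['col%d' % i for i in range(26)]
df = pd.read_csv('data.csv', names=cols)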
Suppose you have a file like this:
a,b,c
1,2,3
1,2,3,4
You could use csv.reader to clean up the file first:

import csv

lines = list(csv.reader(open('file.csv')))
header, values = lines[0], lines[1:]
data = {h: v for h, v in zip(header, zip(*values))}
and get:
{'a' : ('1','1'), 'b': ('2','2'), 'c': ('3', '3')}
If you don't have a header, you could use:
data = {str(i): v for i, v in enumerate(zip(*values))}
and then you can convert the dictionary to a DataFrame with:
import pandas as pd
df = pd.DataFrame.from_dict(data)
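Putting those pieces together, a self-contained sketch (assuming a file named file.csv with a header row, as in the example above):

import csv
import pandas as pd

with open('file.csv', newline='') as f:
    lines = list(csv.reader(f))
header, values = lines[0], lines[1:]
# zip(*values) truncates every column to the shortest row, so extra fields
# in longer rows are silently dropped.
data = {h: v for h, v in zip(header, zip(*values))}
df = pd.DataFrame.from_dict(data)
print(df)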
The problem with the solution above is that you have to know the maximum number of columns in advance. I couldn't find a built-in function for this, but you can write a small function that works it out.
Here is the function I wrote for my (space-separated) files:
import pandas as pd

def ragged_csv(filename):
    # First pass: find the maximum number of fields on any line.
    max_n = 0
    with open(filename) as f:
        for line in f:
            words = len(line.split(' '))
            if words > max_n:
                max_n = words
    # Second pass: supplying max_n column names lets read_csv pad short rows with NaN.
    lines = pd.read_csv(filename, sep=' ', names=range(max_n))
    return lines
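The function above assumes space-separated files. For a comma-separated file like the one in the question, the same two-pass idea would be (a sketch; the file name is an assumption):

import pandas as pd

def ragged_comma_csv(filename):
    # First pass: find the widest row.
    with open(filename) as f:
        max_n = max(len(line.split(',')) for line in f)
    # Second pass: supplying max_n column names pads shorter rows with NaN.
    return pd.read_csv(filename, sep=',', names=range(max_n))

df = ragged_comma_csv('data.csv')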
You can also load the CSV with a separator such as '^' (one that doesn't occur in the data), so that each whole line ends up in a single column, then use str.split to break that string on the real delimiter. After that, you can concat the split columns back onto the original DataFrame if needed.
temp = pd.read_csv('test.csv', sep='^', header=None, prefix='X')  # whole line ends up in column X0
temp2 = temp.X0.str.split(',', expand=True)                       # split into the real columns
del temp['X0']
temp = pd.concat([temp, temp2], axis=1)
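Note that the prefix argument was deprecated and later removed in newer pandas releases; without it, the same approach would look roughly like this (a sketch):

import pandas as pd

temp = pd.read_csv('test.csv', sep='^', header=None)      # whole line lands in column 0
temp2 = temp[0].str.split(',', expand=True)                # split it into real columns
temp = pd.concat([temp.drop(columns=0), temp2], axis=1)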