Pandas: how to work around “Error tokenizing data”?

盖世英雄少女心 2021-02-03 10:51

A lot of questions have already been asked about this topic on SO (and many other sites). Among the numerous answers, none has really helped me so far. If I missed …

4 answers
  •  自闭症患者
    2021-02-03 11:21

    Thank you @ALollz for the "very fresh" link (lucky coincidence) and @Rich Andrews for pointing out that my example actually is not "strictly correct" CSV data.

    So, the way it works for me for the time being is adapted from @ALollz's compact solution (https://stackoverflow.com/a/55129746/7295599):

    ### reading an "incorrect" CSV into a dataframe with a variable number of columns/tokens
    import pandas as pd

    # read each physical line as one string
    # (note: newer pandas versions may reject sep='\n'; a version-independent
    # alternative is to read the lines yourself, e.g.
    # pd.Series(open('Test.csv').read().splitlines()))
    df = pd.read_csv('Test.csv', header=None, sep='\n')
    # split each line on the separator; short rows are padded with None
    df = df[0].str.split(',', expand=True)
    # ... do some modifications with df
    ### end of code
    

    df contains the empty string '' for entries missing at the beginning or in the middle of a row, and None for tokens missing at the end.

       0  1  2  3     4     5     6
    0  1  2  3  4     5  None  None
    1  1  2  3  4     5     6  None
    2        3  4     5  None  None
    3  1  2  3  4     5     6     7
    4     2     4  None  None  None
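
    For reference, the split-from-lines trick can be demonstrated end to end on an in-memory sample (the data below is hypothetical, mirroring the table above):

    ```python
    import pandas as pd

    # Hypothetical ragged CSV data, mirroring the table above
    raw = """1,2,3,4,5
    1,2,3,4,5,6
    ,,3,4,5
    1,2,3,4,5,6,7
    ,2,,4"""
    raw = "\n".join(line.strip() for line in raw.splitlines())

    # Treat each physical line as one string, then split on the comma.
    # This bypasses the CSV tokenizer entirely, so ragged rows cannot
    # trigger "Error tokenizing data".
    lines = pd.Series(raw.splitlines())
    df = lines.str.split(',', expand=True)
    print(df)
    ```

    The width of the result is set by the longest row; shorter rows get None on the right, and empty fields inside a row become ''.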
    

    If you write this again to a file via:

    df.to_csv("Test.tab",sep="\t",header=False,index=False)

    1   2   3   4   5       
    1   2   3   4   5   6   
            3   4   5       
    1   2   3   4   5   6   7
        2       4           
    

    None is converted to the empty string '' and everything is fine.

    The next level would be to account for quoted data strings that contain the separator, but that's another topic.

    1,2,3,4,5
    ,,3,"Hello, World!",5,6
    1,2,3,4,5,6,7
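
    A minimal sketch of that next step, assuming the hypothetical rows above: the standard csv module splits each line while honoring quotes, and pd.DataFrame pads the ragged rows with None:

    ```python
    import csv
    import io

    import pandas as pd

    # Hypothetical ragged CSV data with a quoted field containing the separator
    raw = '''1,2,3,4,5
    ,,3,"Hello, World!",5,6
    1,2,3,4,5,6,7
    '''
    raw = "\n".join(line.strip() for line in raw.splitlines())

    # csv.reader respects quoting, so "Hello, World!" stays one token
    rows = list(csv.reader(io.StringIO(raw)))

    # constructing a DataFrame from ragged lists pads short rows with None
    df = pd.DataFrame(rows)
    print(df)
    ```

    This keeps the naive comma-split behavior for plain rows while treating a quoted field as a single token.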
    
