
Question:

I have a csv file which isn't coming in correctly with pandas.read_csv when I filter the columns with usecols and use multiple indexes.

    import pandas as pd
    csv = r"""dummy,date,loc,x
    bar,20090101,a,1
    bar,20090102,a,3
    bar,20090103,a,5
    bar,20090101,b,1
    bar,20090102,b,3
    bar,20090103,b,5"""

    f = open('foo.csv', 'w')
    f.write(csv)
    f.close()

    df1 = pd.read_csv('foo.csv',
            index_col=["date", "loc"],
            usecols=["dummy", "date", "loc", "x"],
            parse_dates=["date"],
            header=0,
            names=["dummy", "date", "loc", "x"])
    print df1

    # Ignore the dummy columns
    df2 = pd.read_csv('foo.csv',
            index_col=["date", "loc"],
            usecols=["date", "loc", "x"], # <----------- Changed
            parse_dates=["date"],
            header=0,
            names=["dummy", "date", "loc", "x"])
    print df2

I expect df1 and df2 to be the same except for the missing dummy column, but in df2 the columns come in mislabeled. Also, the date is not getting parsed as a date.

    In [118]: %run test.py
                   dummy  x
    date       loc
    2009-01-01 a     bar  1
    2009-01-02 a     bar  3
    2009-01-03 a     bar  5
    2009-01-01 b     bar  1
    2009-01-02 b     bar  3
    2009-01-03 b     bar  5
                  date
    date loc
    a    1    20090101
         3    20090102
         5    20090103
    b    1    20090101
         3    20090102
         5    20090103

Using column numbers instead of names gives me the same problem. I can work around the issue by dropping the dummy column after the read_csv step, but I'm trying to understand what is going wrong. I'm using pandas 0.10.1.

edit: fixed bad header usage.

Answer 1:

The answer by @chip completely misses the point of two keyword arguments.

names is only necessary when there is no header and you want to specify other arguments using column names rather than integer indices.
usecols is supposed to provide a filter before reading the whole DataFrame into memory; if used properly, there should never be a need to delete columns after reading.
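
As a minimal sketch of the first point, assuming a recent pandas on Python 3 (not the 0.10.1 release from the question) and a made-up header-less sample: with header=None the file supplies no labels, so names is what allows usecols to refer to columns by name.

```python
import io

import pandas as pd

# Hypothetical header-less sample: a few rows of the question's data
# without the header line.
HEADERLESS_CSV = """bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
"""

# header=None tells pandas the first line is data; names supplies the
# labels that usecols then refers to.
df_named = pd.read_csv(
    io.StringIO(HEADERLESS_CSV),
    header=None,
    names=["dummy", "date", "loc", "x"],
    usecols=["date", "loc", "x"],
)

print(df_named.columns.tolist())
```

When the file already has a header row, as in the question, names can simply be left out.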

This solution corrects those oddities:

    import pandas as pd
    from StringIO import StringIO

    csv = r"""dummy,date,loc,x
    bar,20090101,a,1
    bar,20090102,a,3
    bar,20090103,a,5
    bar,20090101,b,1
    bar,20090102,b,3
    bar,20090103,b,5"""

    df = pd.read_csv(StringIO(csv),
            header=0,
            index_col=["date", "loc"],
            usecols=["date", "loc", "x"],
            parse_dates=["date"])

Which gives us:

                    x
    date       loc
    2009-01-01 a    1
    2009-01-02 a    3
    2009-01-03 a    5
    2009-01-01 b    1
    2009-01-02 b    3
    2009-01-03 b    5
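
The claim that usecols removes the need to delete columns afterwards can be checked with a short sketch. It assumes a recent pandas on Python 3 rather than 0.10.1, so io.StringIO stands in for the Python 2 StringIO module, and the variable names are illustrative only.

```python
import io

import pandas as pd

CSV_TEXT = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5
"""

# Filter while reading: the dummy column is never loaded.
filtered = pd.read_csv(
    io.StringIO(CSV_TEXT),
    usecols=["date", "loc", "x"],
    index_col=["date", "loc"],
    parse_dates=["date"],
)

# Read every column, then drop dummy afterwards.
dropped = pd.read_csv(
    io.StringIO(CSV_TEXT),
    index_col=["date", "loc"],
    parse_dates=["date"],
).drop(columns=["dummy"])

print(filtered)
# Compare the two frames; they are expected to hold the same data.
print(filtered.equals(dropped))
```

Both routes should produce the same frame. The difference is that the first one never loads the unwanted column at all.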

Answer 2:

This code achieves what you want, although the behaviour is weird and certainly buggy:

I observed that it works when:

a) you specify index_col relative to the number of columns you actually use, so it's three columns in this example, not four (you drop dummy and start counting from there)

b) same for parse_dates

c) not so for usecols ;) since it still has to refer to positions in the original file

d) here I adapted the names to mirror this behaviour

    import pandas as pd
    from StringIO import StringIO

    csv = """dummy,date,loc,x
    bar,20090101,a,1
    bar,20090102,a,3
    bar,20090103,a,5
    bar,20090101,b,1
    bar,20090102,b,3
    bar,20090103,b,5
    """

    df = pd.read_csv(StringIO(csv),
            index_col=[0,1],
            usecols=[1,2,3],
            parse_dates=[0],
            header=0,
            names=["date", "loc", "", "x"])

    print df

which prints

                    x
    date       loc
    2009-01-01 a    1
    2009-01-02 a    3
    2009-01-03 a    5
    2009-01-01 b    1
    2009-01-02 b    3
    2009-01-03 b    5

Answer 3:

If your csv file contains extra data, columns can be deleted from the DataFrame after import.

    import pandas as pd
    from StringIO import StringIO

    csv = r"""dummy,date,loc,x
    bar,20090101,a,1
    bar,20090102,a,3
    bar,20090103,a,5
    bar,20090101,b,1
    bar,20090102,b,3
    bar,20090103,b,5"""

    df = pd.read_csv(StringIO(csv),
            index_col=["date", "loc"],
            usecols=["dummy", "date", "loc", "x"],
            parse_dates=["date"],
            header=0,
            names=["dummy", "date", "loc", "x"])
    del df['dummy']

Which gives us:

                    x
    date       loc
    2009-01-01 a    1
    2009-01-02 a    3
    2009-01-03 a    5
    2009-01-01 b    1
    2009-01-02 b    3
    2009-01-03 b    5

Answer 4:

Import the csv module first and use csv.DictReader; the rows are easy to process that way.
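
A minimal sketch of that approach, using only the standard library on Python 3. The helper name read_x_by_date_loc and the choice to key a dict by (date, loc) are illustrative assumptions, not part of the original answer.

```python
import csv
import io
from datetime import datetime

CSV_TEXT = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090103,a,5
bar,20090101,b,1
bar,20090102,b,3
bar,20090103,b,5
"""


def read_x_by_date_loc(text):
    """Return {(date, loc): x} for each row, ignoring the dummy column."""
    result = {}
    for row in csv.DictReader(io.StringIO(text)):
        # Each row is a dict keyed by the header names.
        day = datetime.strptime(row["date"], "%Y%m%d").date()
        result[(day, row["loc"])] = int(row["x"])
    return result


values = read_x_by_date_loc(CSV_TEXT)
for key in sorted(values):
    print(key, values[key])
```

Unlike read_csv with usecols, this parses every field of every row and simply ignores dummy, which is fine for small files.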

Source: pandas read_csv and filter columns with usecols
