Question
I have a csv that looks like (headers = first row):
name,a,a1,b,b1
arnold,300311,arnld01,300311,arnld01
sam,300713,sam01,300713,sam01
When I run:
df = pd.read_csv('file.csv')
Columns a and b have a .0 attached to the end like so:
df.head()
name,a,a1,b,b1
arnold,300311.0,arnld01,300311.0,arnld01
sam,300713.0,sam01,300713.0,sam01
Columns a and b contain only integers or blanks, so why does pd.read_csv() treat them as floats, and how do I ensure they are read as integers?
Answer 1:
As root mentioned in the comments, this is a limitation of pandas (and NumPy). NaN is a float, and the empty values in your CSV are read as NaN, so any column that contains a blank is stored as float.
This is listed in the gotchas of pandas as well.
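To make the cause concrete, here is a minimal sketch that reads the same column twice, once fully populated and once with a blank cell. The two inline CSV strings are made up for illustration and are not the asker's file.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row samples: identical except for one blank cell in column a.
complete = pd.read_csv(StringIO("name,a\narnold,300311\nsam,300713\n"))
with_blank = pd.read_csv(StringIO("name,a\narnold,300311\ntest,\n"))

print(complete["a"].dtype)              # an integer dtype
print(with_blank["a"].dtype)            # a float dtype: the blank became NaN
print(with_blank["a"].isna().tolist())  # which rows hold NaN
```

A single blank is enough to switch the whole column to float, which is why the .0 appears on every value.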
You can work around this in a few ways.
For the examples below I used the following to import the data. Note that I added a row with an empty value in columns a and b:
import pandas as pd
from io import StringIO  # Python 3 (Python 2 used: from StringIO import StringIO)
data = """name,a,a1,b,b1
arnold,300311,arnld01,300311,arnld01
sam,300713,sam01,300713,sam01
test,,test01,,test01"""
df = pd.read_csv(StringIO(data), sep=",")
Drop NaN rows
Your first option is to drop the rows that contain a NaN value. The downside is that you lose the entire row. After getting your data into a dataframe, run this:
df.dropna(inplace=True)
df.a = df.a.astype(int)
df.b = df.b.astype(int)
This drops every row that contains a NaN from the dataframe, then converts columns a and b to int:
>>> df.dtypes
name    object
a        int32
a1      object
b        int32
b1      object
dtype: object
>>> df
     name       a       a1       b       b1
0  arnold  300311  arnld01  300311  arnld01
1     sam  300713    sam01  300713    sam01
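Because dropna() with no arguments removes a row if any column is empty, it can discard more than intended. A possible refinement, sketched here under the assumption that only blanks in a and b matter, is to pass subset. The "extra" row with a blank a1 is a hypothetical addition to show the difference.

```python
from io import StringIO

import pandas as pd

# Sample data from above plus a hypothetical "extra" row whose a1 is blank.
data = """name,a,a1,b,b1
arnold,300311,arnld01,300311,arnld01
sam,300713,sam01,300713,sam01
test,,test01,,test01
extra,300999,,300999,extra01"""
df = pd.read_csv(StringIO(data))

# Drop only rows where a or b is missing; the blank a1 in "extra" is tolerated.
cleaned = df.dropna(subset=["a", "b"])

# With no NaN left in a and b, the cast to int is safe.
cleaned = cleaned.astype({"a": int, "b": int})
print(cleaned)
```

Here the "test" row is dropped while "extra" is kept, whereas a plain dropna() would remove both.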
Fill NaN with placeholder data
This option replaces all your NaN values with a throwaway value. That value is something you need to determine; for this test, I made it -999999. This allows us to keep the rest of the data, convert it to an int, and make it obvious which data is invalid. You'll be able to filter these rows out if you are making calculations based on the columns later.
df.fillna(-999999, inplace=True)
df.a = df.a.astype(int)
df.b = df.b.astype(int)
This produces a dataframe like so:
>>> df.dtypes
name    object
a        int32
a1      object
b        int32
b1      object
dtype: object
>>> df
     name        a       a1        b       b1
0  arnold   300311  arnld01   300311  arnld01
1     sam   300713    sam01   300713    sam01
2    test  -999999   test01  -999999   test01
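The filtering step mentioned above could look roughly like the following. This is a sketch, not part of the original answer: it fills only columns a and b (so a blank in a string column would not receive the numeric placeholder), and the name PLACEHOLDER is introduced here for illustration.

```python
from io import StringIO

import pandas as pd

PLACEHOLDER = -999999  # illustrative name for the throwaway value used above

data = """name,a,a1,b,b1
arnold,300311,arnld01,300311,arnld01
sam,300713,sam01,300713,sam01
test,,test01,,test01"""
df = pd.read_csv(StringIO(data))

# Fill and convert only the numeric columns; string columns are left alone.
df = df.fillna({"a": PLACEHOLDER, "b": PLACEHOLDER}).astype({"a": int, "b": int})

# Exclude placeholder rows before computing anything on column a.
valid = df[df["a"] != PLACEHOLDER]
print(valid["a"].mean())
```

Forgetting the filter would pull the placeholder into sums and means, so it is worth keeping the mask next to any calculation.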
Leave the float values
Finally, another choice is to leave the float values (and NaN) and not worry about the non-integer data type.
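One further possibility, not covered in the answer above and assuming a reasonably recent pandas release, is the nullable integer dtype "Int64" (capital I). It can hold missing values alongside integers, so the conversion can happen at read time. A minimal sketch with the same sample data:

```python
from io import StringIO

import pandas as pd

data = """name,a,a1,b,b1
arnold,300311,arnld01,300311,arnld01
sam,300713,sam01,300713,sam01
test,,test01,,test01"""

# "Int64" is pandas' nullable integer dtype; blanks become <NA> instead of NaN.
df = pd.read_csv(StringIO(data), dtype={"a": "Int64", "b": "Int64"})

print(df.dtypes)
print(df)
```

The trade-off is that some libraries expecting plain NumPy integer arrays may not accept this dtype directly, so whether it fits depends on what happens to the data afterwards.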
Answer 2:
Converting float values to integers after pandas read_csv (working example)
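# Importing the dataset
import pandas as pd
dataset = pd.read_csv('WorldWarWeather_Data.csv')
# Select columns 3-10 as X and column 2 as y, as NumPy arrays
X = dataset.iloc[:, 3:11].values
y = dataset.iloc[:, 2].values
# Cast to int; this is only meaningful if these columns contain no NaN
X = X.astype(int)
y = y.astype(int)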
Source: https://stackoverflow.com/questions/39666308/pd-read-csv-by-default-treats-integers-like-floats