Create Empty Dataframe in Pandas specifying column types
I'm trying to create an empty data frame with an index and specify the column types. The way I am doing it is the following:
df = pd.DataFrame(index=['pbp'],
                  columns=['contract', 'state_and_county_code', 'state', 'county',
                           'starting_membership', 'starting_raw_raf', 'enrollment_trend',
                           'projected_membership', 'projected_raf'],
                  dtype=['str', 'str', 'str', 'str', 'int', 'float', 'float', 'int', 'float'])
I think this is perfect!!
import pandas as pd
c1 = pd.Series(data=None, dtype='string', name='c1')
c2 = pd.Series(data=None, dtype='bool', name='c2')
c3 = pd.Series(data=None, dtype='float', name='c3')
c4 = pd.Series(data=None, dtype='int', name='c4')
df = pd.concat([c1, c2, c3, c4], axis=1)
df.info(verbose=True)
We create the columns as Series with the correct dtype, then we concat the Series into a DataFrame, and that's it. In effect, we have a DataFrame constructor with dtypes!
<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   c1      0 non-null      string
 1   c2      0 non-null      bool
 2   c3      0 non-null      float64
 3   c4      0 non-null      int32
dtypes: bool(1), float64(1), int32(1), string(1)
memory usage: 0.0+ bytes
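If it helps to see that the types survive once data arrives, here is a small follow-up sketch (not part of the original answer; the exact dtype handling when concatenating with an empty frame can vary slightly between pandas versions):
import pandas as pd

# Rebuild the empty, typed frame from the snippet above
df = pd.concat([pd.Series(dtype='string', name='c1'),
                pd.Series(dtype='bool', name='c2'),
                pd.Series(dtype='float', name='c3'),
                pd.Series(dtype='int', name='c4')], axis=1)

# Append one row whose columns already carry matching dtypes
row = pd.DataFrame({'c1': pd.array(['abc'], dtype='string'),
                    'c2': [True],
                    'c3': [1.5],
                    'c4': [7]})
df = pd.concat([df, row], ignore_index=True)

print(df.dtypes)  # string, bool, float64, and a platform-dependent int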
You can do this by passing a dictionary into the DataFrame constructor:
import numpy as np
import pandas as pd

df = pd.DataFrame(index=['pbp'],
                  data={'contract': np.full(1, "", dtype=str),
                        'starting_membership': np.full(1, np.nan, dtype=int),
                        'projected_membership': np.full(1, np.nan, dtype=float)})
This will correctly give you a dataframe that looks like:
     contract  projected_membership   starting_membership
pbp        ""                   NaN  -9223372036854775808
With dtypes:
contract                 object
projected_membership    float64
starting_membership       int64
That said, there are two things to note:
1) str isn't actually a type that a DataFrame column can handle; instead it falls back to the general case object. It'll still work properly.
2) Why don't you see NaN under starting_membership? Well, NaN is only defined for floats; there is no "None" value for integers, so it casts np.nan to an integer. If you want a different default value, you can change that in the np.full call.
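If you do want a real missing value in the integer column, one option (not from the original answer; it assumes a pandas version with nullable integer dtypes, i.e. 0.24+, and pd.NA from 1.0) is the Int64 extension dtype:
import numpy as np
import pandas as pd

# Same dictionary idea, but the integer column uses pandas' nullable
# Int64 dtype so it can hold a missing value instead of the
# -9223372036854775808 sentinel
df = pd.DataFrame(index=['pbp'],
                  data={'contract': np.full(1, "", dtype=str),
                        'starting_membership': pd.array([pd.NA], dtype='Int64'),
                        'projected_membership': np.full(1, np.nan, dtype=float)})

print(df.dtypes)
# contract                 object
# starting_membership       Int64
# projected_membership    float64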
You can do it like this:
import numpy
import pandas
dtypes = numpy.dtype([
    ('a', str),
    ('b', int),
    ('c', float),
    ('d', numpy.datetime64),
])
data = numpy.empty(0, dtype=dtypes)
df = pandas.DataFrame(data)
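For completeness, a quick check of what the structured-dtype route produces (just re-running the snippet above and inspecting the result; the exact integer width depends on the platform):
import numpy
import pandas

dtypes = numpy.dtype([('a', str), ('b', int), ('c', float), ('d', numpy.datetime64)])
df = pandas.DataFrame(numpy.empty(0, dtype=dtypes))

print(df.empty)   # True: zero rows, but the columns are typed
print(df.dtypes)  # str maps to object; int/float use the platform defaults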
You can use the following:
import pandas as pd

df = pd.DataFrame({'a': pd.Series([], dtype='int'),
                   'b': pd.Series([], dtype='str'),
                   'c': pd.Series([], dtype='float')})
Then if you call df, you have:
>>> df
Empty DataFrame
Columns: [a, b, c]
Index: []
and if you check its types
>>> df.dtypes
a int32
b object
c float64
dtype: object
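A related pattern (not from this answer, just a sketch that should work on reasonably recent pandas) is to create the empty frame by column names first and then cast with a dtype mapping:
import pandas as pd

# Build an empty frame, then cast each column via a dtype dict;
# with zero rows the casts are essentially free
df = pd.DataFrame(columns=['a', 'b', 'c']).astype({'a': 'int64',
                                                   'b': 'object',
                                                   'c': 'float64'})
print(df.dtypes)
# a      int64
# b     object
# c    float64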
I found the easiest workaround for me was to simply concatenate a list of empty series for each individual column:
import pandas as pd
columns = ['contract',
           'state_and_county_code',
           'state',
           'county',
           'starting_membership',
           'starting_raw_raf',
           'enrollment_trend',
           'projected_membership',
           'projected_raf']
dtype = ['str', 'str', 'str', 'str', 'int', 'float', 'float', 'int', 'float']
df = pd.concat([pd.Series(name=col, dtype=dt) for col, dt in zip(columns, dtype)], axis=1)
df.info()
# <class 'pandas.core.frame.DataFrame'>
# Index: 0 entries
# Data columns (total 9 columns):
# contract 0 non-null object
# state_and_county_code 0 non-null object
# state 0 non-null object
# county 0 non-null object
# starting_membership 0 non-null int32
# starting_raw_raf 0 non-null float64
# enrollment_trend 0 non-null float64
# projected_membership 0 non-null int32
# projected_raf 0 non-null float64
# dtypes: float64(3), int32(2), object(4)
# memory usage: 0.0+ bytes
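The same column/dtype pairing can also be written without concat, as a dict comprehension over the two lists (just a variant of the snippet above, reusing the dict-of-Series idea from the earlier answer):
import pandas as pd

columns = ['contract', 'state_and_county_code', 'state', 'county',
           'starting_membership', 'starting_raw_raf', 'enrollment_trend',
           'projected_membership', 'projected_raf']
dtype = ['str', 'str', 'str', 'str', 'int', 'float', 'float', 'int', 'float']

# One empty, correctly-typed Series per column, collected into a DataFrame
df = pd.DataFrame({col: pd.Series(dtype=dt) for col, dt in zip(columns, dtype)})
df.info()  # same empty, typed frame as above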