Create Empty Dataframe in Pandas specifying column types

萌比男神i 2020-11-28 08:39

I'm trying to create an empty data frame with an index and specify the column types. The way I am doing it is the following:

df = pd.DataFrame(index=['pbp'

11 Answers
  • 2020-11-28 08:46

    This really smells like a bug.

    Here's another (simpler) solution.

    import pandas as pd
    import numpy as np
    
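    def df_empty(columns, dtypes, index=None):
        """Build a DataFrame column by column from empty, typed Series."""
        assert len(columns) == len(dtypes)
        df = pd.DataFrame(index=index)
        for c, d in zip(columns, dtypes):
            # An empty Series carries its dtype into the new column.
            df[c] = pd.Series(dtype=d)
        return df
    
    df = df_empty(['a', 'b'], dtypes=[np.int64, np.int64])
    print(list(df.dtypes))  # [dtype('int64'), dtype('int64')]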
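
As a possible variation on the helper above, the same empty typed Series can be handed to the constructor in one step as a dictionary. This is only a sketch; the `schema` mapping and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical schema for illustration: column name -> dtype.
schema = {"a": np.int64, "b": np.float64, "c": "datetime64[ns]"}

# Each empty Series carries its own dtype, so the resulting frame keeps them.
df = pd.DataFrame({name: pd.Series(dtype=dtype) for name, dtype in schema.items()})

print(df.dtypes)
```

This avoids the explicit loop, at the cost of not accepting a separate index argument.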
    
  • 2020-11-28 08:49

    A default (NumPy-backed) integer column in pandas cannot hold missing values, so a column that contains only NaN cannot be int64. You can either use a float column and convert it to integer as needed, or treat it as an object column. What you are trying to implement is not the way pandas is supposed to be used. But if you REALLY REALLY want that, you can get around the TypeError message by doing this:

    df1 =  pd.DataFrame(index=['pbp'], columns=['str1','str2','str2'], dtype=str)
    df2 =  pd.DataFrame(index=['pbp'], columns=['int1','int2'], dtype=int)
    df3 =  pd.DataFrame(index=['pbp'], columns=['flt1','flt2'], dtype=float)
    df = pd.concat([df1, df2, df3], axis=1)
    
        str1 str2 str2 int1 int2  flt1  flt2
    pbp  NaN  NaN  NaN  NaN  NaN   NaN   NaN
    

    You can rearrange the column order as you like. But again, this is not the way pandas was supposed to be used.

     df.dtypes
    str1     object
    str2     object
    str2     object
    int1     object
    int2     object
    flt1    float64
    flt2    float64
    dtype: object
    

    Note that the int columns end up with dtype object.
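
If an integer column that tolerates missing values is really needed, one option not used in this answer is pandas' nullable `Int64` extension dtype. The following is a minimal sketch, assuming a pandas version that provides `pd.NA` (1.0 or later); the column names are reused from the example above.

```python
import pandas as pd

index = ["pbp"]

# "Int64" (capital I) is pandas' nullable integer extension dtype; it stores
# missing values as pd.NA instead of forcing a cast to float or object.
df = pd.DataFrame(
    {
        "int1": pd.Series([pd.NA] * len(index), index=index, dtype="Int64"),
        "flt1": pd.Series([float("nan")] * len(index), index=index, dtype="float64"),
    }
)

print(df.dtypes)
```

Here the integer column keeps an integer dtype even though its only row is missing.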

  • 2020-11-28 08:55

    This is an old question, but I don't see a solid answer (although @eric_g was super close).

    You just need to create an empty dataframe from a dictionary of key:value pairs, where each key is a column name and each value is an empty instance of that column's type (for example '', int(), or float()).

    So in your example dataset, it would look as follows (pandas 0.25 and Python 3.7):

    variables = {'contract':'',
                 'state_and_county_code':'',
                 'state':'',
                 'county':'',
                 'starting_membership':int(),
                 'starting_raw_raf':float(),
                 'enrollment_trend':float(),
                 'projected_membership':int(),
                 'projected_raf':float()}
    
    df = pd.DataFrame(variables, index=[])
    

    In old pandas versions, one may have to do:

    df = pd.DataFrame(columns=[variables])
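
The dictionary of empty values can also be derived from a mapping of column names to Python types, which may be easier to maintain. This is a sketch only; `schema` and `defaults` are names introduced here for illustration, and only three of the columns are shown.

```python
import pandas as pd

# Hypothetical subset of the columns above: column name -> Python type.
schema = {"contract": str, "starting_membership": int, "starting_raw_raf": float}

# Calling each type with no arguments yields its "empty" value: '', 0, 0.0.
defaults = {name: typ() for name, typ in schema.items()}

# index=[] broadcasts each scalar to zero rows, keeping the inferred dtype.
df = pd.DataFrame(defaults, index=[])

print(df.dtypes)
```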
    
  • 2020-11-28 09:00

    Just a remark.

    You can get around the TypeError using np.dtype:

    pd.DataFrame(index = ['pbp'], columns = ['a','b'], dtype = np.dtype([('str','float')]))
    

    but you get instead:

    NotImplementedError: compound dtypes are not implementedin the DataFrame constructor
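
A compound dtype is rejected when passed through the `dtype=` argument, but a zero-length structured NumPy array passed as the data itself appears to work. A minimal sketch, with field names `a` and `b` chosen purely for illustration:

```python
import numpy as np
import pandas as pd

# A zero-length structured array: the field names become column names.
records = np.empty(0, dtype=[("a", "i8"), ("b", "f8")])

# Passed as data (not via dtype=), the compound dtype is unpacked per column.
df = pd.DataFrame(records)

print(df.dtypes)
```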
    
  • 2020-11-28 09:02

    I found this question after running into the same issue. I prefer the following solution (Python 3) for creating an empty DataFrame with no index.

    import numpy as np
    import pandas as pd
    
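    def make_empty_typed_df(dtype):
        # np.sctypeDict replaces np.typeDict, a deprecated alias that recent NumPy releases no longer provide.
        tdict = np.sctypeDict
        # Resolve each (name, type, ...) field to a scalar type.
        types = tuple(tdict.get(t, t) for (_, t, *__) in dtype)
        if any(t == np.void for t in types):
            raise NotImplementedError('Not Implemented for columns of type "void"')
        # Build one row of default values, then slice it away so only the dtypes remain.
        return pd.DataFrame.from_records(np.array([tuple(t() for t in types)], dtype=dtype)).iloc[:0, :]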
    

    Testing this out ...

    from itertools import chain
    
    # np.sctypeDict replaces np.typeDict, a deprecated alias that recent NumPy releases no longer provide.
    dtype = [('col%d' % i, t) for i, t in enumerate(chain(np.sctypeDict, set(np.sctypeDict.values())))]
    dtype = [(c, t) for (c, t) in dtype if (np.sctypeDict.get(t, t) != np.void) and not isinstance(t, int)]
    
    print(make_empty_typed_df(dtype))
    

    Out:

    Empty DataFrame
    
    Columns: [col0, col6, col16, col23, col24, col25, col26, col27, col29, col30, col31, col32, col33, col34, col35, col36, col37, col38, col39, col40, col41, col42, col43, col44, col45, col46, col47, col48, col49, col50, col51, col52, col53, col54, col55, col56, col57, col58, col60, col61, col62, col63, col64, col65, col66, col67, col68, col69, col70, col71, col72, col73, col74, col75, col76, col77, col78, col79, col80, col81, col82, col83, col84, col85, col86, col87, col88, col89, col90, col91, col92, col93, col95, col96, col97, col98, col99, col100, col101, col102, col103, col104, col105, col106, col107, col108, col109, col110, col111, col112, col113, col114, col115, col117, col119, col120, col121, col122, col123, col124, ...]
    Index: []
    
    [0 rows x 146 columns]
    

    And the datatypes ...

    print(make_empty_typed_df(dtype).dtypes)
    

    Out:

    col0      timedelta64[ns]
    col6               uint16
    col16              uint64
    col23                int8
    col24     timedelta64[ns]
    col25                bool
    col26           complex64
    col27               int64
    col29             float64
    col30                int8
    col31             float16
    col32              uint64
    col33               uint8
    col34              object
    col35          complex128
    col36               int64
    col37               int16
    col38               int32
    col39               int32
    col40             float16
    col41              object
    col42              uint64
    col43              object
    col44               int16
    col45              object
    col46               int64
    col47               int16
    col48              uint32
    col49              object
    col50              uint64
                   ...       
    col144              int32
    col145               bool
    col146            float64
    col147     datetime64[ns]
    col148             object
    col149             object
    col150         complex128
    col151    timedelta64[ns]
    col152              int32
    col153              uint8
    col154            float64
    col156              int64
    col157             uint32
    col158             object
    col159               int8
    col160              int32
    col161             uint64
    col162              int16
    col163             uint32
    col164             object
    col165     datetime64[ns]
    col166            float32
    col167               bool
    col168            float64
    col169         complex128
    col170            float16
    col171             object
    col172             uint16
    col173          complex64
    col174         complex128
    dtype: object
    

    Adding an index gets tricky because most data types have no true missing value, so their columns end up being cast to some other type with a native missing value (e.g., ints are cast to floats or objects). If you have complete data of the types you've specified, however, you can always insert rows as needed, and your types will be respected. This can be accomplished with:

    df.loc[index, :] = new_row
    

    Again, as @Hun pointed out, this is NOT how pandas is intended to be used.

  • 2020-11-28 09:02

    My solution (without setting an index) is to initialize a dataframe with column names and then specify the data types using the astype() method.

    df = pd.DataFrame(columns=['contract',
                               'state_and_county_code',
                               'state',
                               'county',
                               'starting_membership',
                               'starting_raw_raf',
                               'enrollment_trend',
                               'projected_membership',
                               'projected_raf'])
    df = df.astype(dtype={'contract': str,
                          'state_and_county_code': str,
                          'state': str,
                          'county': str,
                          'starting_membership': int,
                          'starting_raw_raf': float,
                          'enrollment_trend': float,
                          'projected_membership': int,
                          'projected_raf': float})
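
To avoid listing every column name twice, the same idea can be driven from a single mapping. This is a sketch under the assumption that the mapping's insertion order is the desired column order; `schema` is a name introduced here, and only three columns are shown.

```python
import pandas as pd

# Hypothetical subset of the columns above: column name -> target type.
schema = {"contract": str, "starting_membership": int, "starting_raw_raf": float}

# Dicts preserve insertion order, so list(schema) gives the column order.
df = pd.DataFrame(columns=list(schema)).astype(schema)

print(df.dtypes)
```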
    