What is dtype('O') in pandas?

我在风中等你 2020-12-02 07:08

I have a dataframe in pandas and I'm trying to figure out what the types of its values are. I am unsure what the type is of column 'Test'. However, when I ru

4 answers
  •  抹茶落季
    2020-12-02 07:58

    When you see dtype('O') inside a dataframe, it means the pandas object dtype, which in practice usually means the column holds Python strings.

    What is dtype?

    Something that belongs to pandas or numpy, or both, or something else? If we examine pandas code:

    import pandas as pd

    df = pd.DataFrame({'float': [1.0],
                        'int': [1],
                        'datetime': [pd.Timestamp('20180310')],
                        'string': ['foo']})
    print(df)
    print(df['float'].dtype,df['int'].dtype,df['datetime'].dtype,df['string'].dtype)
    df['string'].dtype
    

    The output looks like this:

       float  int   datetime string    
    0    1.0    1 2018-03-10    foo
    ---
    float64 int64 datetime64[ns] object
    ---
    dtype('O')
    

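    You can read the last line as pandas dtype('O'), i.e. the pandas object dtype. It is most commonly seen on columns of Python str values, which is why it is often equated with NumPy's string_ or unicode_ types, but an object column actually stores references to arbitrary Python objects, so it can hold mixed types too.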

    Pandas dtype    Python type     NumPy type          Usage
    object          str             string_, unicode_   Text
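
    To check that dtype('O') is NumPy's generic object dtype rather than a dedicated string type, here is a minimal sketch. The Series names and values are made up for illustration, and dtype=object is passed explicitly on the assumption that some pandas versions may infer a dedicated string dtype for all-string data.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; dtype=object is requested explicitly.
text = pd.Series(["foo", "bar"], dtype=object)

# Values of unrelated Python types cannot share a numeric dtype,
# so pandas falls back to object.
mixed = pd.Series([1, "two", [3, 4]])

print(text.dtype == np.dtype("O"))        # compare against NumPy's object dtype
print(mixed.dtype == object)              # the Python builtin `object` also compares equal
print(np.dtype("O") == np.dtype(object))  # 'O' is just the type code for object
print([type(v).__name__ for v in mixed])  # the elements keep their Python types
```

    The last line shows that each element is still an ordinary Python object, which is why an object column is flexible but slower than a native numeric column.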
    

    Pandas rides on NumPy the way Don Quixote rides Rocinante: NumPy understands the underlying architecture of your system and uses the class numpy.dtype to describe it.

    A data type object is an instance of the numpy.dtype class that describes the data type more precisely, including:

    • Type of the data (integer, float, Python object, etc.)
    • Size of the data (how many bytes are in e.g. the integer)
    • Byte order of the data (little-endian or big-endian)
    • If the data type is structured, an aggregate of other data types, (e.g., describing an array item consisting of an integer and a float)
    • What are the names of the "fields" of the structure
    • What is the data-type of each field
    • Which part of the memory block each field takes
    • If the data type is a sub-array, what is its shape and data type
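
    The properties in the list above can be inspected directly on numpy.dtype instances. This is a small sketch, not part of the original answer; the field names count and value are hypothetical.

```python
import numpy as np

# Type, size and byte order of a simple dtype.
i4 = np.dtype("<i4")          # little-endian 32-bit integer
print(i4.kind)                # type category code ('i' = signed integer)
print(i4.itemsize)            # size in bytes
print(i4.byteorder)           # byte-order code ('=' means native on this machine)

# Structured dtype: an aggregate of an integer and a float.
rec = np.dtype([("count", "<i4"), ("value", "<f8")])
print(rec.names)              # names of the fields
print(rec.fields["value"])    # (dtype of the field, byte offset in the record)
print(rec.itemsize)           # total bytes per record

# Sub-array dtype: each item is itself a 2x3 block of float32.
sub = np.dtype(("<f4", (2, 3)))
print(sub.shape, sub.base)

# The object dtype discussed in this question.
obj = np.dtype("O")
print(obj.kind)               # 'O' = Python object
```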

    In the context of this question, dtype belongs to both pandas and NumPy, and dtype('O') in particular means the column holds Python objects, typically strings.


    Here is some code for testing, with explanation. Suppose we have the dataset as a dictionary:

    import pandas as pd
    import numpy as np
    from pandas import Timestamp
    
    data={'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'date': {0: Timestamp('2018-12-12 00:00:00'), 1: Timestamp('2018-12-12 00:00:00'), 2: Timestamp('2018-12-12 00:00:00'), 3: Timestamp('2018-12-12 00:00:00'), 4: Timestamp('2018-12-12 00:00:00')}, 'role': {0: 'Support', 1: 'Marketing', 2: 'Business Development', 3: 'Sales', 4: 'Engineering'}, 'num': {0: 123, 1: 234, 2: 345, 3: 456, 4: 567}, 'fnum': {0: 3.14, 1: 2.14, 2: -0.14, 3: 41.3, 4: 3.14}}
    df = pd.DataFrame.from_dict(data) #now we have a dataframe
    
    print(df)
    print(df.dtypes)
    

    The last two lines print the dataframe and its dtypes; note the output:

       id       date                  role  num   fnum
    0   1 2018-12-12               Support  123   3.14
    1   2 2018-12-12             Marketing  234   2.14
    2   3 2018-12-12  Business Development  345  -0.14
    3   4 2018-12-12                 Sales  456  41.30
    4   5 2018-12-12           Engineering  567   3.14
    id               int64
    date    datetime64[ns]
    role            object
    num              int64
    fnum           float64
    dtype: object
    

    All kinds of different dtypes. Now set some rows to missing values:

    df.iloc[1,:] = np.nan
    df.iloc[2,:] = None
    

    Setting np.nan or None does not affect the dtype of the float64, datetime64[ns], or object columns, but the int64 columns (id and num) are upcast to float64 because NaN is a float. The output looks like this:

    print(df)
    print(df.dtypes)
    
        id       date         role    num   fnum
    0  1.0 2018-12-12      Support  123.0   3.14
    1  NaN        NaT          NaN    NaN    NaN
    2  NaN        NaT         None    NaN    NaN
    3  4.0 2018-12-12        Sales  456.0  41.30
    4  5.0 2018-12-12  Engineering  567.0   3.14
    id             float64
    date    datetime64[ns]
    role            object
    num            float64
    fnum           float64
    dtype: object
    

    So np.nan or None changes a column's dtype only when the column cannot represent a missing value, as with int64 becoming float64 here. Beyond that, setting all rows of a column to np.nan or None may turn the column into float64 or object respectively.
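
    The effect of a missing value on each kind of column can be reproduced in isolation. The following is a sketch with made-up Series; it uses reindex with a label that does not exist (5) as one simple way to introduce a missing entry, which is an illustrative choice and not what the code above does.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
floats = pd.Series([1.5, 2.5, 3.5])
dates = pd.Series(pd.to_datetime(["2018-12-12", "2018-12-13", "2018-12-14"]))

# Label 5 is absent, so the third entry of each result is missing.
ints_gap = ints.reindex([0, 1, 5])
floats_gap = floats.reindex([0, 1, 5])
dates_gap = dates.reindex([0, 1, 5])

print(ints.dtype, "->", ints_gap.dtype)      # integer column is upcast to float
print(floats.dtype, "->", floats_gap.dtype)  # float column is unchanged
print(dates.dtype, "->", dates_gap.dtype)    # datetime column is unchanged
print(dates_gap.iloc[2])                     # missing datetime is shown as NaT
```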

    You may try also setting single rows:

    df.iloc[3,:] = 0 # will convert datetime to object only
    df.iloc[4,:] = '' # will convert all columns to object
    

    Note that if we set a string inside a non-string column, the column becomes object dtype.
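
    Once a column has become object, it can help to look at what it really contains. Below is a minimal sketch with a hypothetical column; mapping type over the values and converting back with pd.to_numeric are suggestions added here, not steps from the code above.

```python
import pandas as pd

# Hypothetical column that started numeric and had a string assigned into it.
col = pd.Series([123, "abc", 3.14], dtype=object)

print(col.dtype)             # object
element_types = col.map(type)
print(element_types)         # the Python type of each element

# errors="coerce" turns entries that cannot be parsed as numbers into NaN,
# which gives back a numeric column.
numeric = pd.to_numeric(col, errors="coerce")
print(numeric.dtype)
print(numeric)
```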
