Pandas/Google BigQuery: Schema mismatch makes the upload fail

风流意气都作罢 · Submitted on 2020-01-07 04:38:28

Question


The schema in my google table looks like this:

price_datetime : DATETIME,
symbol         : STRING,
bid_open       : FLOAT,
bid_high       : FLOAT,
bid_low        : FLOAT,
bid_close      : FLOAT,
ask_open       : FLOAT,
ask_high       : FLOAT,
ask_low        : FLOAT,
ask_close      : FLOAT

After I do a pandas.read_gbq I get a dataframe with column dtypes like this:

price_datetime     object
symbol             object
bid_open          float64
bid_high          float64
bid_low           float64
bid_close         float64
ask_open          float64
ask_high          float64
ask_low           float64
ask_close         float64
dtype: object

Now I want to use to_gbq, so I convert my local dataframe (which I just made) from these dtypes:

price_datetime    datetime64[ns]
symbol                    object
bid_open                 float64
bid_high                 float64
bid_low                  float64
bid_close                float64
ask_open                 float64
ask_high                 float64
ask_low                  float64
ask_close                float64
dtype: object

to these dtypes:

price_datetime     object
symbol             object
bid_open          float64
bid_high          float64
bid_low           float64
bid_close         float64
ask_open          float64
ask_high          float64
ask_low           float64
ask_close         float64
dtype: object

by doing:

df['price_datetime'] = df['price_datetime'].astype(object)
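(For context: that astype(object) call changes the column's numpy dtype "kind" from 'M' to 'O', which is what to_gbq's schema generation keys off of. A minimal sketch with a made-up one-row frame:)

```python
import pandas as pd

# Hypothetical one-row frame standing in for the asker's data
df = pd.DataFrame({"price_datetime": pd.to_datetime(["2017-07-06 12:00:00"])})
print(df["price_datetime"].dtype.kind)  # 'M' -> datetime64[ns]

# Casting to object changes the dtype kind
df["price_datetime"] = df["price_datetime"].astype(object)
print(df["price_datetime"].dtype.kind)  # 'O' -> generic object
```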

Now I (think) I am ready to use to_gbq, so I do:

import pandas
pandas.io.gbq.to_gbq(df, <table_name>, <project_name>, if_exists='append')

but I get the error:

---------------------------------------------------------------------------
InvalidSchema                             Traceback (most recent call last)
<ipython-input-15-d5a3f86ad382> in <module>()
      1 a = time.time()
----> 2 pandas.io.gbq.to_gbq(df, <table_name>, <project_name>, if_exists='append')
      3 b = time.time()
      4 
      5 print(b-a)

C:\Users\me\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key)
    825         elif if_exists == 'append':
    826             if not connector.verify_schema(dataset_id, table_id, table_schema):
--> 827                 raise InvalidSchema("Please verify that the structure and "
    828                                     "data types in the DataFrame match the "
    829                                     "schema of the destination table.")

InvalidSchema: Please verify that the structure and data types in the DataFrame match the schema of the destination table.

Answer 1:


This is probably an issue related to pandas. If you check the code for to_gbq, you'll see that it runs this code:

table_schema = _generate_bq_schema(dataframe)

Where _generate_bq_schema is given by:

def _generate_bq_schema(df, default_type='STRING'):
    """ Given a passed df, generate the associated Google BigQuery schema.
    Parameters
    ----------
    df : DataFrame
    default_type : string
        The default big query type in case the type of the column
        does not exist in the schema.
    """

    type_mapping = {
        'i': 'INTEGER',
        'b': 'BOOLEAN',
        'f': 'FLOAT',
        'O': 'STRING',
        'S': 'STRING',
        'U': 'STRING',
        'M': 'TIMESTAMP'
    }

    fields = []
    for column_name, dtype in df.dtypes.iteritems():
        fields.append({'name': column_name,
                       'type': type_mapping.get(dtype.kind, default_type)})

    return {'fields': fields}

As you can see, there's no type mapping for DATETIME. A datetime column cast to object inevitably gets mapped to type STRING (since its dtype.kind is "O"), and then a conflict with the table's DATETIME schema occurs.
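You can check what that mapping produces for the dtypes in the question. This sketch copies the type_mapping dict from the pandas code above and runs it over a hypothetical frame (the symbol value "EURUSD" is made up):

```python
import pandas as pd

# Same mapping used by pandas' _generate_bq_schema at the time
type_mapping = {
    'i': 'INTEGER', 'b': 'BOOLEAN', 'f': 'FLOAT',
    'O': 'STRING', 'S': 'STRING', 'U': 'STRING', 'M': 'TIMESTAMP',
}

df = pd.DataFrame({
    "price_datetime": pd.to_datetime(["2017-07-06 12:00:00"]),
    "symbol": ["EURUSD"],
    "bid_open": [1.13],
})

# datetime64[ns] -> TIMESTAMP, object -> STRING, float64 -> FLOAT
for name, dtype in df.dtypes.items():
    print(name, dtype.kind, type_mapping.get(dtype.kind, "STRING"))
```

Note there is no dtype kind that maps to DATETIME, so the generated schema can never match a DATETIME column in the destination table.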

The only workaround I'm aware of for now would be to change your table schema from DATETIME to either TIMESTAMP or STRING.

It would probably be a good idea to open a new issue on the pandas-gbq repository asking for this code to accept DATETIME as well.

[EDIT]:

I've opened this issue in their repository.




Answer 2:


I had to do two things to solve this. First, I deleted my table and recreated it with the column as a TIMESTAMP type rather than a DATETIME type. This ensured the schema matched when a pandas.DataFrame with a datetime64[ns] column was uploaded with to_gbq, which (for now) converts datetime64[ns] to the TIMESTAMP type, not the DATETIME type.

The second thing I did was upgrade from pandas 0.19 to pandas 0.20. These two things solved my problem of a schema mismatch.
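Put together, the working path is to leave the column as datetime64[ns] (rather than casting it to object as in the question) and have the table column be TIMESTAMP. A sketch, with a made-up frame and placeholder table/project names:

```python
import pandas as pd

# Hypothetical frame standing in for the asker's data
df = pd.DataFrame({
    "price_datetime": ["2017-07-06 12:00:00"],
    "bid_open": [1.13],
})

# Keep the column as datetime64[ns] so it maps to TIMESTAMP, not STRING
df["price_datetime"] = pd.to_datetime(df["price_datetime"])
print(df.dtypes)

# Then upload against a table whose column is TIMESTAMP
# (placeholders; requires credentials, so commented out here):
# df.to_gbq("<dataset>.<table>", project_id="<project>", if_exists="append")
```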




Answer 3:


I had this issue and determined that pandas was sending the columns in alphabetical order by column name, which in my case did not match the column order of the BigQuery table. As a result, a column expecting a date value received an integer, and so on, which raises the "InvalidSchema" error. Check your column order.
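If that's your situation, reindexing the DataFrame's columns to match the table's declared order before uploading should fix it. A sketch with hypothetical column names:

```python
import pandas as pd

# Column order as declared in the BigQuery table (hypothetical)
schema_order = ["price_datetime", "symbol", "bid_open"]

# A frame whose columns are in the "wrong" (e.g. alphabetical) order
df = pd.DataFrame({
    "bid_open": [1.13],
    "price_datetime": pd.to_datetime(["2017-07-06 12:00:00"]),
    "symbol": ["EURUSD"],
})

# Reorder columns to match the destination table before calling to_gbq
df = df[schema_order]
print(list(df.columns))  # ['price_datetime', 'symbol', 'bid_open']
```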



Source: https://stackoverflow.com/questions/44953463/pandas-google-bigquery-schema-mismatch-makes-the-upload-fail
