How to set the primary key when writing a pandas DataFrame to a SQLite database table using df.to_sql

悲&欢浪女 2020-12-09 05:30

I have created a SQLite database using pandas df.to_sql; however, accessing it seems considerably slower than just reading in the 500 MB CSV file.

I need to:

6 answers
  • 2020-12-09 05:56

    There's no way to do that. You can only set the primary key directly in the database after you move the data.

  • 2020-12-09 05:58

    Unfortunately, there is currently no way to set a primary key with the pandas df.to_sql() method. To make matters worse, SQLite cannot add a primary key to a column after the table has been created.

    However, a workaround is to create the table in SQLite with df.to_sql(), then create a duplicate table with the primary key defined, copy the data over, and drop the old table to clean up.

    It would be something along these lines:

    import pandas as pd
    import sqlite3
    
    # filename is a placeholder for the name of your CSV file
    df = pd.read_csv("/Users/data/" + filename)
    # replace spaces in the column names with underscores
    df.columns = [i.replace(' ', '_') for i in df.columns]
    
    #connect to the database
    conn = sqlite3.connect('database')
    
    #write the pandas dataframe to a sqlite table
    #(the flavor argument has been removed from pandas; index=False keeps the
    #column list identical to new_table below so that SELECT * lines up)
    df.to_sql('my_table', conn, if_exists='replace', index=False)
    
    c = conn.cursor()
    
    c.executescript('''
        PRAGMA foreign_keys=off;
    
        BEGIN TRANSACTION;
        ALTER TABLE my_table RENAME TO old_table;
    
        /*create a new table with the same column names and types while
        defining a primary key for the desired column*/
        CREATE TABLE new_table (col_1 TEXT PRIMARY KEY NOT NULL,
                                col_2 TEXT);
    
        INSERT INTO new_table SELECT * FROM old_table;
    
        DROP TABLE old_table;
        COMMIT TRANSACTION;
    
        PRAGMA foreign_keys=on;''')
    
    #close out the connection
    c.close()
    conn.close()
    

    I have used this approach in the past when I ran into this issue, wrapping the whole thing in a function to make it more convenient.

    In my limited experience with SQLite, not being able to add a primary key after a table has been created, not being able to perform update-or-insert (UPSERT) operations, and the lack of UPDATE JOIN have caused a lot of frustration and some unconventional workarounds. (SQLite 3.24 and later do support UPSERT through INSERT ... ON CONFLICT.)

    Lastly, the pandas df.to_sql() method has a dtype keyword argument that takes a dictionary mapping column names to types, e.g. dtype={'col_1': 'TEXT'}.
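
    As a minimal sketch of the dtype argument mentioned above (the table name, column names, and data are made up for illustration, and a plain sqlite3 connection is assumed, in which case the type names are given as strings):

```python
import sqlite3

import pandas as pd

# Toy data (hypothetical): zip-code-like integers that should be stored as text.
df = pd.DataFrame({"col_1": [10001, 10002, 10003], "col_2": [1.5, 2.5, 3.5]})

conn = sqlite3.connect(":memory:")

# With a plain sqlite3 connection, dtype values are SQLite type names as strings.
df.to_sql("my_table", conn, index=False, dtype={"col_1": "TEXT"})

# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column.
declared_types = {
    row[1]: row[2] for row in conn.execute("PRAGMA table_info(my_table)")
}
print(declared_types)

conn.close()
```

    Without the dtype entry, pandas would pick the column type from the data, so col_1 would be declared as an integer column.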

  • 2020-12-09 06:00

    Building on Chris Guarino's answer, it is almost impossible to assign a primary key to an already existing column using the df.to_sql() method. Likewise, with your 500 MB CSV file, you cannot easily create a duplicate table with a huge number of columns.

    However, a small workaround is to add a new column as the primary key when writing the dataframe to SQL. You can iterate over the dataframe's columns (dataframe.columns) to build the CREATE TABLE statement for a new table and declare the primary key at creation time. This way, no duplicate table is needed.

    I am adding a small code snippet for it.

    import pandas as pd
    from sqlalchemy import create_engine, text
    
    # read_excel has no sep argument, so it is dropped here
    df = pd.read_excel(r'C:\XXX\XXX\XXXX\XXX.xlsx')
    X1 = df.iloc[0:,0:]
    dataset = X1.astype('float32')
    dataset['date'] = pd.date_range(start='1/1/2020', periods=len(dataset), freq='D')
    dataset=dataset.set_index('date')
    
    engine = create_engine('sqlite:///measurement.db')
    
    sqlite_table = "table1"
    # engine.begin() opens a transaction and commits it when the block exits
    with engine.begin() as sqlite_connection:
        # create the table up front so the primary key can be declared
        sqlite_connection.execute(text(
            "CREATE TABLE table1 (id INTEGER PRIMARY KEY AUTOINCREMENT,  date TIMESTAMP, " +
            ",".join(["%s REAL" % x for x in dataset.columns]) + ")"))
        # append into the existing table instead of letting pandas create it
        dataset.to_sql(sqlite_table, sqlite_connection, if_exists='append')
    
    Output database table (one tuple per column, in the format returned by PRAGMA table_info):
    [(0, 'id', 'INTEGER', 0, None, 1),
    (1, 'date', 'TIMESTAMP', 0, None, 0),
    (2, 'time_stamp', 'REAL', 0, None, 0),
    (3, 'column_1', 'REAL', 0, None, 0),
    (4, 'column_2', 'REAL', 0, None, 0)]
    

    This method works only if the dataframe has an index. Also, for the index to appear as a column in the table, it must be declared explicitly in the CREATE TABLE statement.

    Hope this helps when creating large databases.
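
    The create-first, append-later idea can also be tried without a spreadsheet or SQLAlchemy. The following is only a sketch under assumptions: the toy data, the in-memory database, and the use of a plain sqlite3 connection are illustrative choices, not part of the snippet above.

```python
import sqlite3

import pandas as pd

# Toy frame with a date index, standing in for the measurement data.
dataset = pd.DataFrame(
    {"column_1": [0.1, 0.2, 0.3], "column_2": [1.0, 2.0, 3.0]},
    index=pd.date_range(start="2020-01-01", periods=3, freq="D", name="date"),
)

conn = sqlite3.connect(":memory:")

# Build the CREATE TABLE statement from the dataframe's columns and
# declare the surrogate primary key up front.
column_defs = ", ".join('"{}" REAL'.format(col) for col in dataset.columns)
conn.execute(
    "CREATE TABLE table1 ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, date TIMESTAMP, " + column_defs + ")"
)

# if_exists='append' inserts into the existing table instead of recreating it,
# so the primary key declared above is kept. The index is written to "date".
dataset.to_sql("table1", conn, if_exists="append")

rows = conn.execute("SELECT id, column_1 FROM table1 ORDER BY id").fetchall()
# The last field of each PRAGMA table_info row is non-zero for primary key columns.
pk_columns = [r[1] for r in conn.execute("PRAGMA table_info(table1)") if r[5]]
print(rows, pk_columns)

conn.close()
```

    Because the id column is never supplied by the dataframe, SQLite fills it in on insert.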

  • 2020-12-09 06:03

    Building on Chris Guarino's answer, here are some functions that provide a more general solution. See the example at the bottom for how to use them.

    import re
    
    def get_create_table_string(tablename, connection):
        # look up the original CREATE TABLE statement; the table name is bound
        # as a parameter instead of being formatted into the SQL text
        sql = "select sql from sqlite_master where name = ? and type = 'table'"
        result = connection.execute(sql, (tablename,))
    
        create_table_string = result.fetchone()[0]
        return create_table_string
    
    def add_pk_to_create_table_string(create_table_string, colname):
        regex = "(\n.+{}[^,]+)(,)".format(colname)
        return re.sub(regex, "\\1 PRIMARY KEY,",  create_table_string, count=1)
    
    def add_pk_to_sqlite_table(tablename, index_column, connection):
        cts = get_create_table_string(tablename, connection)
        cts = add_pk_to_create_table_string(cts, index_column)
        template = """
        BEGIN TRANSACTION;
            ALTER TABLE {tablename} RENAME TO {tablename}_old_;
    
            {cts};
    
            INSERT INTO {tablename} SELECT * FROM {tablename}_old_;
    
            DROP TABLE {tablename}_old_;
    
        COMMIT TRANSACTION;
        """
    
        create_and_drop_sql = template.format(tablename = tablename, cts = cts)
        connection.executescript(create_and_drop_sql)
    
    # Example:
    
    # import pandas as pd 
    # import sqlite3
    
    # df = pd.DataFrame({"a": [1,2,3], "b": [2,3,4]})
    # con = sqlite3.connect("deleteme.db")
    # df.to_sql("df", con, if_exists="replace")
    
    # add_pk_to_sqlite_table("df", "index", con)
    # r = con.execute("select sql from sqlite_master where name = 'df' and type = 'table'")
    # print(r.fetchone()[0])
    

    There is a gist of this code here

  • 2020-12-09 06:06

    There is another option for getting pandas to create a primary key on table creation using some undocumented methods from the pandas internals (at your own risk). You can peruse the code here. The key is the keys param of SQLTable which is not exposed in the to_sql API.

    Note that I reset_index and set index=False in the call to SQLTable to prevent a duplicate/unnecessary index from being created in addition to the primary key constraint.

    from pandas.io.sql import SQLTable, pandasSQL_builder
    
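    df = ...              # placeholder: your dataframe
    engine = ...          # placeholder: your sqlalchemy engine
    if_exists = "fail"    # example value; "replace" and "append" are the other options
    chunksize = 1000      # example value; rows per insert batch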
    
    table = SQLTable(
        "my_table",
        pandasSQL_builder(engine, schema="my_schema"),
        frame=df.reset_index(),
        index=False,
        keys=df.index.names,
        if_exists=if_exists,
        schema="my_schema",
    )
    
    table.create() # Will honor your if_exists settings
    table.insert(chunksize, method="multi") # This hits limits in allowed sqlite params if chunks are too large
    

    There is also a get_schema function in that file that can get you a create table statement if you want to do something manually.
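
    A rough sketch of that manual route, assuming a plain sqlite3 connection and made-up table and column names. get_schema lives in pandas.io.sql and its output may differ between pandas versions, so treat this as illustrative rather than a stable API:

```python
import sqlite3

import pandas as pd
from pandas.io.sql import get_schema

# Hypothetical data: "id" is the column intended as the primary key.
df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
conn = sqlite3.connect(":memory:")

# keys= names the primary key column(s); con= selects the SQLite dialect.
create_stmt = get_schema(df, "my_table", keys="id", con=conn)
print(create_stmt)

# Create the table from the generated statement, then append the rows.
conn.execute(create_stmt)
df.to_sql("my_table", conn, if_exists="append", index=False)

# The last field of each PRAGMA table_info row is non-zero for primary key columns.
pk_columns = [r[1] for r in conn.execute("PRAGMA table_info(my_table)") if r[5]]
print(pk_columns)

conn.close()
```

    The generated statement can also be edited by hand before it is executed, for example to add NOT NULL constraints.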

  • 2020-12-09 06:18

    In SQLite, with a normal rowid table, unless the primary key is a single INTEGER column (see ROWIDs and the INTEGER PRIMARY KEY in the documentation), it's equivalent to a UNIQUE index, because the real PK of a normal table is the rowid.

    Notes from the documentation for rowid tables:

    The PRIMARY KEY of a rowid table (if there is one) is usually not the true primary key for the table, in the sense that it is not the unique key used by the underlying B-tree storage engine. The exception to this rule is when the rowid table declares an INTEGER PRIMARY KEY. In the exception, the INTEGER PRIMARY KEY becomes an alias for the rowid.

    The true primary key for a rowid table (the value that is used as the key to look up rows in the underlying B-tree storage engine) is the rowid.

    The PRIMARY KEY constraint for a rowid table (as long as it is not the true primary key or INTEGER PRIMARY KEY) is really the same thing as a UNIQUE constraint. Because it is not a true primary key, columns of the PRIMARY KEY are allowed to be NULL, in violation of all SQL standards.

    So you can easily fake a primary key after creating the table with:

    CREATE UNIQUE INDEX mytable_fake_pk ON mytable(pk_column)
    

    Besides the NULL thing, you won't get the benefits of an INTEGER PRIMARY KEY if your column is supposed to hold integers, like taking up less space and auto-generating values on insert if left out, but it'll otherwise work for most purposes.
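
    To see the fake primary key in action, here is a small sketch that writes a table with to_sql, adds the unique index, and then tries to insert a duplicate key. The data and the in-memory database are assumptions for illustration only:

```python
import sqlite3

import pandas as pd

# Hypothetical data; pk_column holds the values that must stay unique.
df = pd.DataFrame({"pk_column": ["a", "b", "c"], "value": [1, 2, 3]})
conn = sqlite3.connect(":memory:")
df.to_sql("mytable", conn, index=False)

# Add the uniqueness guarantee after the table has been created.
conn.execute("CREATE UNIQUE INDEX mytable_fake_pk ON mytable(pk_column)")

# A duplicate key is now rejected with an IntegrityError.
try:
    conn.execute("INSERT INTO mytable (pk_column, value) VALUES ('a', 99)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)
conn.close()
```

    The same index also speeds up lookups on pk_column, which is often the reason for wanting a primary key in the first place.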
