Using SQLAlchemy to load a CSV file into a database

2020-11-30 22:38

I would like to load CSV files into a database using SQLAlchemy.

4 Answers
  • 2020-11-30 23:02

    I have had the exact same problem, and I found it paradoxically easier to use a 2-step process with pandas:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine('sqlite:///my.db')  # any SQLAlchemy database URL works here
    data_df = pd.read_csv(csv_file_path)
    data_df.to_sql('tbl_name', con=engine, index=True, index_label='id', if_exists='replace')
    

    Note that my approach is similar to this one, but somehow Google sent me to this thread instead, so I thought I would share.
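
    If the CSV is too large to fit in memory, pandas can stream it in chunks. A minimal sketch along the same lines, assuming the same engine and the placeholder table name 'tbl_name':

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine('sqlite:///my.db')  # hypothetical database URL
    # Read the CSV in chunks so only a slice of it is in memory at a time
    for chunk in pd.read_csv(csv_file_path, chunksize=10000):
        chunk.to_sql('tbl_name', con=engine, index=False, if_exists='append')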

  • 2020-11-30 23:07

    To import a relatively small CSV file into a database using SQLAlchemy, you can use engine.execute(my_table.insert(), list_of_row_dicts), as described in detail in the "Executing Multiple Statements" section of the SQLAlchemy tutorial.

    This is sometimes referred to as "executemany" style of invocation, because it results in an executemany DBAPI call. The DB driver might execute a single multi-value INSERT .. VALUES (..), (..), (..) statement, which results in fewer round-trips to the DB and faster execution:

    • the MySQL connector does that by default
    • Postgres' psycopg2 does not, unless you initialize it with create_engine(..., executemany_mode='values')
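
    For example, enabling the psycopg2 fast path looks like this (a sketch; the connection URL is a placeholder, and executemany_mode is a SQLAlchemy 1.3+ option):

    from sqlalchemy import create_engine

    # 'values' mode batches rows into multi-row INSERT .. VALUES statements
    engine = create_engine(
        'postgresql+psycopg2://user:password@localhost/mydb',
        executemany_mode='values',
    )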

    According to SQLAlchemy's FAQ, this is the fastest you can get without using DB-specific bulk loading methods, such as COPY FROM in Postgres, LOAD DATA LOCAL INFILE in MySQL, etc. In particular it's faster than using the plain ORM (as in the answer by @Manuel J. Diaz here), bulk_save_objects, or bulk_insert_mappings.

    import csv
    from sqlalchemy import create_engine, Table, Column, Integer, MetaData
    
    engine = create_engine('sqlite:///sqlalchemy.db', echo=True)
    
    metadata = MetaData()
    # Define the table with sqlalchemy:
    my_table = Table('MyTable', metadata,
        Column('foo', Integer),
        Column('bar', Integer),
    )
    metadata.create_all(engine)
    insert_query = my_table.insert()
    
    # Or read the definition from the DB:
    # metadata.reflect(engine, only=['MyTable'])
    # my_table = Table('MyTable', metadata, autoload=True, autoload_with=engine)
    # insert_query = my_table.insert()
    
    # Or hardcode the SQL query:
    # insert_query = "INSERT INTO MyTable (foo, bar) VALUES (:foo, :bar)"
    
    with open('test.csv', 'r', encoding="utf-8") as csvfile:
        csv_reader = csv.reader(csvfile, delimiter=',')
        engine.execute(
            insert_query,
            [{"foo": row[0], "bar": row[1]} 
                for row in csv_reader]
        )
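
    For a CSV too large to build one giant list of dicts, the same insert can be issued in batches. A sketch built on the code above (the batch size is arbitrary):

    import csv
    from itertools import islice

    BATCH_SIZE = 10000  # tune to taste

    with open('test.csv', 'r', encoding="utf-8") as csvfile:
        csv_reader = csv.reader(csvfile, delimiter=',')
        while True:
            # Pull at most BATCH_SIZE rows off the reader per iteration
            batch = [{"foo": row[0], "bar": row[1]}
                     for row in islice(csv_reader, BATCH_SIZE)]
            if not batch:
                break
            engine.execute(insert_query, batch)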
    
  • 2020-11-30 23:08

    In case your CSV is quite large, using INSERTs is very inefficient. You should use a bulk loading mechanism, which differs from database to database. E.g. in PostgreSQL you should use the COPY FROM command:

    from sqlalchemy import create_engine

    with open(csv_file_path, 'r') as f:
        conn = create_engine('postgresql+psycopg2://...').raw_connection()
        cursor = conn.cursor()
        # Stream the file straight into the table via psycopg2's COPY support
        cmd = 'COPY tbl_name(col1, col2, col3) FROM STDIN WITH (FORMAT CSV, HEADER FALSE)'
        cursor.copy_expert(cmd, f)
        conn.commit()
        conn.close()
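
    The rough MySQL counterpart is LOAD DATA LOCAL INFILE. A sketch, assuming the pymysql driver, a placeholder DSN and table name, and local_infile enabled on both server and client:

    from sqlalchemy import create_engine

    engine = create_engine(
        'mysql+pymysql://user:password@localhost/mydb',
        connect_args={'local_infile': 1},
    )
    conn = engine.raw_connection()
    try:
        cursor = conn.cursor()
        # The file is read client-side by the driver and streamed to the server
        cursor.execute(
            "LOAD DATA LOCAL INFILE 'data.csv' INTO TABLE tbl_name "
            "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'"
        )
        conn.commit()
    finally:
        conn.close()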
    
  • 2020-11-30 23:21

    Because of the power of SQLAlchemy, I'm also using it on a project. Its power comes from the object-oriented way of "talking" to a database instead of hardcoding SQL statements that can be a pain to manage. Not to mention, it's also a lot faster in this case, as the timings below show.

    To answer your question bluntly, yes! Storing data from a CSV into a database using SQLAlchemy is a piece of cake. Here's a full working example (I used SQLAlchemy 1.0.6 and Python 2.7.6):

    from numpy import genfromtxt
    from time import time
    from datetime import datetime
    from sqlalchemy import Column, Integer, Float, Date
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker
    
    def Load_Data(file_name):
        data = genfromtxt(file_name, delimiter=',', skip_header=1, converters={0: lambda s: str(s)})
        return data.tolist()
    
    Base = declarative_base()
    
    class Price_History(Base):
        #Tell SQLAlchemy what the table name is and if there's any table-specific arguments it should know about
        __tablename__ = 'Price_History'
        __table_args__ = {'sqlite_autoincrement': True}
        #tell SQLAlchemy the name of column and its attributes:
        id = Column(Integer, primary_key=True, nullable=False) 
        date = Column(Date)
        opn = Column(Float)
        hi = Column(Float)
        lo = Column(Float)
        close = Column(Float)
        vol = Column(Float)
    
    if __name__ == "__main__":
        t = time()
    
        #Create the database
        engine = create_engine('sqlite:///csv_test.db')
        Base.metadata.create_all(engine)
    
        #Create the session
        session = sessionmaker()
        session.configure(bind=engine)
        s = session()
    
        try:
            file_name = "t.csv" #sample CSV file used:  http://www.google.com/finance/historical?q=NYSE%3AT&ei=W4ikVam8LYWjmAGjhoHACw&output=csv
            data = Load_Data(file_name) 
    
            for i in data:
                record = Price_History(**{
                    'date' : datetime.strptime(i[0], '%d-%b-%y').date(),
                    'opn' : i[1],
                    'hi' : i[2],
                    'lo' : i[3],
                    'close' : i[4],
                    'vol' : i[5]
                })
                s.add(record) #Add all the records
    
            s.commit() #Attempt to commit all the records
        except Exception:
            s.rollback() #Rollback the changes on error
        finally:
            s.close() #Close the connection
        print "Time elapsed: " + str(time() - t) + " s." #0.091s
    

    (Note: this is not necessarily the "best" way to do this, but I think this format is very readable for a beginner; it's also very fast: 0.091s for 251 records inserted!)

    I think if you go through it line by line, you'll see what a breeze it is to use. Notice the lack of SQL statements -- hooray! I also took the liberty of using numpy to load the CSV contents in two lines, but it can be done without it if you like.
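
    For reference, a standard-library version of Load_Data might look like the sketch below. Note it returns rows of strings, so the numeric columns would need an explicit float() conversion before building the records:

    import csv

    def Load_Data(file_name):
        # Read all rows, skipping the header, using only the csv module
        with open(file_name, 'r') as f:
            reader = csv.reader(f)
            next(reader)  # skip the header row
            return [row for row in reader]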

    If you wanted to compare against the traditional way of doing it, here's a full-working example for reference:

    import sqlite3
    import time
    from numpy import genfromtxt
    
    def dict_factory(cursor, row):
        d = {}
        for idx, col in enumerate(cursor.description):
            d[col[0]] = row[idx]
        return d
    
    
    def Create_DB(db):      
        #Create DB and format it as needed
        with sqlite3.connect(db) as conn:
            conn.row_factory = dict_factory
            conn.text_factory = str
    
            cursor = conn.cursor()
    
            cursor.execute("CREATE TABLE [Price_History] ([id] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL UNIQUE, [date] DATE, [opn] FLOAT, [hi] FLOAT, [lo] FLOAT, [close] FLOAT, [vol] INTEGER);")
    
    
    def Add_Record(db, data):
        #Insert record into table
        with sqlite3.connect(db) as conn:
            conn.row_factory = dict_factory
            conn.text_factory = str
    
            cursor = conn.cursor()
    
            cursor.execute("INSERT INTO Price_History({cols}) VALUES({vals});".format(cols = str(data.keys()).strip('[]'), 
                        vals=str([data[i] for i in data]).strip('[]')
                        ))
    
    
    def Load_Data(file_name):
        data = genfromtxt(file_name, delimiter=',', skip_header=1, converters={0: lambda s: str(s)})
        return data.tolist()
    
    
    if __name__ == "__main__":
        t = time.time() 
    
        db = 'csv_test_sql.db' #Database filename 
        file_name = "t.csv" #sample CSV file used:  http://www.google.com/finance/historical?q=NYSE%3AT&ei=W4ikVam8LYWjmAGjhoHACw&output=csv
    
        data = Load_Data(file_name) #Get data from CSV
    
        Create_DB(db) #Create DB
    
        #For every record, format and insert to table
        for i in data:
            record = {
                    'date' : i[0],
                    'opn' : i[1],
                    'hi' : i[2],
                    'lo' : i[3],
                    'close' : i[4],
                    'vol' : i[5]
                }
            Add_Record(db, record)
    
        print "Time elapsed: " + str(time.time() - t) + " s." #3.604s
    

    (Note: even in the "old" way, this is by no means the best way to do this, but it's very readable and a "1-to-1" translation from the SQLAlchemy way vs. the "old" way.)

    Notice the SQL statements: one to create the table, the other to insert records. Also, notice that it's a bit more cumbersome to maintain long SQL strings vs. a simple class attribute addition. Liking SQLAlchemy so far?

    As for your foreign key inquiry, of course. SQLAlchemy has the power to do this too. Here's an example of what a class attribute would look like with a foreign key assignment (assuming the ForeignKey class has also been imported from the sqlalchemy module):

    class Asset_Analysis(Base):
        #Tell SQLAlchemy what the table name is and if there's any table-specific arguments it should know about
        __tablename__ = 'Asset_Analysis'
        __table_args__ = {'sqlite_autoincrement': True}
        #tell SQLAlchemy the name of column and its attributes:
        id = Column(Integer, primary_key=True, nullable=False) 
        fid = Column(Integer, ForeignKey('Price_History.id'))
    

    which marks the "fid" column as a foreign key pointing to Price_History's id column.
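
    If you also want to traverse that link in Python rather than by joining manually, SQLAlchemy's relationship() can sit on top of the foreign key. A sketch, assuming the extra import from sqlalchemy.orm:

    from sqlalchemy.orm import relationship

    class Asset_Analysis(Base):
        __tablename__ = 'Asset_Analysis'
        __table_args__ = {'sqlite_autoincrement': True}
        id = Column(Integer, primary_key=True, nullable=False)
        fid = Column(Integer, ForeignKey('Price_History.id'))
        # asset_analysis.price_history now returns the parent Price_History row
        price_history = relationship('Price_History')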

    Hope that helps!
