Python CSV to SQLite

误落风尘 2020-11-28 05:22

I am "converting" a large (~1.6 GB) CSV file and inserting specific fields of the CSV into a SQLite database. Essentially my code looks like:

import csv, sqlite3


        
5 Answers
  • 2020-11-28 05:53

    It's possible to import the CSV directly:

    sqlite> .separator ","
    sqlite> .import filecsv.txt mytable
    

    http://www.sqlite.org/cvstrac/wiki?p=ImportingFiles
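
    If you would rather drive the command-line shell from your Python script instead of typing the commands interactively, here is a minimal sketch (assuming the sqlite3 binary is on your PATH; the database, file, and table names are placeholders):

    import subprocess

    # Feed the same dot-commands to the sqlite3 shell via stdin.
    commands = '.separator ","\n.import filecsv.txt mytable\n'
    subprocess.run(["sqlite3", "path/to/file.db"],
                   input=commands, text=True, check=True)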

  • 2020-11-28 05:54

    Try using transactions.

    begin    
    insert 50,000 rows    
    commit
    

    That will commit data periodically rather than once per row.
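
    In Python's sqlite3 module, a rough sketch of that pattern might look like the following (file, table, and column names are placeholders, not taken from the question):

    import csv, sqlite3

    conn = sqlite3.connect("path/to/file.db")
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS mytable (field2 TEXT, field4 TEXT)")

    batch = []
    with open("filecsv.txt", newline="") as f:
        for row in csv.reader(f):
            batch.append((row[1], row[3]))  # keep only the fields you need
            if len(batch) >= 50000:
                cur.executemany("INSERT INTO mytable (field2, field4) VALUES (?, ?)", batch)
                conn.commit()               # one commit per 50,000 rows
                batch = []

    if batch:                               # flush the final partial batch
        cur.executemany("INSERT INTO mytable (field2, field4) VALUES (?, ?)", batch)
        conn.commit()
    conn.close()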

  • 2020-11-28 06:01

    Chris is right - use transactions; divide the data into chunks and then store it.

    "... Unless already in a transaction, each SQL statement has a new transaction started for it. This is very expensive, since it requires reopening, writing to, and closing the journal file for each statement. This can be avoided by wrapping sequences of SQL statements with BEGIN TRANSACTION; and END TRANSACTION; statements. This speedup is also obtained for statements which don't alter the database." - Source: http://web.utk.edu/~jplyon/sqlite/SQLite_optimization_FAQ.html

    "... there is another trick you can use to speed up SQLite: transactions. Whenever you have to do multiple database writes, put them inside a transaction. Instead of writing to (and locking) the file each and every time a write query is issued, the write will only happen once when the transaction completes." - Source: How Scalable is SQLite?

    import csv, sqlite3, time

    def chunks(data, rows=10000):
        """Divide the data into chunks of 10,000 rows each."""
        for i in range(0, len(data), rows):
            yield data[i:i + rows]


    if __name__ == "__main__":

        t = time.time()

        conn = sqlite3.connect("path/to/file.db")
        conn.text_factory = str  # ensure TEXT values come back as str
        cur = conn.cursor()
        cur.execute('CREATE TABLE IF NOT EXISTS mytable (field2 VARCHAR, field4 VARCHAR)')

        # csv.reader returns an iterator with no len(), so materialize it
        # into a list before slicing it into chunks
        with open("filecsv.txt", newline="") as f:
            csvData = list(csv.reader(f))

        divData = chunks(csvData)  # divide into chunks of 10,000 rows each

        for chunk in divData:
            cur.execute('BEGIN TRANSACTION')

            for field1, field2, field3, field4, field5 in chunk:
                cur.execute('INSERT OR IGNORE INTO mytable (field2, field4) VALUES (?, ?)',
                            (field2, field4))

            cur.execute('COMMIT')

        conn.close()

        print("\n Time Taken: %.3f sec" % (time.time() - t))
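
    As a side note (not part of the original answer), the same chunk-per-transaction pattern can lean on the connection's context manager instead of issuing BEGIN/COMMIT by hand; this fragment reuses conn, cur, and divData from the listing above:

    # Each `with conn:` block commits the chunk on success and rolls back on error.
    for chunk in divData:
        with conn:
            cur.executemany('INSERT OR IGNORE INTO mytable (field2, field4) VALUES (?, ?)',
                            [(f2, f4) for _, f2, _, f4, _ in chunk])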
    
  • 2020-11-28 06:02

    Pandas makes it easy to load big files into databases in chunks. Read the CSV file into a Pandas DataFrame and then use the Pandas SQL writer (so Pandas does all the hard work). Here's how to write the data to the database in 100,000-row chunks.

    import pandas as pd
    import sqlite3

    conn = sqlite3.connect('path/to/file.db')

    orders = pd.read_csv('path/to/your/file.csv')
    orders.to_sql('orders', conn, if_exists='append', index=False, chunksize=100000)
    

    Modern Pandas versions are very performant. Don't reinvent the wheel. See here for more info.
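
    If the CSV itself is too large to hold in memory comfortably, a hedged variant (column names below are placeholders) is to stream the read as well, using read_csv's chunksize and usecols parameters, and append each chunk as it arrives:

    import pandas as pd
    import sqlite3

    conn = sqlite3.connect('path/to/file.db')

    # Each iteration yields a DataFrame of at most 100,000 rows,
    # restricted to the two columns of interest.
    for chunk in pd.read_csv('path/to/your/file.csv',
                             usecols=['field2', 'field4'],
                             chunksize=100000):
        chunk.to_sql('orders', conn, if_exists='append', index=False)

    conn.close()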

  • 2020-11-28 06:05

    As Chris and Sam have already said, transactions greatly improve insert performance.

    Let me also recommend another option: csvkit, a suite of Python utilities for working with CSV.

    To install:

    pip install csvkit
    

    To solve your problem:

    csvsql --db sqlite:///path/to/file.db --insert --table mytable filecsv.txt
    