Question
I was trying to read a very huge MySQL table made of several million rows. I used the Pandas library and chunks. See the code below:
import pandas as pd
import numpy as np
import pymysql.cursors

connection = pymysql.connect(user='xxx', password='xxx', database='xxx', host='xxx')

try:
    with connection.cursor() as cursor:
        query = "SELECT * FROM example_table;"
        chunks = []
        for chunk in pd.read_sql(query, connection, chunksize=1000):
            chunks.append(chunk)
        #print(len(chunks))
        result = pd.concat(chunks, ignore_index=True)
        #print(type(result))
        #print(result)
finally:
    print("Done!")
    connection.close()
Actually the execution time is acceptable if I limit the number of rows to select. But if I want to select even a modest amount of data (for example 1 million rows), then the execution time dramatically increases.
Is there a better/faster way to select the data from a relational database within Python?
Answer 1:
Another option might be to use the multiprocessing module, dividing the query up and sending it to multiple parallel processes, then concatenating the results.
Without knowing much about pandas chunking, I think you would have to do the chunking manually (which depends on the data)... Don't use LIMIT / OFFSET - the performance would be terrible.
This might not be a good idea, depending on the data. If there is a useful way to split up the query (e.g. if it's a timeseries, or there is some kind of appropriate index column to use), it might make sense. I've put in two examples below to show the different cases.
Example 1
import multiprocessing

import pandas as pd
import MySQLdb

def worker(y):
    # where y is a value in an indexed column, e.g. a category
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    query = "SELECT * FROM example_table WHERE col_x = {0}".format(y)
    return pd.read_sql(query, connection)

p = multiprocessing.Pool(processes=10)
# (or however many processes you want to allocate)
data = p.map(worker, [y for y in col_x_categories])
# assuming there is a reasonable number of categories in an indexed col_x
p.close()
results = pd.concat(data)
Example 2
import multiprocessing
import datetime

import pandas as pd
import MySQLdb

def worker(a, b):
    # where a and b are timestamps
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    query = "SELECT * FROM example_table WHERE x >= '{0}' AND x < '{1}'".format(a, b)
    return pd.read_sql(query, connection)

p = multiprocessing.Pool(processes=10)
# (or however many processes you want to allocate)
date_range = pd.date_range(start=d1, end=d2, freq="A-JAN")
# this is arbitrary here and will depend on your data / knowing your data beforehand
# (i.e. d1, d2 and an appropriate freq to use)
date_pairs = list(zip(date_range, date_range[1:]))
data = p.starmap(worker, date_pairs)  # starmap unpacks each (a, b) pair into worker's two arguments
p.close()
results = pd.concat(data)
There are probably nicer ways of doing this (and I haven't properly tested it, etc.). I'd be interested to know how it goes if you try it.
Answer 2:
You could try using a different MySQL connector. I would recommend trying mysqlclient, which is the fastest MySQL connector (by a considerable margin, I believe).
pymysql is a pure-Python MySQL client, whereas mysqlclient is a wrapper around the (much faster) C libraries.
Usage is basically the same as pymysql:
import MySQLdb
connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
Read more about the different connectors here: What's the difference between MySQLdb, mysqlclient and MySQL connector/Python?
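As a rough sketch (not part of the original answer; the credentials and table name are the placeholders from the question), the chunked read from the question only needs its connection swapped out for the mysqlclient one:

import pandas as pd
import MySQLdb

# Same chunked read as in the question, just using the mysqlclient (MySQLdb) driver.
connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
try:
    chunks = []
    for chunk in pd.read_sql("SELECT * FROM example_table;", connection, chunksize=1000):
        chunks.append(chunk)
    result = pd.concat(chunks, ignore_index=True)
finally:
    connection.close()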
Answer 3:
For those using Windows and having trouble installing MySQLdb: I'm using this approach to fetch data from a huge table.
import mysql.connector

connection = mysql.connector.connect(user='xxx', password='xxx', database='xxx', host='xxx')
cursor = connection.cursor()

i = 0
limit = 1000

while True:
    sql = "SELECT * FROM super_table LIMIT {}, {}".format(i, limit)
    cursor.execute(sql)
    rows = cursor.fetchall()
    if not len(rows):  # break the loop when no more rows
        print("Done!")
        break
    for row in rows:  # do something with results
        print(row)
    i += limit
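As Answer 1 notes, the OFFSET in this loop grows on every iteration, so each query scans and discards more rows than the last. A keyset-pagination sketch (not part of the original answer; it assumes the table has an indexed auto-increment id column) keeps every batch equally cheap:

import mysql.connector

connection = mysql.connector.connect(user='xxx', password='xxx', database='xxx', host='xxx')
cursor = connection.cursor()

last_id = 0   # assumes an indexed auto-increment primary key named id
limit = 1000

while True:
    # Seek past the last id seen instead of using a growing OFFSET,
    # so MySQL can jump straight to the next batch via the index.
    cursor.execute(
        "SELECT * FROM super_table WHERE id > %s ORDER BY id LIMIT %s",
        (last_id, limit),
    )
    rows = cursor.fetchall()
    if not rows:  # break the loop when no more rows
        print("Done!")
        break
    for row in rows:  # do something with results
        print(row)
    last_id = rows[-1][0]  # id is assumed to be the first selected column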
Source: https://stackoverflow.com/questions/51152183/fastest-way-to-read-huge-mysql-table-in-python