SQLAlchemy: dealing with CP-1252 data when Python is expecting it to be UTF-8

问题

I am working with an existing SQLite database and experiencing errors due to the data being encoded in CP-1252, when Python is expecting it to be UTF-8.

>>> import sqlite3
>>> conn = sqlite3.connect('dnd.sqlite')
>>> curs = conn.cursor()
>>> result = curs.execute("SELECT * FROM dnd_characterclass WHERE id=802")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
OperationalError: Could not decode to UTF-8 column 'short_description_html'
with text ' <p>Over a dozen deities have worshipers who are paladins, 
promoting law and good across Faer�n, but it is the Weave itself that

The offending character is \0xfb which decodes to û. Other offending texts include “?nd and slay illithids.” which uses "smart quotes" \0x93 and \0x94.

SQLite, python, unicode, and non-utf data details how this problem can be solved when using sqlite3 on its own.

However, I am using SQLAlchemy. How can I deal with CP-1252 encoded data in an SQLite database, when I am using SQLAlchemy?

Edit:

This would also apply for any other funny encodings in an SQLite TEXT field, like latin-1, cp437, and so on.

回答1:

SQLAlchemy and SQLite are behaving normally. The solution is to fix the non-UTF-8 data in the database.

I wrote the below, drawing inspiration from https://stackoverflow.com/a/2395414/1191425 . It:

loads up the target SQLite database
lists all columns in all tables
if the column is a text, char, or clob type - including variants like varchar and longtext - it re-encodes the data from the INPUT_ENCODING to UTF-8.

INPUT_ENCODING = 'cp1252' # The encoding you want to convert from
import sqlite3
db = sqlite3.connect('dnd_fixed.sqlite')
db.create_function('FIXENCODING', 1, lambda s: str(s).decode(INPUT_ENCODING))
cur = db.cursor()
tables = cur.execute('SELECT name FROM sqlite_master WHERE type="table"').fetchall()
tables = [t[0] for t in tables]
for table in tables:
    columns = cur.execute('PRAGMA table_info(%s)' % table ).fetchall() # Note: pragma arguments can't be parameterized.
    for column_id, column_name, column_type, nullable, default_value, primary_key in columns:
        if ('char' in column_type) or ('text' in column_type) or ('clob' in column_type):
            # Table names and column names can't be parameterized either.
            db.execute('UPDATE "{0}" SET "{1}" = FIXENCODING(CAST("{1}" AS BLOB))'.format(table, column_name))

After this script runs, all *text*, *char*, and *clob* fields are in UTF-8 and no more Unicode decoding errors will occur. I can now Faerûn to my heart's content.

回答2:

If you have a connection URI then you can add the following options to your DB connection URI:

DB_CONNECTION = mysql+pymysql://{username}:{password}@{host}/{db_name}?{options}
DB_OPTIONS = {
    "charset": "cp-1252",
    "use_unicode": 1,
}
connection_uri = DB_CONNECTION.format(
    username=???,
    ...,
    options=urllib.urlencode(DB_OPTIONS)        
)

Assuming your SQLLite driver can handle those options (pymysql can, but I don't know 100% about sqllite), then your queries will return unicode strings.

来源：https://stackoverflow.com/questions/29035115/sqlalchemy-dealing-with-cp-1252-data-when-python-is-expecting-it-to-be-utf-8

标签

python

sqlite

sqlalchemy