Question
I am working with an existing SQLite database and experiencing errors because the data is encoded in CP-1252, while Python expects it to be UTF-8.
>>> import sqlite3
>>> conn = sqlite3.connect('dnd.sqlite')
>>> curs = conn.cursor()
>>> result = curs.execute("SELECT * FROM dnd_characterclass WHERE id=802")
Traceback (most recent call last):
File "<input>", line 1, in <module>
OperationalError: Could not decode to UTF-8 column 'short_description_html'
with text ' <p>Over a dozen deities have worshipers who are paladins,
promoting law and good across Faer�n, but it is the Weave itself that
The offending character is \xfb, which decodes to û in CP-1252. Other offending texts include “?nd and slay illithids.”, which uses the "smart quote" bytes \x93 and \x94.
The question "SQLite, python, unicode, and non-utf data" details how this problem can be solved when using sqlite3 on its own.
However, I am using SQLAlchemy. How can I deal with CP-1252 encoded data in an SQLite database, when I am using SQLAlchemy?
Edit: This would also apply to any other funny encoding in an SQLite TEXT field, like latin-1, cp437, and so on.
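For reference, the sqlite3-only workaround from that linked question is to install a text_factory that decodes CP-1252 instead of UTF-8. A self-contained sketch (the in-memory table stands in for dnd.sqlite):

```python
import sqlite3

# Stand-in for dnd.sqlite: an in-memory table holding CP-1252 bytes
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE dnd_characterclass (id INTEGER, short_description_html TEXT)')
conn.execute(
    "INSERT INTO dnd_characterclass VALUES (802, CAST(? AS TEXT))",
    ('Faerûn'.encode('cp1252'),),  # contains \xfb, which is invalid UTF-8
)

# The workaround: tell sqlite3 to decode TEXT columns as CP-1252
conn.text_factory = lambda b: b.decode('cp1252')
row = conn.execute(
    'SELECT short_description_html FROM dnd_characterclass WHERE id=802'
).fetchone()
print(row[0])  # Faerûn
```

This only changes how rows are decoded on read; the bytes in the database file stay CP-1252.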
Answer 1:
SQLAlchemy and SQLite are behaving normally. The solution is to fix the non-UTF-8 data in the database.
I wrote the script below, drawing inspiration from https://stackoverflow.com/a/2395414/1191425. It:
- loads the target SQLite database
- lists all columns in all tables
- if a column is of a text, char, or clob type (including variants like varchar and longtext), re-encodes its data from INPUT_ENCODING to UTF-8.
import sqlite3

INPUT_ENCODING = 'cp1252'  # the encoding you want to convert from

db = sqlite3.connect('dnd_fixed.sqlite')
# The BLOB cast below hands this function raw bytes on Python 3;
# on Python 2 use lambda s: str(s).decode(INPUT_ENCODING) instead.
db.create_function('FIXENCODING', 1, lambda s: bytes(s).decode(INPUT_ENCODING))
cur = db.cursor()
tables = cur.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
tables = [t[0] for t in tables]
for table in tables:
    # Note: PRAGMA arguments can't be parameterized.
    columns = cur.execute('PRAGMA table_info(%s)' % table).fetchall()
    for column_id, column_name, column_type, nullable, default_value, primary_key in columns:
        # Declared types may be upper-case (TEXT, VARCHAR(100), ...), so compare case-insensitively.
        ctype = (column_type or '').lower()
        if ('char' in ctype) or ('text' in ctype) or ('clob' in ctype):
            # Table names and column names can't be parameterized either.
            db.execute('UPDATE "{0}" SET "{1}" = FIXENCODING(CAST("{1}" AS BLOB))'.format(table, column_name))
db.commit()  # without this, the updates are rolled back when the connection closes
After this script runs, all text, char, and clob fields are in UTF-8 and no more Unicode decoding errors will occur. I can now Faerûn to my heart's content.
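The same FIXENCODING trick can be demonstrated end to end on Python 3 with an in-memory database (a self-contained sketch; the table name and value are made up):

```python
import sqlite3

INPUT_ENCODING = 'cp1252'

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE t (s TEXT)')
# Plant a CP-1252 byte sequence that is not valid UTF-8
db.execute("INSERT INTO t VALUES (CAST(? AS TEXT))", ('Faerûn'.encode(INPUT_ENCODING),))

# The BLOB cast hands the function raw bytes on Python 3
db.create_function('FIXENCODING', 1, lambda b: b.decode(INPUT_ENCODING))
db.execute('UPDATE t SET s = FIXENCODING(CAST(s AS BLOB))')

# The column is now valid UTF-8, so the default text_factory succeeds
fixed = db.execute('SELECT s FROM t').fetchone()[0]
print(fixed)  # Faerûn
```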
Answer 2:
If you connect via a URI, you can append the following options to it:
DB_CONNECTION = 'mysql+pymysql://{username}:{password}@{host}/{db_name}?{options}'
DB_OPTIONS = {
    "charset": "cp1252",
    "use_unicode": 1,
}

connection_uri = DB_CONNECTION.format(
    username=???,
    ...,
    options=urllib.parse.urlencode(DB_OPTIONS)  # urllib.urlencode on Python 2
)
Assuming your SQLite driver can handle those options (pymysql can, but I don't know 100% about sqlite3), your queries will return Unicode strings.
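For SQLite specifically, the sqlite3 driver has no charset URI option. One workaround (my own sketch, not from the original answer) is to hand SQLAlchemy a pre-configured connection via create_engine's creator argument:

```python
import sqlite3
from sqlalchemy import create_engine

DB_PATH = 'dnd.sqlite'  # the existing CP-1252 database

def connect():
    conn = sqlite3.connect(DB_PATH)
    # Decode TEXT columns as CP-1252 instead of the default UTF-8
    conn.text_factory = lambda b: b.decode('cp1252')
    return conn

# SQLAlchemy calls connect() instead of building its own DBAPI connection
engine = create_engine('sqlite://', creator=connect)
```

Queries issued through this engine then return already-decoded strings, without rewriting the database file.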
Source: https://stackoverflow.com/questions/29035115/sqlalchemy-dealing-with-cp-1252-data-when-python-is-expecting-it-to-be-utf-8