SQLAlchemy: dealing with CP-1252 data when Python is expecting it to be UTF-8

﹥>﹥吖頭↗ 提交于 2020-02-08 09:51:31

问题


I am working with an existing SQLite database and experiencing errors due to the data being encoded in CP-1252, when Python is expecting it to be UTF-8.

>>> import sqlite3
>>> conn = sqlite3.connect('dnd.sqlite')
>>> curs = conn.cursor()
>>> result = curs.execute("SELECT * FROM dnd_characterclass WHERE id=802")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
OperationalError: Could not decode to UTF-8 column 'short_description_html'
with text ' <p>Over a dozen deities have worshipers who are paladins, 
promoting law and good across Faer�n, but it is the Weave itself that 

The offending character is \0xfb which decodes to û. Other offending texts include “?nd and slay illithids.” which uses "smart quotes" \0x93 and \0x94.

SQLite, python, unicode, and non-utf data details how this problem can be solved when using sqlite3 on its own.

However, I am using SQLAlchemy. How can I deal with CP-1252 encoded data in an SQLite database, when I am using SQLAlchemy?


Edit:

This would also apply for any other funny encodings in an SQLite TEXT field, like latin-1, cp437, and so on.


回答1:


SQLAlchemy and SQLite are behaving normally. The solution is to fix the non-UTF-8 data in the database.

I wrote the below, drawing inspiration from https://stackoverflow.com/a/2395414/1191425 . It:

  • loads up the target SQLite database
  • lists all columns in all tables
  • if the column is a text, char, or clob type - including variants like varchar and longtext - it re-encodes the data from the INPUT_ENCODING to UTF-8.

INPUT_ENCODING = 'cp1252' # The encoding you want to convert from
import sqlite3
db = sqlite3.connect('dnd_fixed.sqlite')
db.create_function('FIXENCODING', 1, lambda s: str(s).decode(INPUT_ENCODING))
cur = db.cursor()
tables = cur.execute('SELECT name FROM sqlite_master WHERE type="table"').fetchall()
tables = [t[0] for t in tables]
for table in tables:
    columns = cur.execute('PRAGMA table_info(%s)' % table ).fetchall() # Note: pragma arguments can't be parameterized.
    for column_id, column_name, column_type, nullable, default_value, primary_key in columns:
        if ('char' in column_type) or ('text' in column_type) or ('clob' in column_type):
            # Table names and column names can't be parameterized either.
            db.execute('UPDATE "{0}" SET "{1}" = FIXENCODING(CAST("{1}" AS BLOB))'.format(table, column_name))

After this script runs, all *text*, *char*, and *clob* fields are in UTF-8 and no more Unicode decoding errors will occur. I can now Faerûn to my heart's content.




回答2:


If you have a connection URI then you can add the following options to your DB connection URI:

DB_CONNECTION = mysql+pymysql://{username}:{password}@{host}/{db_name}?{options}
DB_OPTIONS = {
    "charset": "cp-1252",
    "use_unicode": 1,
}
connection_uri = DB_CONNECTION.format(
    username=???,
    ...,
    options=urllib.urlencode(DB_OPTIONS)        
)

Assuming your SQLLite driver can handle those options (pymysql can, but I don't know 100% about sqllite), then your queries will return unicode strings.



来源:https://stackoverflow.com/questions/29035115/sqlalchemy-dealing-with-cp-1252-data-when-python-is-expecting-it-to-be-utf-8

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!