问题
I apologize for making a character encoding question since I know you folk get many everyday, but I couldn't figure out my problem so I asked anyway.
Here is what we are doing:
- Take Data from an Oracle DB using Python and
cx_Oracle
. - Write the data to a file using Python.
- Ingest the file into Postgres using Python and
psycopg2
.
Here are the important Oracle settings:
SQL> select * from NLS_DATABASE_PARAMETERS;
PARAMETER VALUE
------------------------------ ----------------------------------------
NLS_LANGUAGE AMERICAN
NLS_TERRITORY AMERICA
NLS_CURRENCY $
NLS_ISO_CURRENCY AMERICA
NLS_NUMERIC_CHARACTERS .,
NLS_CHARACTERSET US7ASCII
According to this NLS_LANG
faq, you are meant to set the NLS_LANG according to what your client OS is using.
Running locale
gives us: LANG=en_US.UTF-8
(all of the other fields were also en_US.UTF-8).
So, in our Python script, we set it like this:
os.environ["NLS_LANG"] = "AMERICAN_AMERICA.AL32UTF8"
Then we import the data and write it to a file.
row = cur.fetchall()
fil.write(row[0][0]) #For this test, I am only writing one row and one field.
We ingest that file into our UTF-8 Postgres DB.
Unfortunately, for some reason, we get this symbol: � in our file and the subsequent PG table as well. If my understanding is correct, this is the Replace Character. I believe that character is meant to show up if Unicode does not recognize a symbol.
(In some text editors, the symbol shows up as �
).
What I don't understand is why is this happening? I thought UTF-8 was backwards compatible with 7-bit ASCII?
And even if we are using regional pages, shouldn't it still work, since the client is using US and the Oracle server is using AMERICAN?
How can I check if the data is imported correctly and if it isn't correct, how can I fix it so future imports are?
Note: The Oracle field is a CHAR
field and not a NCHAR
field.
Note2: We are using Python 2.4, so we don't have the native Unicode stuff in Python 3.X. So, it is possible that Python is messing up somewhere though I thought cx_Oracle took care of it all.
Thanks for your time, I hope you have a good day.
回答1:
Unfortunately, for some reason, we get this symbol: � in our file and the subsequent PG table as well. If my understanding is correct, this is the Replace Character. I believe that character is meant to show up if Unicode does not recognize a symbol.
Mostly right but not quite. PostgreSQL will refuse to insert non-UTF8 text characters when using that encoding (do a search on StackOverflow for "Invalid UTF8 postgresql"). Most likely the character you are seeing is a valid UTF8 character that is not recognized by your font and therefore is showing the replacement character. If the symbol is in your Oracle db and is actually the replacement symbol there, then what do you want to replace it with? If that is the case, the information is already missing.
What I don't understand is why is this happening? I thought UTF-8 was backwards compatible with 7-bit ASCII?
It is.
How can I check if the data is imported correctly and if it isn't correct, how can I fix it so future imports are?
Most likely your problem is upstream of the Oracle db. I would find out what is actually inserting problem data into the Oracle db and fix it there. If you can check the data in Pg against the data in Oracle, you should be able to determine if the data is character for character the same (and flag any differences). That's how to check your current import.
Note2: We are using Python 2.4, so we don't have the native Unicode stuff in Python 3.X. So, it is possible that Python is messing up somewhere though I thought cx_Oracle took care of it all.
That's another possibility. Personally for file transformations I prefer Perl because of integrated regular expressions and absolutely top rate PostgreSQL support. However I recognize your import routine may not be readily convertable at this point. I am a little more familiar with troubleshooting UTF8 conversion issues in Perl than in Python. I do wonder however if you can check the data that is coming out in binary format for such symbols.
来源:https://stackoverflow.com/questions/15144833/importing-from-oracle-using-the-correct-encoding-with-python