I develop cross-platform C++ using Microsoft Visual Studio on Windows and GCC on Ubuntu Linux.
In Visual Studio I can use Unicode symbols like "π" and "²" in my code. Visual Studio always saves the source files as UTF-8 with a BOM (Byte Order Mark).
For example:
// A = π.r²
double π = 3.14;
GCC happily compiles these files only if I remove the BOM first. If I do not remove the BOM, I get errors like these:
wwga_hydutils.cpp:28:9: error: stray ‘\317’ in program
wwga_hydutils.cpp:28:9: error: stray ‘\200’ in program
Which brings me to the question:
Is there a way to get GCC to compile UTF-8 files without first removing the BOM?
I'm using:
- Windows 7
- Visual Studio 2010
and:
- Ubuntu Oneiric 11.10
- GCC 4.6.1 (as provided by apt-get install gcc)
Edit:
As the first commenter pointed out, my problem was not the BOM but the non-ASCII characters outside of comments and string constants. GCC does not accept non-ASCII characters in identifiers, but it turns out GCC is fully compatible with UTF-8 with BOM.
According to the GCC Wiki, UTF-8 characters in identifiers are not supported yet. As a workaround, you can compile with -fextended-identifiers and pre-process your code to convert the identifiers to UCNs (universal character names). From the linked page:
perl -pe 'BEGIN { binmode STDIN, ":utf8"; } s/(.)/ord($1) < 128 ? $1 : sprintf("\\U%08x", ord($1))/ge;'
See also "g++ unicode variable name" and "Unicode Identifiers and Source Code in C++11?"
While Unicode identifiers are supported in GCC, UTF-8 input is not. Therefore, Unicode identifiers have to be encoded using \uXXXX and \UXXXXXXXX escape codes. However, a simple one-line patch to the cpp preprocessor allows gcc and g++ to process UTF-8 input, provided a recent version of iconv that supports C99 conversions is also installed. Details are available at
https://www.raspberrypi.org/forums/viewtopic.php?p=802657
However, the patch is so simple it can be given right here.
diff -cNr gcc-5.2.0/libcpp/charset.c gcc-5.2.0-ejo/libcpp/charset.c
*** gcc-5.2.0/libcpp/charset.c Mon Jan 5 04:33:28 2015
--- gcc-5.2.0-ejo/libcpp/charset.c Wed Aug 12 14:34:23 2015
***************
*** 1711,1717 ****
    struct _cpp_strbuf to;
    unsigned char *buffer;

!   input_cset = init_iconv_desc (pfile, SOURCE_CHARSET, input_charset);
    if (input_cset.func == convert_no_conversion)
      {
        to.text = input;
--- 1711,1717 ----
    struct _cpp_strbuf to;
    unsigned char *buffer;

!   input_cset = init_iconv_desc (pfile, "C99", input_charset);
    if (input_cset.func == convert_no_conversion)
      {
        to.text = input;
Even with the patch, two command line options are needed to enable UTF-8 input. In particular, try something like
$ /usr/local/gcc-5.2/bin/gcc \
-finput-charset=UTF-8 -fextended-identifiers \
-o circle circle.c
Source: https://stackoverflow.com/questions/7899795/is-it-possible-to-get-gcc-to-compile-utf-8-with-bom-source-files