I need to convert large UTF-8 strings into ASCII. It should be reversible, and ideally a quick/lightweight algorithm.
How can I do this? I need the source
Any UTF-8 string that is reversibly convertible to ASCII is already ASCII.
UTF-8 can represent any Unicode character; ASCII cannot.
If the string is encoded as UTF-8, it's not a string any more. It's binary data, and if you want to represent the binary data as ASCII, you have to format it into a string that can be represented using the limited ASCII character set.
One way is to use base-64 encoding (example in C#):
using System;       // Convert
using System.Text;  // Encoding

string original = "asdf";
// encode the string into UTF-8 data:
byte[] encodedUtf8 = Encoding.UTF8.GetBytes(original);
// format the data into base-64:
string base64 = Convert.ToBase64String(encodedUtf8);
If you want the string encoded as ASCII data:
// encode the base-64 string into ASCII data:
byte[] encodedAscii = Encoding.ASCII.GetBytes(base64);
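Since the other answers in this thread use JavaScript, here is a rough sketch of the same base-64 round trip in that language. The function names `utf8ToBase64` and `base64ToUtf8` are made up for illustration, and the code assumes an environment that provides `TextEncoder`, `TextDecoder`, `btoa` and `atob` (current browsers and recent Node.js versions do).

```javascript
// Encode a JavaScript string as UTF-8 bytes, then as base-64 (pure ASCII).
function utf8ToBase64(str) {
  const bytes = new TextEncoder().encode(str);
  let binary = "";
  // Build the "binary string" in a loop; spreading a large array into
  // String.fromCharCode can exceed the argument limit on big inputs.
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Reverse the steps: base-64 -> bytes -> UTF-8 decoded string.
function base64ToUtf8(b64) {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new TextDecoder().decode(bytes);
}
```

Base-64 output is roughly a third larger than the UTF-8 input, which may matter for very large strings.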
It is impossible to convert a UTF-8 string into ASCII, but it is possible to encode Unicode as an ASCII-compatible string.
You probably want to use Punycode, an existing standard that encodes Unicode characters into ASCII. For JavaScript code, check this question
Please edit your question title and description to prevent others from down-voting it: do not use the term "conversion", use "encoding".
Here is a function that converts accented characters (à, é, è, î, etc.) to extended-ASCII codes. If there is an accent in the string, it is converted to an escape such as %130 (for é). Then, on the other side, I parse the string, so I know where there is an accent and which character it stands for.
I used it in a JavaScript program to send data to a microcontroller that works in ASCII.
var convertUtf8ToAscii = function (str) {
    var asciiStr = "";
    // Reference table: Unicode code point -> extended-ASCII code (values match code page 437)
    var refTable = {
        199: 128, 252: 129, 233: 130, 226: 131, 228: 132, 224: 133, 231: 135, 234: 136, 235: 137, 232: 138,
        239: 139, 238: 140, 236: 141, 196: 142, 201: 144, 244: 147, 246: 148, 242: 149, 251: 150, 249: 151
    };
    for (var i = 0; i < str.length; i++) {
        var ascii = refTable[str.charCodeAt(i)];
        if (ascii !== undefined) {
            // Accented character: emit "%" followed by its extended-ASCII code
            asciiStr += "%" + ascii;
        } else {
            // Everything else is copied through unchanged
            asciiStr += str[i];
        }
    }
    return asciiStr;
};
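The parsing step on the receiving side is not shown above. A minimal sketch of what it could look like in JavaScript follows; `convertAsciiToUtf8` is a hypothetical name, and the sketch assumes the original text never contains a literal `%` followed by one of the three-digit codes from the table, since the encoder does not escape `%` itself.

```javascript
// Same table as the encoder: Unicode code point -> extended-ASCII code.
const refTable = {
  199: 128, 252: 129, 233: 130, 226: 131, 228: 132, 224: 133, 231: 135, 234: 136, 235: 137, 232: 138,
  239: 139, 238: 140, 236: 141, 196: 142, 201: 144, 244: 147, 246: 148, 242: 149, 251: 150, 249: 151
};

// Invert the table: extended-ASCII code -> Unicode code point.
const inverseTable = {};
for (const [unicode, ext] of Object.entries(refTable)) {
  inverseTable[ext] = Number(unicode);
}

// Hypothetical decoder: turns "%NNN" escapes back into accented characters.
function convertAsciiToUtf8(asciiStr) {
  return asciiStr.replace(/%(\d{3})/g, (match, code) => {
    const unicode = inverseTable[code];
    // Codes that are not in the table are left exactly as they were.
    return unicode === undefined ? match : String.fromCharCode(unicode);
  });
}
```

Characters outside the table pass through the encoder unchanged, so this scheme only produces ASCII output when the input is limited to ASCII plus the listed accented letters.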
Do you want to strip all non-ASCII characters (or replace them with '?', etc.), or to store Unicode code points in a non-Unicode system?
The first can be done in a loop that checks for character codes above 127 and replaces them.
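As a rough illustration of that first option, the loop could be sketched in JavaScript like this. `replaceNonAscii` is a made-up name, and the result is lossy, so it does not meet the "reversible" requirement from the question.

```javascript
// Hypothetical helper: replace every non-ASCII character with a placeholder.
function replaceNonAscii(str, replacement = "?") {
  let out = "";
  // for...of iterates by code point, so a character outside the BMP
  // (stored as a surrogate pair) yields one placeholder, not two.
  for (const ch of str) {
    out += ch.codePointAt(0) > 127 ? replacement : ch;
  }
  return out;
}
```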
If you don't want to use "any platform/framework/library", then you will need to write your own encoder. Otherwise I'd just use jQuery's .html().
An implementation of the quote() function might do what you want.
My version can be found here
You can use eval() to reverse the encoding:
var foo = 'Hägar';
var quotedFoo = quote(foo);
var unquotedFoo = eval(quotedFoo);
alert(foo === unquotedFoo);
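The linked implementation is not reproduced here, so the following is only a guess at what such a `quote()` function might look like: it produces a string literal in which every non-ASCII UTF-16 code unit is written as a `\uXXXX` escape. The `unquote` helper is a hypothetical addition that uses `JSON.parse` rather than `eval()`, which avoids executing arbitrary code if the quoted string comes from an untrusted source.

```javascript
// Sketch of a quote() function: returns a double-quoted, ASCII-only literal.
function quote(str) {
  // JSON.stringify takes care of quotes, backslashes and control characters;
  // the replace then escapes each remaining code unit above U+007F.
  return JSON.stringify(str).replace(/[\u0080-\uffff]/g, (ch) =>
    "\\u" + ch.charCodeAt(0).toString(16).padStart(4, "0")
  );
}

// Hypothetical inverse: the quoted form is valid JSON, so JSON.parse restores it.
function unquote(quoted) {
  return JSON.parse(quoted);
}
```

The output is at most six ASCII characters per UTF-16 code unit, and text that is mostly ASCII stays close to its original length.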