How to find whether a particular string has unicode characters (esp. Double Byte characters)

后端未结

关注

 6  891

我在风中等你 2020-12-08 13:55

To be more precise, I need to know whether (and if possible, how) I can find whether a given string has double byte characters or not. Basically, I need to open a pop-up to

6条回答

没有蜡笔的小新 (楼主)

2020-12-08 14:26

I have benchmarked the two functions in the top answers and thought I would share the results. Here is the test code I used:

const text1 = `The Chinese Wikipedia was established along with 12 other Wikipedias in May 2001. 中文維基百科的副標題是「海納百川，有容乃大」，這是中国的清朝政治家林则徐（1785年－1850年）於1839年為`;

const regex = /[^\u0000-\u00ff]/; // Small performance gain from pre-compiling the regex
function containsNonLatinCodepoints(s) {
    return regex.test(s);
}

function isDoubleByte(str) {
    for (var i = 0, n = str.length; i < n; i++) {
        if (str.charCodeAt( i ) > 255) { return true; }
    }
    return false;
}

function benchmark(fn, str) {
    let startTime = new Date();
    for (let i = 0; i < 10000000; i++) {
        fn(str);
    }   
    let endTime = new Date();

    return endTime.getTime() - startTime.getTime();
}

console.info('isDoubleByte => ' + benchmark(isDoubleByte, text1));
console.info('containsNonLatinCodepoints => ' + benchmark(containsNonLatinCodepoints, text1));

When running this I got:

isDoubleByte => 2421
containsNonLatinCodepoints => 868

So for this particular string the regex solution is about 3 times faster.

However note that for a string where the first character is unicode, isDoubleByte() returns right away and so is much faster than the regex (which still has the overhead of the regular expression).

For instance for the string 中国, I got these results:

isDoubleByte => 51
containsNonLatinCodepoints => 288

To get the best of both world, it's probably better to combine both:

var regex = /[^\u0000-\u00ff]/; // Small performance gain from pre-compiling the regex
function containsDoubleByte(str) {
    if (!str.length) return false;
    if (str.charCodeAt(0) > 255) return true;
    return regex.test(str);
}

In that case, if the first character is Chinese (which is likely if the whole text is Chinese), the function will be fast and return right away. If not, it will run the regex, which is still faster than checking each character individually.

0 讨论(0)

查看其它6个回答