Using JavaScript, how can I count a mix of Asian characters and English words?

淺唱寂寞╮ submitted on 2019-12-05 07:30:30

Unfortunately, JavaScript's RegExp has traditionally had no support for Unicode character classes (engines that implement ES2018 add \p{...} property escapes under the u flag), and \w only matches ASCII characters (modulo some browser bugs).

You can use Unicode characters in character classes, though, so you can do it if you can isolate each set of characters you are interested in as a range, e.g.:

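var r = new RegExp(
    '[A-Za-z0-9_]+|' +                             // Runs of ASCII letters, digits, underscore (no accents)
    '[\u3040-\u309F]+|' +                          // Runs of Hiragana
    '[\u30A0-\u30FF]+|' +                          // Runs of Katakana
    '[\u4E00-\u9FFF\uF900-\uFAFF\u3400-\u4DBF]',   // Single CJK ideographs
'g');

// match() returns null when nothing matches, so fall back to an empty array.
var nwords = (str.match(r) || []).length;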

(This attempts to give a more realistic count of ‘words’ for Japanese, counting each run of one type of kana as a word. That's still not right, of course, but it's probably closer than treating each syllable as one word.)

Obviously there are many more characters that would have to be accounted for if you wanted to ‘do it properly’. Let's hope you don't have characters outside the Basic Multilingual Plane, for one!
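If your target engines implement ES2018, Unicode property escapes can probably stand in for the hand-written ranges and also cover ideographs outside the Basic Multilingual Plane. This is only a sketch of the same alternation: countWordsUnicode is a made-up name, and the choice of scripts is an assumption carried over from the ranges above.

```javascript
// Sketch only: assumes an engine with ES2018 Unicode property escapes
// (the `u` flag plus \p{Script=...}). countWordsUnicode is a hypothetical name.
function countWordsUnicode(str) {
  var r = new RegExp(
    '[A-Za-z0-9_]+|' +            // run of ASCII letters, digits, underscore
    '\\p{Script=Hiragana}+|' +    // run of Hiragana
    '\\p{Script=Katakana}+|' +    // run of Katakana
    '\\p{Script=Han}',            // single Han ideograph, in any plane
    'gu');
  var matches = str.match(r);     // null when nothing matches
  return matches ? matches.length : 0;
}
```

With this, countWordsUnicode('Hello 世界') should count one English word plus two ideographs. Characters whose script is Common rather than Hiragana or Katakana (the prolonged sound mark, for instance) will still split a kana run, so the usual caveats apply.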

You can iterate over each character in the text, examining each one to look for word breaks. The following example does this, counting each Chinese/Japanese/Korean (CJK) ideograph as a single word, and treating all alphanumeric strings as single words.

Some notes on my implementation:

  1. It probably doesn't handle accented characters correctly. They will probably trigger word breaks. You can modify the wordBreakRegEx to fix this.

  2. cjkRegEx doesn't include some of the more esoteric code point ranges, since they require 5 hex digits to reference and a plain \uXXXX escape only takes four (engines that support the ES2015 u flag also accept \u{XXXXX}). But you probably don't need to worry about these, since I don't think most fonts even include them.

  3. I deliberately left Japanese Hiragana and Katakana out of cjkRegEx, since I'm not sure how you want to handle these. Depending on the type of text you're dealing with, it might make more sense to treat strings of them as single words. In that case, you'd need to add logic to distinguish being in a "kana word" from being in an "alphanumeric word". If you don't care, then you just need to add their code point ranges to cjkRegEx. Of course, you could try to recognize word breaks within kana strings, but that quickly becomes Very Hard.
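On the "Very Hard" part of note 3: runtimes that ship Intl.Segmenter can hand word-break detection to the built-in segmenter instead of doing it by hand. The following is a rough sketch, not a drop-in replacement. countWordsSegmenter and the default 'ja' locale are my own illustrative choices, and the counts for Japanese depend on the engine's dictionary, so treat them as approximate.

```javascript
// Sketch only: assumes the runtime provides Intl.Segmenter.
// countWordsSegmenter and the default locale 'ja' are hypothetical choices.
function countWordsSegmenter(text, locale) {
  var segmenter = new Intl.Segmenter(locale || 'ja', { granularity: 'word' });
  var count = 0;
  for (var part of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments.
    if (part.isWordLike) {
      count += 1;
    }
  }
  return count;
}
```

For plain English input such as countWordsSegmenter('Hello, world!', 'en') this counts the two words and skips the punctuation. For kana and ideographs it will generally group characters into dictionary words rather than counting each one.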

Example implementation:

function getWordCount(text) {
  // This matches all CJK ideographs.
  var cjkRegEx = /[\u3400-\u4db5\u4e00-\u9fa5\uf900-\ufa2d]/;

  // This matches all characters that "break up" words.
  var wordBreakRegEx = /\W/;

  var wordCount = 0;
  var inWord = false;
  var length = text.length;
  for (var i = 0; i < length; i++) {
    var curChar = text.charAt(i);
    if (cjkRegEx.test(curChar)) {
      // Character is a CJK ideograph.
      // Count it as a word; if an alphanumeric word was in progress,
      // it ends here and is counted too (hence 2).
      wordCount += inWord ? 2 : 1;
      inWord = false;
    } else if (wordBreakRegEx.test(curChar)) {
      // Character is a "word-breaking" character.
      // If a word was started, increment the word count.
      if (inWord) {
        wordCount += 1;
        inWord = false;
      }
    } else {
      // All other characters are "word" characters.
      // Indicate that a word has begun.
      inWord = true;
    }
  }

  // If the text ended while in a word, make sure to count it.
  if (inWord) {
    wordCount += 1;
  }

  return wordCount;
}
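On note 2: one way around the four-hex-digit limit is to skip the regex for the ideograph check and compare code points numerically, walking the string with for...of so that surrogate pairs arrive as one character. This is a sketch under assumptions: the helper names are made up, and which blocks to include (here Extension B is added) is a guess about what you need.

```javascript
// Sketch only: isCjkIdeograph and getWordCountByCodePoint are hypothetical
// names, and the list of blocks is an assumption about which ones matter.
function isCjkIdeograph(codePoint) {
  return (codePoint >= 0x3400 && codePoint <= 0x4DBF) ||   // Extension A
         (codePoint >= 0x4E00 && codePoint <= 0x9FFF) ||   // Unified Ideographs
         (codePoint >= 0xF900 && codePoint <= 0xFAFF) ||   // Compatibility Ideographs
         (codePoint >= 0x20000 && codePoint <= 0x2A6DF);   // Extension B
}

function getWordCountByCodePoint(text) {
  var wordCount = 0;
  var inWord = false;
  // for...of steps through a string one code point at a time,
  // so a surrogate pair arrives as a single two-unit string.
  for (var ch of text) {
    if (isCjkIdeograph(ch.codePointAt(0))) {
      // The ideograph is a word; a word in progress ends and counts too.
      wordCount += inWord ? 2 : 1;
      inWord = false;
    } else if (/\W/.test(ch)) {
      // Word-breaking character: close any word in progress.
      if (inWord) {
        wordCount += 1;
        inWord = false;
      }
    } else {
      inWord = true;
    }
  }
  // Count a word that runs to the end of the text.
  return inWord ? wordCount + 1 : wordCount;
}
```

The counting logic is the same as in the example above. Only the character iteration and the ideograph test differ.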

The Unihan Database is very helpful for learning about CJK in Unicode. Also, of course, the Unicode home page has loads of info.

I think you want to loop over all characters, and increase a counter every time the current character is in a different word (according to your definition) than the previous one.
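A minimal sketch of that idea, assuming one particular definition of "word": runs of ASCII alphanumerics, runs of Hiragana, and runs of Katakana each count once, every ideograph counts on its own, and anything else is a break. The names charClass and countByTransitions are made up for illustration.

```javascript
// Sketch only: the classes below are one possible definition of "word";
// charClass and countByTransitions are hypothetical names.
function charClass(ch) {
  if (/[A-Za-z0-9_]/.test(ch)) return 'latin';
  if (/[\u3040-\u309F]/.test(ch)) return 'hiragana';
  if (/[\u30A0-\u30FF]/.test(ch)) return 'katakana';
  if (/[\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF]/.test(ch)) return 'han';
  return 'break';
}

function countByTransitions(text) {
  var count = 0;
  var prev = 'break';
  for (var i = 0; i < text.length; i++) {
    var cur = charClass(text.charAt(i));
    // A new word starts when the class changes, and on every ideograph.
    if (cur !== 'break' && (cur !== prev || cur === 'han')) {
      count += 1;
    }
    prev = cur;
  }
  return count;
}
```

Changing what counts as a word then only means editing charClass, for example folding 'hiragana' and 'katakana' into a single 'kana' class.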
