Why does Apache Commons consider '१२३' numeric?

心已入冬 posted on 2019-11-28 16:36:15

Because the "CharSequence contains only Unicode digits" (quoting the documentation you linked).

All of the characters return true for Character.isDigit:

Some Unicode character ranges that contain digits:

  • '\u0030' through '\u0039', ISO-LATIN-1 digits ('0' through '9')
  • '\u0660' through '\u0669', Arabic-Indic digits
  • '\u06F0' through '\u06F9', Extended Arabic-Indic digits
  • '\u0966' through '\u096F', Devanagari digits
  • '\uFF10' through '\uFF19', Fullwidth digits

Many other character ranges contain digits as well.
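As a rough illustration of the point above, here is a minimal sketch of a per-character digit check. The class name DigitCheck and the helper allDigits are hypothetical and are not the Apache Commons implementation; the sketch only assumes that the check boils down to calling Character.isDigit on every character.

```java
final class DigitCheck {

    // Hypothetical helper (not the Commons source): true when s is
    // non-empty and every char is a Unicode digit per Character.isDigit.
    static boolean allDigits(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isDigit(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allDigits("123"));                // ISO-LATIN-1 digits
        System.out.println(allDigits("\u0967\u0968\u0969")); // Devanagari digits
        System.out.println(allDigits("\u0661\u0662\u0663")); // Arabic-Indic digits
        System.out.println(allDigits("\uFF11\uFF12\uFF13")); // Fullwidth digits
        System.out.println(allDigits("12a"));                // contains a letter
    }
}
```

The first four calls print true and the last prints false, because 'a' is not a digit in any script.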

१२३ are Devanagari digits.

ΦXocę 웃 Пepeúpa ツ

The symbol १२३ is the same as 123 in Nepali or any other language written in the Devanagari script, such as Hindi or Marathi, and is therefore a number for Apache Commons.

You can use Character#getType to check the character's general category:

System.out.println(Character.DECIMAL_DIGIT_NUMBER == Character.getType('१'));

This prints true, which is evidence that '१' is a decimal digit.

Now let's examine the Unicode value of the '१' character:

System.out.println(Integer.toHexString('१'));
// 967

This number is in the range of Devanagari digits, which is \u0966 through \u096F.
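The range check can also be written out by hand. The following is only a sketch: the class DevanagariRange and its methods are made-up names, and it relies on the fact that the ten Devanagari digits are contiguous, so the offset from \u0966 gives the numeric value.

```java
final class DevanagariRange {

    static final char FIRST = '\u0966'; // Devanagari digit zero
    static final char LAST = '\u096F';  // Devanagari digit nine

    // Hypothetical helper: is c inside the Devanagari digit range?
    static boolean isDevanagariDigit(char c) {
        return c >= FIRST && c <= LAST;
    }

    // Hypothetical helper: numeric value 0..9 of a Devanagari digit.
    static int digitValue(char c) {
        if (!isDevanagariDigit(c)) {
            throw new IllegalArgumentException("not a Devanagari digit: " + c);
        }
        return c - FIRST;
    }

    public static void main(String[] args) {
        char one = '\u0967'; // the character '१'
        System.out.println(isDevanagariDigit(one)); // true
        System.out.println(digitValue(one));        // 1
        System.out.println(isDevanagariDigit('1')); // false
    }
}
```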

Also try:

Character.UnicodeBlock block = Character.UnicodeBlock.of('१');
System.out.println(block.toString());
// DEVANAGARI

Devanagari is an abugida (alphasyllabary) script of India and Nepal.

"१२३" corresponds to "123" written with Basic Latin digits.
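If you need the Basic Latin form, a small conversion sketch follows. DigitNormalizer and toAsciiDigits are hypothetical names; the only JDK calls used are Character.digit and Integer.parseInt, both of which accept Unicode decimal digits.

```java
final class DigitNormalizer {

    // Hypothetical helper: replaces every Unicode decimal digit with
    // its '0'..'9' equivalent and leaves all other chars untouched.
    static String toAsciiDigits(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int d = Character.digit(c, 10); // -1 when c is not a digit
            sb.append(d >= 0 ? (char) ('0' + d) : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String devanagari = "\u0967\u0968\u0969"; // "१२३"
        System.out.println(toAsciiDigits(devanagari));    // 123
        System.out.println(Integer.parseInt(devanagari)); // 123
    }
}
```

Integer.parseInt works here without any conversion because it resolves each character through Character.digit.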


If you ever want to know what properties a particular "character" has (and there are quite a few), go directly to the source: Unicode.org. They have research tools that can show you most anything you would care to know.

KEEP IN MIND: The Unicode Consortium produces a specification, not software. This means that it is up to each software vendor to implement the specification as accurately as they can. So just like HTML, JavaScript, CSS, SQL, etc, there is variation between different platforms, languages, and so on. For example, I found a bug in Microsoft's .NET Framework whereby circled Latin letters A-Z and a-z -- Code Points 0x24B6 through 0x24E9 -- do not properly register as being char.IsLetter = true (bug report here). And that leads to unexpected behavior in related functionality, such as when calling the TextInfo.ToTitleCase() method (bug report here).

The symbols '१२३' come from Hindi (ultimately from Sanskrit, i.e. the Devanagari script) and represent numeric values, just like:

१ represents 1

२ represents 2

and likewise for the other digits.
