Regular Expression Doesn't Work Properly With Turkish Characters

Submitted by 淺唱寂寞╮ on 2019-12-10 19:35:45

Question


I wrote a regex that should extract the following patterns:

  • "çççoookkk gggüüüzzzeeelll" (it means "vvveeerrryyy gggoooddd", written with the Turkish characters "ç" and "ü")
  • "ccccoookkk ggguuuzzzeeelll" (it means the same, but written with the English characters "c" and "u")

Here are the regular expressions I have tried:

  • "\b[çc]+o+k+\sg+[üu]+z+e+l+\b": this works with the English characters but not with the Turkish ones
  • "çok": finds "çok", but "ç+o+k+" does not work for "çççoookkk"; it finds only "çoookkk"
  • "güzel": finds "güzel", but "g+ü+z+e+l+" does not work for "gggüüüzzzeeelll"
  • "\b(c+o+k+)|(ç+o+k+)\s(g+u+z+e+l)|(g+ü+z+e+l+)\b": doesn't work properly
  • "[çc]ok\sg[uü]zel": I also tried this to match the plain "çok güzel" pattern, but it doesn't work either.

I think the problem might be in using regex operators with Turkish characters. I don't know how to solve this.

I am using http://www.myregextester.com to check if my regular expressions are correct.

I am using PHP to extract a specific pattern from tweets retrieved via the Twitter REST API.

Thanks,


Answer 1:


You have not specified what programming language you are using, but in many of them the \b word-boundary assertion only understands plain ASCII characters.

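Internally, \b is processed as a boundary between the \w and \W sets.
In an ASCII-only engine, \w is in turn equal to [a-zA-Z0-9_], so letters such as "ç" and "ü" count as non-word characters, and no boundary is detected between a space and a leading "ç".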
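The difference can be reproduced with a small sketch. It uses Python rather than the asker's PHP, purely for illustration: Python's `re.ASCII` flag restricts \w and \b to ASCII, which mimics an engine without Unicode support, while the default for text patterns is Unicode-aware. The sample sentence is made up.

```python
import re

# The asker's first pattern, unchanged.
PATTERN = r"\b[çc]+o+k+\sg+[üu]+z+e+l+\b"

turkish = "bu çççoookkk gggüüüzzzeeelll oldu"   # made-up sample text
english = "bu ccccoookkk ggguuuzzzeeelll oldu"  # same text with "c" and "u"

# ASCII-only semantics: "ç" is a non-word character, so there is no \b
# between the space and the first "ç", and the Turkish text cannot match.
ascii_turkish = re.search(PATTERN, turkish, flags=re.ASCII)
ascii_english = re.search(PATTERN, english, flags=re.ASCII)

# Unicode semantics (Python 3 default for str patterns): "ç" is a word
# character, so \b is found where expected.
unicode_turkish = re.search(PATTERN, turkish)

# ascii_turkish is None
# ascii_english.group() is "ccccoookkk ggguuuzzzeeelll"
# unicode_turkish.group() is "çççoookkk gggüüüzzzeeelll"
```

This matches the reported symptom that the pattern "works in English but not in Turkish characters". In PHP, the analogous switch is the `u` pattern modifier, which makes the preg functions treat pattern and subject as UTF-8; whether it also makes \b Unicode-aware can depend on how the underlying PCRE library was built, so that part should be checked rather than assumed.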

If you are not using any fancy space marks (you shouldn't), then consider delimiting the match with the regular whitespace character class (\s) instead of \b.
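One possible reading of this suggestion is to replace each \b with a lookaround that checks for whitespace or the edge of the string. The sketch below is in Python for illustration; the helper name `find_phrases` is hypothetical, and `re.ASCII` is set only to show that the approach does not rely on Unicode-aware \b.

```python
import re

# (?<!\S) means "not preceded by a non-whitespace character" and
# (?!\S) means "not followed by one", so both also succeed at string edges.
PATTERN = re.compile(r"(?<!\S)[çc]+o+k+\sg+[üu]+z+e+l+(?!\S)", flags=re.ASCII)

def find_phrases(text):
    """Return every whitespace-delimited 'çok güzel' variant found in text."""
    return [m.group() for m in PATTERN.finditer(text)]

found = find_phrases("çççoookkk gggüüüzzzeeelll ve ccccoookkk ggguuuzzzeeelll")
# found contains both the Turkish and the English spelling.

missed = find_phrases("çok güzel!")
# missed is empty, because "!" directly follows the phrase.
```

The trade-off is visible in the second call: a phrase that touches punctuation is rejected, which may or may not be acceptable for tweets.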

See this table (scroll down to Word Boundaries section) to check if your language supports Unicode for \b. If it says, "ascii", then it does not.

As a side note, depending on your programming language, you may consider using direct Unicode code points instead of national characters.
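As a rough illustration in Python (not the asker's PHP), the pattern below spells "ç" and "ü" as the code points U+00E7 and U+00FC. The second half of the sketch shows a likely explanation for the "çoookkk" result in the question, under the assumption that the pattern was applied to raw UTF-8 bytes: "ç" is then two bytes, and "+" repeats only the second one.

```python
import re

# U+00E7 is "ç" and U+00FC is "ü". Written as escapes, the pattern source
# itself is pure ASCII, so its meaning does not depend on file encoding.
CODEPOINT_PATTERN = re.compile(r"[\u00e7c]+o+k+\sg+[\u00fcu]+z+e+l+")
codepoint_match = CODEPOINT_PATTERN.search("çççoookkk gggüüüzzzeeelll")

# Matching decoded text: "+" repeats the whole character "ç".
str_result = re.search("ç+o+k+", "çççoookkk").group()

# Matching raw UTF-8 bytes: "ç" is the byte pair C3 A7, so the pattern
# becomes C3, then A7 repeated, and only the last "ç" can start a match.
byte_pattern = "ç+o+k+".encode("utf-8")
byte_subject = "çççoookkk".encode("utf-8")
byte_result = re.search(byte_pattern, byte_subject).group().decode("utf-8")

# codepoint_match.group() is "çççoookkk gggüüüzzzeeelll"
# str_result is "çççoookkk"
# byte_result is "çoookkk"
```

If the same byte-level matching is what happens in the PHP code or in the online tester, adding the `u` pattern modifier in PHP (for example `/ç+o+k+/u`) should make the quantifier apply to the whole character.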

See also: utf-8 word boundary regex in javascript

Further reading:

  • An excellent article about using Unicode characters in regular expressions
  • An article for word boundaries
  • List of Turkish Unicode code points


Source: https://stackoverflow.com/questions/16579113/regular-expression-doesnt-work-properly-with-turkish-characters
