I'm currently working on replacing HTML character codes with their equivalent characters in Java. I need to convert the codes below to characters:
&#xE8; - è
&#xAE; - ®
&#x26; - &
&#xF1; - ñ
& - &
I tried using the regex pattern
(&#x)([\\d|\\w]*)([\\d|\\w]*)([\\d|\\w]*)([\\d|\\w]*)(;)
When I debug, matcher.find() returns true, but control skips the loop where I have written the conversion code. I don't know what is happening there.
Also, is there any way to optimize this regex?
Any help is appreciated.
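For reference, a minimal sketch of a regex-based decoder for hexadecimal references. The class name HexEntityDecoder and the method decode are hypothetical, and this is only one way to write it. A single capturing group for the hex digits replaces the four repeated groups in the pattern above. Since the loop code is not shown, the cause of the skipped loop is a guess: each call to matcher.find() advances the matcher, so evaluating it in a debugger watch or in a separate if before the loop can consume the only match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HexEntityDecoder {
    // One group captures the hex digits between "&#x" and ";".
    private static final Pattern HEX_ENTITY = Pattern.compile("&#x([0-9A-Fa-f]+);");

    static String decode(String input) {
        Matcher m = HEX_ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        // find() moves to the next match on every call, so call it only here.
        while (m.find()) {
            int codePoint;
            try {
                codePoint = Integer.parseInt(m.group(1), 16); // radix 16 for hex digits
            } catch (NumberFormatException e) {
                codePoint = -1; // too many digits to fit in an int
            }
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group(); // leave out-of-range references untouched
            // quoteReplacement keeps '$' and '\' in the result literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("caf&#xE8; &#x26; ma&#xF1;ana &#xAE;"));
    }
}
```

This sketch handles only the &#x...; form; decimal references and named entities such as &amp; pass through unchanged.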
Exception
java.lang.NumberFormatException: For input string: "x26"
at java.lang.NumberFormatException.forInputString(Unknown Source)
at java.lang.Integer.parseInt(Unknown Source)
at java.lang.Integer.parseInt(Unknown Source)
at org.apache.commons.lang.Entities.unescape(Entities.java:683)
at org.apache.commons.lang.StringEscapeUtils.unescapeHtml(StringEscapeUtils.java:483)
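The trace shows Integer.parseInt receiving "x26", that is, the hexadecimal marker was handed to a base-10 parse. The sketch below reproduces that failure and shows one way to tell decimal and hexadecimal references apart. HexParseDemo and parseReference are hypothetical names for illustration, not Commons Lang code.

```java
class HexParseDemo {
    // Parses the text between "&#" and ";" of a numeric character reference,
    // e.g. "38" (decimal) or "x26" (hexadecimal).
    static int parseReference(String body) {
        if (body.startsWith("x") || body.startsWith("X")) {
            return Integer.parseInt(body.substring(1), 16);
        }
        return Integer.parseInt(body, 10);
    }

    public static void main(String[] args) {
        try {
            Integer.parseInt("x26"); // base 10: 'x' is not a digit
        } catch (NumberFormatException e) {
            System.out.println(e.getMessage()); // For input string: "x26"
        }
        System.out.println((char) parseReference("x26")); // &
        System.out.println((char) parseReference("38"));  // &
    }
}
```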
Also, is there any way to optimize this regex?
Yes: don't use a regex for this task. Use StringEscapeUtils from Apache Commons Lang:
import org.apache.commons.lang.StringEscapeUtils;
...
String withCharacters = StringEscapeUtils.unescapeHtml(yourString);
JavaDoc says:
Unescapes a string containing entity escapes to a string containing the actual Unicode characters corresponding to the escapes. Supports HTML 4.0 entities.
For example, the string "&lt;Fran&ccedil;ais&gt;" will become "<Français>".
If an entity is unrecognized, it is left alone, and inserted verbatim into the result string. e.g. "&gt;&zzzz;x" will become ">&zzzz;x".
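To make the documented behaviour concrete (known entities are replaced, unrecognized ones are left verbatim), here is a small standard-library sketch. It is not the Commons Lang implementation: TinyUnescaper, its four-entry table, and the unescape method are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TinyUnescaper {
    // Deliberately tiny table; the HTML 4.0 entity set is much larger.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("amp", "&");
        NAMED.put("ccedil", "\u00E7");
    }

    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    static String unescape(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String known = NAMED.get(m.group(1));
            // Unknown names fall back to the matched text, so they survive verbatim.
            String replacement = (known != null) ? known : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("&lt;Fran&ccedil;ais&gt;"));
        System.out.println(unescape("&gt;&zzzz;x"));
    }
}
```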
Another existing utility method is spring-web's org.springframework.web.util.HtmlUtils.htmlUnescape.
Example usage in a self-contained Groovy script:
@Grapes(
@Grab(group='org.springframework', module='spring-web', version='4.3.0.RELEASE')
)
import org.springframework.web.util.HtmlUtils
println HtmlUtils.htmlUnescape("La &eacute;lite del tenis no teme al zika y jugar&aacute; en R&iacute;o")
Source: https://stackoverflow.com/questions/14998726/replace-html-codes-with-equivalent-characters-in-java