Question
Possible Duplicate:
Java: How to decode HTML character entities in Java like HttpUtility.HtmlDecode?
I need to extract paragraphs (like the title in StackOverflow) from an HTML file.
I can use regular expressions in Java to extract the fields I need, but I then have to decode the fields obtained.
EXAMPLE
field extracted:
Paging Lucene&#39;s search results (the apostrophe arrives as an entity code such as &#39;, ending in **;**, between **Lucene** and **s**)
field after decoding:
Paging Lucene's search results
Is there any class in java that will allow me to convert these html codes?
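Answer 1:
Use the methods provided by Apache Commons Lang. The import below is the Commons Lang 2.x package; in Commons Lang 3 the equivalent is StringEscapeUtils.unescapeHtml4 under org.apache.commons.lang3.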
import org.apache.commons.lang.StringEscapeUtils;
// ...
String afterDecoding = StringEscapeUtils.unescapeHtml(beforeDecoding);
Answer 2:
Do not try to solve everything by regexp.
While you can do some parts with it, such as replacing entities, the much better approach is to use a robust HTML parser.
See the question "RegEx match open tags except XHTML self-contained tags" for why the regexp swiss army chainsaw is a bad tool for this job. Seriously, read that question and its top answer; it is a Stack Overflow highlight!
Chuck Norris can parse HTML with regex.
The bad news is: there is more than one way to encode characters.
https://en.wikipedia.org/wiki/Character_encodings_in_HTML
For example, the character 'λ' can be represented as &lambda;, &#955; or &#x3BB;
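To make the point concrete, here is a minimal sketch that decodes only the numeric forms, using nothing but the JDK. It is an illustration, not a replacement for a library: the class and method names are made up for this example, and named references such as &lambda; are deliberately left untouched because they need a lookup table. λ is code point U+03BB, which is 955 in decimal, so the decimal and hexadecimal references should decode to the same character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; prefer a library in real code.
public class NumericEntityDemo {
    // Matches &#NNN; (decimal) or &#xHHH; (hexadecimal). Digit counts are
    // capped so that Integer.parseInt cannot overflow.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decodeNumericReferences(String input) {
        Matcher m = NUMERIC_REF.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one.
            out.append(input, last, m.start());
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            if (Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                // Out-of-range reference: keep it verbatim.
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeNumericReferences("&#955; and &#x3BB;"));
        System.out.println(decodeNumericReferences("Paging Lucene&#39;s search results"));
    }
}
```

Even this small sketch has to deal with two notations, range checks, and a case it cannot handle at all (named references), which is the argument for delegating the job.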
And if you are really unlucky, some web site relies on a browser's ability to guess what a character is meant to be. &#153;, for example, is not valid, yet many browsers will interpret it as ™.
Clearly it is a good idea to leave this to a dedicated library instead of trying to hack a custom regular expression yourself.
So I strongly recommend:
- Feed string into a robust HTML parser
- Get parsed (and fully decoded) string back
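The two steps above can be sketched with the HTML parser that ships with the JDK (the Swing classes in javax.swing.text.html). This is an assumption made for the sake of a dependency-free example: the answer does not name a parser, the Swing parser targets an old HTML dialect, and a dedicated library may well be the better choice. The class name HtmlTextExtractor and the method extractText are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: feeds HTML to the JDK's bundled parser and collects
// the text it reports, with character references already decoded.
public class HtmlTextExtractor {

    static String extractText(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called for each run of text between tags.
                text.append(data);
            }
        };
        try {
            // The last argument tells the parser to ignore any charset
            // declaration in the document; the input is already a String.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<h1>Paging Lucene&#39;s search results</h1>"));
    }
}
```

The calling code never sees an entity: it hands over markup and gets back plain text, so the rules about named, decimal, and hexadecimal references stay inside the parser.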
Answer 3:
Neko HTML does a lot of useful transformations on HTML and "HTML Text Parser: Converting HTML to Text in Java using NekoHTML" explains how to use it specifically to extract the textual content.
Source: https://stackoverflow.com/questions/13750290/how-to-decode-html-codes-using-java