How to decode HTML codes using Java? [duplicate]

Submitted by 让人想犯罪 on 2019-12-08 23:56:49

Question


Possible Duplicate:
Java: How to decode HTML character entities in Java like HttpUtility.HtmlDecode?

I need to extract paragraphs (like the title on Stack Overflow) from an HTML file.

I can use regular expressions in Java to extract the fields I need, but I then have to decode the extracted fields.

EXAMPLE

field extracted:

Paging Lucene&#39;s search results

field after decoding:

Paging Lucene's search results

Is there any class in Java that will let me decode these HTML codes?


Answer 1:


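Use the methods provided by Apache Commons Lang:

// Commons Lang 2.x package. In Commons Lang 3 the class lives in
// org.apache.commons.lang3 and the method is named unescapeHtml4.
import org.apache.commons.lang.StringEscapeUtils;
// ...
String afterDecoding = StringEscapeUtils.unescapeHtml(beforeDecoding);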



Answer 2:


Do not try to solve everything with regexps.

While you can handle some parts that way, such as replacing entities, the much better approach is to use a robust HTML parser.

See the question "RegEx match open tags except XHTML self-contained tags" for why this is a bad job for the regexp Swiss Army chainsaw. Seriously, read that question and its top answer; it is a Stack Overflow highlight!

Chuck Norris can parse HTML with regex.

The bad news is: there is more than one way to encode characters.

https://en.wikipedia.org/wiki/Character_encodings_in_HTML

For example, the character 'λ' can be represented as &lambda;, &#955; or &#x3BB;.

And if you are really unlucky, some web site relies on browsers' ability to guess what a character means. &#153;, for example, is not valid, yet many browsers will interpret it as ™.
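To make the point about multiple encodings concrete, here is a minimal sketch that decodes only numeric character references, such as &#955; (decimal) and &#x3BB; (hexadecimal). The class NumericReferenceDecoder and its decode method are hypothetical names invented for this illustration, not part of any library. The sketch deliberately ignores named entities like &lambda;, references without a trailing semicolon, and the browser-style reinterpretation of references like &#153;, which is exactly the kind of gap a dedicated library covers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration: decodes numeric character references only. */
final class NumericReferenceDecoder {
    // Matches &#NNN; (decimal) or &#xHHH; (hexadecimal). The semicolon is required.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decode(String input) {
        Matcher m = NUMERIC_REF.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one unchanged.
            out.append(input, last, m.start());
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            if (Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                out.append(m.group()); // leave out-of-range references untouched
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Paging Lucene's search results
        System.out.println(decode("Paging Lucene&#39;s search results"));
        // The named form is left as-is; both numeric forms become the same character.
        System.out.println(decode("&lambda; &#955; &#x3BB;"));
    }
}
```

Even this narrow case needs care with number bases and invalid code points, and it still leaves &lambda; undecoded.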

Clearly it is a good idea to leave this to a dedicated library instead of trying to hack a custom regular expression yourself.

So I strongly recommend:

  • Feed string into a robust HTML parser
  • Get parsed (and fully decoded) string back
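The two steps above can be sketched with the HTML parser that ships with the JDK (javax.swing.text.html), chosen here only because it needs no extra dependency. This is an assumption for illustration, not the parser the answer has in mind: the Swing parser is dated, and a maintained third-party parser would likely be a better fit for real pages. HtmlTextExtractor and extractText are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper for illustration: collects the decoded text nodes of an HTML string. */
final class HtmlTextExtractor {
    static String extractText(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // The parser hands over text with character references already decoded.
                text.append(data);
            }
        };
        try {
            // Step 1: feed the string into the parser (true = ignore charset declarations).
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Step 2: return the parsed, decoded text.
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<p>Paging Lucene&#39;s search results</p>"));
    }
}
```

Text from separate elements is concatenated without separators in this sketch, so a real extractor would also override the tag callbacks to insert whitespace or to select only the elements of interest.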



Answer 3:


Neko HTML does a lot of useful transformations on HTML and "HTML Text Parser: Converting HTML to Text in Java using NekoHTML" explains how to use it specifically to extract the textual content.



Source: https://stackoverflow.com/questions/13750290/how-to-decode-html-codes-using-java
