I have to parse a bunch of XML files in Java that sometimes -- and invalidly -- contain HTML entities such as &mdash;, &gt; and so forth.
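To make the problem concrete, here is a minimal sketch using only the JDK's built-in DOM parser. The class name StrictXmlCheck, the helper isWellFormed, and the sample strings are made up for illustration; they are not from the original files. It shows that a strict XML parser rejects an undeclared HTML entity like &mdash;, while the equivalent numeric reference is accepted.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
class StrictXmlCheck {

    // Returns true if the JDK's XML parser accepts the string as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Thrown for well-formedness errors such as an undeclared entity.
            // The parser's default error handler may also log the error to stderr.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // &mdash; is an HTML entity and is not declared in plain XML.
        System.out.println(isWellFormed("<bar>Some text &mdash; invalid!</bar>")); // false
        // The numeric character reference for the same character is fine.
        System.out.println(isWellFormed("<bar>Some text &#8212; valid</bar>"));    // true
    }
}
```

Only the five predefined XML entities (&lt;, &gt;, &amp;, &apos;, &quot;) work without a declaration, so anything else from the HTML set needs either a DTD or a more lenient parser.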
I would use a library like Jsoup for this purpose. I tested the code below and it works. Jsoup can be downloaded here: http://jsoup.org/download
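import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class Main {
    public static void main(String[] args) {
        // The <bar> wrapper is assumed from the select("bar") call below
        String html = "<bar>Some text &mdash; invalid!</bar>";
        // Parser.xmlParser() parses the input as XML instead of HTML
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        for (Element e : doc.select("bar")) {
            System.out.println(e);
        }
    }
}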
Result (line breaks may differ between Jsoup versions):
<bar>Some text — invalid!</bar>
Loading a document from a file is covered here:
http://jsoup.org/cookbook/input/load-document-from-file