I have to parse a bunch of XML files in Java that sometimes -- and invalidly -- contain HTML entities such as &mdash;, &gt; and so forth. I
Another approach, since you're not using a rigid OXM approach anyway: you might want to try a more lenient parser such as jsoup. This gets you past the immediate problems with invalid XML, schemas and so on, but it only pushes the problem down into your own code, which then has to cope with whatever the parser let through.
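To make that concrete, here is a minimal sketch of the lenient route, assuming jsoup is on the classpath. The class name `LenientXml`, the helper `firstTitle` and the sample `<book>` document are made up for illustration; only `Jsoup.parse`, `Parser.xmlParser()`, `selectFirst` and `text()` are jsoup calls. How a named entity such as `&mdash;` comes out in the text may depend on the jsoup version, so check it against your own data.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

class LenientXml {
    // Hypothetical helper: text of the first <title> element, or null if there is none.
    static String firstTitle(String xml) {
        // Parser.xmlParser() asks jsoup to build a plain XML tree rather than
        // normalising the input into an HTML document with <html>/<body>.
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element title = doc.selectFirst("title");
        return title == null ? null : title.text();
    }

    public static void main(String[] args) {
        // &mdash; is not declared in this document, so a strict XML parser
        // would reject it; jsoup parses it without throwing.
        String xml = "<book><title>Java &mdash; parsing</title></book>";
        System.out.println(firstTitle(xml));
    }
}
```

Note that nothing here validates the input: a missing element just comes back as `null`, which is the sort of checking that now lives in your code instead of in the parser.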