I have some documents that went through OCR conversion from PDF into HTML. Because of that, they wound up having lots of random unicode punctuation where the converter messe
Depending on which language you're using, the regular expression engine may or may not be Unicode-aware. If it is, it may or may not know the \p{} property tokens. If it does, your answer is in "Unicode Characters and Properties" in Jan Goyvaerts' regex tutorial.
You can use \p{Latin}, if supported, to detect everything that is (or, with the negated \P{Latin}, isn't) from a language written in the Latin script, which spans the Unicode Latin blocks.
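As an illustration of the "may or may not know" caveat, here is a minimal Python sketch. It assumes only the standard library: the built-in `re` module is one engine that currently rejects `\p{}` tokens, so the sketch probes for support and falls back to `unicodedata`. The helper names (`supports_property_tokens`, `is_latin_letter`, `non_latin_letters`) are hypothetical, and the name-based check is only a rough approximation of `\p{Latin}`, not the real script property.

```python
import re
import unicodedata


def supports_property_tokens():
    # Probe whether the stdlib `re` engine accepts \p{...} tokens.
    # At the time of writing it raises re.error ("bad escape \p").
    try:
        re.compile(r"\p{Latin}")
    except re.error:
        return False
    return True


def is_latin_letter(ch):
    # Heuristic stand-in for \p{Latin} (an assumption, not the real
    # Script property): a letter whose Unicode name mentions LATIN.
    is_letter = unicodedata.category(ch).startswith("L")
    return is_letter and "LATIN" in unicodedata.name(ch, "")


def non_latin_letters(text):
    # Collect the letters that the heuristic does not consider Latin.
    return [ch for ch in text if ch.isalpha() and not is_latin_letter(ch)]


if __name__ == "__main__":
    print(supports_property_tokens())
    print(non_latin_letters("café Привет"))
```

If installing a package is an option, the third-party `regex` module for Python does understand `\p{Latin}` and `\P{Latin}` directly, which avoids the heuristic altogether.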