Using boilerpipe to extract non-English articles
I am trying to use the boilerpipe Java library to extract news articles from a set of websites. It works well for English text, but for text with special characters, such as words with accent marks (história), these characters are not extracted correctly. I think it is an encoding problem. The boilerpipe FAQ says "If you extract non-English text you might need to change some parameters" and then refers to a paper. I found no solution in that paper. My question is: are there any parameters in boilerpipe where I can specify the encoding? Is there any way to go around
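
One possible workaround, sketched here under assumptions, is to fetch the page bytes yourself and decode them with an explicit charset, so that the extractor receives an already-decoded String instead of guessing the encoding. The class and method names below (ExplicitCharsetDecoder, decode) are hypothetical helpers written for illustration and are not part of boilerpipe. The charset used in the usage note (ISO-8859-1) is also an assumption; the real value would have to come from the site's Content-Type header or its meta tag.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper (not part of boilerpipe): turns raw page bytes into a
// String using a charset chosen by the caller rather than a platform default.
final class ExplicitCharsetDecoder {

    private ExplicitCharsetDecoder() {
        // static helper, no instances
    }

    // Reads the stream to the end and decodes the bytes with the given charset.
    // An IOException is rethrown unchecked to keep call sites short.
    static String decode(InputStream in, Charset charset) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Assuming the pages are served as ISO-8859-1, the stream from the HTTP connection could be passed as `ExplicitCharsetDecoder.decode(stream, StandardCharsets.ISO_8859_1)`. The resulting String could then be handed to the extractor, for example through `ArticleExtractor.INSTANCE.getText(html)`, which to my knowledge accepts the HTML as a String. Whether this fixes the accented characters depends on the mismatch really being in the decoding step.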