tags
I want to extract the text inside the p and li tags of one or more HTML pages, so that I can tokenize each page and construct inverted index(es) for ea
Try this:
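import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class ExtractText {
    public static void main(String[] args) throws IOException {
        File input = new File("/home/s5/Downloads/PDFCopy/PDs.html");
        // The third argument is the base URI used to resolve relative links.
        Document doc = Jsoup.parse(input, "UTF-8", "http://www.cisco.com/c/en/us/products/collateral/wireless/aironet-1815-series-access-points/datasheet-c78-738481.pdf");
        // Select both <p> and <li> elements, as the question asks.
        Elements elements = doc.select("p, li");
        String text = elements.text();
        // Split on runs of non-word characters so punctuation does not yield empty tokens.
        String[] words = text.split("\\W+");
        for (String word : words) {
            if (!word.isEmpty()) System.out.println(word);
        }
    }
}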
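Once you have the words, building the inverted index could look roughly like the sketch below. It assumes the text of each page has already been extracted as shown above and is passed in as a plain string. The class name InvertedIndexSketch, the methods tokenize and buildIndex, and the document ids "doc1" and "doc2" are made up for illustration; they are not part of jsoup. Lowercasing the terms is also my assumption, so drop it if you need a case-sensitive index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexSketch {

    // Lowercase the text and split it on runs of non-word characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty()) {
                tokens.add(t); // skip the empty token produced by leading punctuation
            }
        }
        return tokens;
    }

    // Map each term to the sorted set of document ids that contain it.
    static Map<String, Set<String>> buildIndex(Map<String, String> docs) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            for (String term : tokenize(e.getValue())) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(e.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Hypothetical page texts; in practice these would come from elements.text().
        Map<String, String> docs = new TreeMap<>();
        docs.put("doc1", "Access points support dual-band Wi-Fi.");
        docs.put("doc2", "Wi-Fi access is optional.");
        buildIndex(docs).forEach((term, ids) -> System.out.println(term + " -> " + ids));
    }
}
```

A TreeMap keeps the terms sorted, which makes the printed index easy to inspect; a HashMap would work just as well if ordering does not matter.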