How to extract text between <p> tags

走了就别回头了 2021-01-11 18:18

I want to extract the text inside the p and li tags of HTML page(s), so that I can start tokenizing the pages to build inverted index(es) for ea

3 answers
  •  予麋鹿 (original poster)
    2021-01-11 18:46

    Try this:

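    import java.io.File;
    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;

    public class ExtractText { // class name is illustrative
        public static void main(String[] args) throws IOException {
            File input = new File("/home/s5/Downloads/PDFCopy/PDs.html");
            // The third argument is the base URI used to resolve relative links.
            Document doc = Jsoup.parse(input, "UTF-8", "http://www.cisco.com/c/en/us/products/collateral/wireless/aironet-1815-series-access-points/datasheet-c78-738481.pdf");
            // Select every <p> and <li> element, as the question asks.
            Elements elements = doc.select("p, li");
            String text = elements.text();
            // Split on runs of non-word characters and skip empty tokens.
            for (String word : text.split("\\W+")) {
                if (!word.isEmpty()) {
                    System.out.println(word);
                }
            }
        }
    }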
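
    Since the question mentions building an inverted index from the tokens, here is a minimal sketch of that next step. It uses only the Java standard library and assumes the text of each page has already been extracted into a string. The class name InvertedIndexSketch, the build method, and the sample sentences are all made up for illustration. Lowercasing the tokens is my own assumption, not something the question asks for.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper (not part of jsoup): maps each term to the IDs
// of the documents that contain it.
public class InvertedIndexSketch {

    // docs.get(i) is the already-extracted text of document i.
    public static Map<String, SortedSet<Integer>> build(List<String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // Split on runs of non-word characters.
            for (String token : docs.get(docId).split("\\W+")) {
                if (token.isEmpty()) {
                    continue; // a leading separator yields one empty token
                }
                // Assumption: terms are case-insensitive.
                String term = token.toLowerCase(Locale.ROOT);
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Sample sentences invented for this sketch.
        List<String> docs = List.of(
                "Access points support dual-band radios.",
                "Dual-band access is optional.");
        build(docs).forEach((term, ids) -> System.out.println(term + " -> " + ids));
    }
}
```

    A TreeMap keeps the terms sorted, which makes the printed index easy to read. A HashMap would also work if ordering does not matter.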
    
