How to extract text between <p> tags

走了就别回头了 2021-01-11 18:18

I want to extract the text that sits inside p and li tags on HTML page(s), so I can start tokenizing the page to construct inverted index(es) for ea

3 answers
  •  独厮守ぢ
    2021-01-11 18:40

    This does the job:

    Elements e = doc.select("p");
    

    Here is a list of all selectors you can use.
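
    The question also asks about li elements. As a minimal sketch (the class name, method name and sample markup below are made up for illustration), a comma in a jsoup selector matches either tag, and eachText() returns one string per matched element:

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class ParagraphAndListText {

    // Hypothetical helper: returns the text of every <p> and <li> element, in document order.
    static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        // "p, li" is a selector group: it matches elements that are either <p> or <li>.
        Elements blocks = doc.select("p, li");
        // eachText() yields one normalized string per matched element that has text.
        return blocks.eachText();
    }

    public static void main(String[] args) {
        // Sample markup, assumed for this sketch only.
        String html = "<p>first paragraph</p><ul><li>item one</li><li>item two</li></ul>";
        for (String text : extract(html)) {
            System.out.println(text);
        }
    }
}
```

    Keeping one string per element may suit tokenizing better than one big string. Note that if a p sits inside an li, both elements match, so that text appears twice.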

    Suppose you have this html:

    String html = "<p>some <b>bold</b> text</p>";

    To get "some bold text" as the result, use:

    Document doc = Jsoup.parse(html);
    Element p = doc.select("p").first();
    String text = doc.body().text(); // some bold text
    

    or

    String text = p.text(); //some bold text
    

    Now suppose you have the following, more complex HTML:

    String html = "<div id='someid'><p>some text</p><span>some other text</span><p>another p tag</p></div>";

    To get the text of the two p tags, do something like this:

    Document doc = Jsoup.parse(html);
    Element content = doc.getElementById("someid");
    Elements p = content.getElementsByTag("p");

    // Append the text of each p element; the other text in the container is skipped.
    String pConcatenated = "";
    for (Element x : p) {
      pConcatenated += x.text();
    }

    // text() trims each element's text, so nothing separates the two paragraphs
    System.out.println(pConcatenated); // some textanother p tag
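
    Since text() trims each element's text, plain concatenation glues the last word of one paragraph to the first word of the next. A hedged alternative joins the paragraphs with a space so later tokenizing keeps the words apart. The id someid comes from the example; the class name, method name and sample markup are invented for this sketch:

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JoinParagraphs {

    // Hypothetical helper: joins the text of all <p> elements under the element with the given id.
    static String joinParagraphs(String html, String containerId) {
        Document doc = Jsoup.parse(html);
        Element content = doc.getElementById(containerId);
        if (content == null) {
            return ""; // no element with that id on this page
        }
        // One string per <p> that has text, in document order.
        List<String> paragraphs = content.select("p").eachText();
        // A single space between paragraphs keeps adjacent words separate.
        return String.join(" ", paragraphs);
    }

    public static void main(String[] args) {
        // Sample markup, assumed for this sketch only.
        String html = "<div id='someid'><p>some text</p>"
                + "<span>some other text</span><p>another p tag</p></div>";
        System.out.println(joinParagraphs(html, "someid"));
    }
}
```

    The null check matters when crawling many pages, because getElementById returns null when the id is absent.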
    

    You can find more info here also

    Hope this helped
