tags
I want to extract the text inside the p and li tags of HTML page(s), so I can start tokenizing each page to construct inverted index(es) for ea
This can do the job:
Elements e = doc.select("p");
Here is a list of all selectors you can use.
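Since the question asks for both p and li, a selector group may be closer to what is needed. The following is only a minimal sketch: the class name, the extract helper, and the sample markup are made up for illustration, and it assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ParagraphAndListText {

    // Hypothetical helper: returns the text of every <p> and <li> element, in document order.
    static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        // "p, li" is a selector group: it matches elements that are either p or li.
        for (Element el : doc.select("p, li")) {
            // text() returns the element's own text plus the text of its children.
            texts.add(el.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Sample page invented for this sketch, not taken from the question.
        String html = "<p>first paragraph</p><ul><li>item one</li><li>item two</li></ul>";
        System.out.println(extract(html));
    }
}
```

One thing to watch for: if a p is nested inside an li, both elements match, so the same text is collected twice. In that case it may be simpler to select only the outer elements.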
Suppose you have this HTML:
String html = "<p>some <b>bold</b> text</p>";
To get "some bold text" as the result, you should use:
Document doc = Jsoup.parse(html);
Element p = doc.select("p").first();
String text = doc.body().text(); // some bold text
or
String text = p.text(); // some bold text
Suppose now you have the following, more complex HTML:
String html = "<div id=\"someid\"><p>some text</p>some other text<p>another p tag</p></div>";
To get the text from the two p tags, you have to do something like this:
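Document doc = Jsoup.parse(html);
Element content = doc.getElementById("someid");
Elements p = content.getElementsByTag("p");
StringBuilder pConcatenated = new StringBuilder();
for (Element x : p) {
    // text() trims each paragraph, so add a space to keep the words apart
    pConcatenated.append(x.text()).append(' ');
}
System.out.println(pConcatenated.toString().trim()); // some text another p tag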
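Once the text is extracted, the tokenizing and indexing step mentioned in the question could look roughly like this. It is a rough sketch using only the standard library: the class name, the build method, the page ids, and the crude "split on non-alphanumerics" tokenizer are all assumptions made for illustration, not part of jsoup.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexSketch {

    // Maps each lower-cased token to the sorted set of page ids whose text contains it.
    static Map<String, Set<String>> build(Map<String, String> textByPage) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> page : textByPage.entrySet()) {
            String lower = page.getValue().toLowerCase(Locale.ROOT);
            // Split on any run of characters that are neither letters nor digits.
            for (String token : lower.split("[^\\p{L}\\p{N}]+")) {
                if (token.isEmpty()) {
                    continue; // split() can yield a leading empty string
                }
                index.computeIfAbsent(token, t -> new TreeSet<>()).add(page.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Hypothetical page ids; the texts stand in for what p.text() would return.
        Map<String, String> textByPage = new LinkedHashMap<>();
        textByPage.put("page1", "some text another p tag");
        textByPage.put("page2", "Some bold text");
        System.out.println(build(textByPage));
    }
}
```

A real index would probably also want stop-word removal, stemming, and term positions, but the overall shape (token to set of pages) stays the same.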
You can find more info here also
Hope this helped