Extract the thread head and thread reply from a forum

Posted by 无人久伴 on 2020-01-03 04:46:04

Question


I want to extract only the thread title and the users' views and replies from a forum. With the code below, supplying a URL returns all of the text on the page. I only want the thread heading, which is defined in the title tag, and the user replies, which sit inside the div content tag. How can I extract just these parts, and how can I write the result to a .txt file?

package extract;

import java.io.*;

import org.jsoup.*;
import org.jsoup.nodes.*;

public class TestJsoup
{
    public void SimpleParse()
    {
        try
        {
            Document doc = Jsoup.connect("url").get();

            doc.body().wrap("<div></div>");
            doc.body().wrap("<pre></pre>");

            String text = doc.text();

            // Converting nbsp entities
            text = text.replaceAll("\u00A0", " ");

            System.out.print(text);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }

    public static void main(String args[])
    {
        TestJsoup tjs = new TestJsoup();
        tjs.SimpleParse();
    }
}

Answer 1:


Why do you wrap the body element in a div and a pre tag?

The title element can be selected like this:

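Document doc = Jsoup.connect("url").get();

Element titleElement = doc.select("title").first();
String titleText = titleElement.text();

// Or shorter (use this instead of the two lines above, not in addition,
// because titleText cannot be declared twice in the same scope):
// String titleText = doc.select("title").first().text();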

Div tags can be selected the same way:

// Document 'doc' as above

Elements divTags = doc.select("div");


for( Element element : divTags )
{
    // Do something here, e.g. print each element
    System.out.println(element);

    // Or get the Text of it
    String text = element.text();
}

The Jsoup Selector API documentation gives an overview of the full selector syntax; it will help you find any kind of element you need.
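Putting the two selections together, a minimal sketch might look like the following. It parses an inline HTML string instead of fetching a URL so that it runs without network access. The class name ThreadExtractor, its method names, and the selector "div.content" are assumptions made for illustration; the real forum markup may use a different class or tag for replies.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup.
class ThreadExtractor {

    // Returns the text of the page's <title> element.
    static String extractTitle(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    // Returns the text of every <div class="content"> element, one entry per reply.
    // Assumption: replies live in div elements with the class "content".
    static List<String> extractReplies(String html) {
        Document doc = Jsoup.parse(html);
        List<String> replies = new ArrayList<>();
        for (Element reply : doc.select("div.content")) {
            replies.add(reply.text());
        }
        return replies;
    }

    public static void main(String[] args) {
        // Made-up sample page standing in for a real forum thread.
        String html = "<html><head><title>Sample thread</title></head><body>"
                + "<div class=\"header\">Site navigation</div>"
                + "<div class=\"content\">First reply</div>"
                + "<div class=\"content\">Second reply</div>"
                + "</body></html>";

        System.out.println(extractTitle(html));
        for (String reply : extractReplies(html)) {
            System.out.println(reply);
        }
    }
}
```

For a live page, the string passed to Jsoup.parse would be replaced by the Document returned from Jsoup.connect("url").get(), as in the snippets above.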




Answer 2:


Well, I used different code and collected the data from these specific tags.

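Elements content = doc.getElementsByTag("blockquote");

// Assumption: the posts carry the classes "postcontent" and "restore";
// the original selector "[postcontent restore]" is attribute syntax, not a class selector.
Elements k = doc.select(".postcontent.restore");

// remove() detaches every matching element from the document tree
content.select("blockquote").remove();
content.select("br").remove();
content.select("div").remove();
content.select("a").remove();
content.select("b").remove();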
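The question also asks how to write the extracted text to a .txt file. A minimal sketch using only the Java standard library could look like this. The class name ThreadFileWriter, the method writeThread, the file name thread.txt, and the sample strings are all hypothetical; in practice the title and replies would come from the jsoup selections.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration.
class ThreadFileWriter {

    // Writes the title on the first line, then a blank line, then one reply per line.
    // An existing file at the target path is overwritten.
    static void writeThread(Path target, String title, List<String> replies) throws IOException {
        List<String> lines = new ArrayList<>();
        lines.add(title);
        lines.add("");
        lines.addAll(replies);
        Files.write(target, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder data standing in for text extracted with jsoup.
        Path target = Paths.get("thread.txt");
        writeThread(target, "Sample thread", Arrays.asList("First reply", "Second reply"));
    }
}
```

Writing with an explicit UTF-8 charset avoids depending on the platform default encoding, which matters when forum posts contain non-ASCII characters.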



Source: https://stackoverflow.com/questions/13005872/extract-the-thread-head-and-thread-reply-from-a-forum
