jsoup

JSoup will not fetch all items?

白昼怎懂夜的黑 · Submitted on 2020-01-03 08:50:14
Question: I am trying to parse a simple list using Jsoup. Unfortunately, the program only returns entries up to those starting with "N". I do not know why this is the case. Here is my code:

```java
public ArrayList<String> initializeMangaNameList() {
    Document doc;
    try {
        doc = Jsoup.connect("http://www.mangahere.com/mangalist/").get();
        Elements items = doc.getElementsByClass("manga_info");
        ArrayList<String> names = new ArrayList<String>();
        for (Element item : items) {
            names.add(item.text(
```
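A likely culprit (not confirmed by the excerpt above) is Jsoup's default download cap: `Connection.maxBodySize` defaults to 1 MB, so a very long listing page can be silently truncated partway through the alphabet. A minimal sketch of that fix, with the extraction pulled into a testable helper:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayList;
import java.util.List;

public class MangaNames {
    // Collect the text of every element with class "manga_info".
    static List<String> extractNames(Document doc) {
        List<String> names = new ArrayList<>();
        for (Element item : doc.getElementsByClass("manga_info")) {
            names.add(item.text());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // maxBodySize(0) lifts the default 1 MB cap that otherwise truncates
        // very large pages mid-download.
        Document doc = Jsoup.connect("http://www.mangahere.com/mangalist/")
                .maxBodySize(0)
                .timeout(60000)
                .get();
        System.out.println(extractNames(doc).size() + " names fetched");
    }
}
```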

使用Jsoup解析HTML页面

≡放荡痞女 · Submitted on 2020-01-03 07:50:27
When writing Android apps you sometimes need to parse HTML pages, especially in apps that scrape data from websites, such as weather apps. A desktop application could use the powerful htmlparser library, but it throws errors on the Android platform. Another option is to extract the data with regular expressions, and yet another is plain string searching. This article instead uses Jsoup, an open-source parser. Jsoup accepts a URL, a file containing HTML, or an HTML string as its data source, and then finds and extracts data through the DOM or CSS selectors. Examples:

```java
// A URL as the input source
Document doc = Jsoup.connect("http://www.example.com").timeout(60000).get();
// A file as the input source
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.example.com/");
// A string as the input source
Document doc = Jsoup.parse(htmlStr);
```

Much like JavaScript, Jsoup provides functions such as: getElementById(String id) — get an element by its id
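As a runnable illustration of the lookup functions the article mentions, here is a sketch against an inline HTML string (the element id and class names are invented for the demo):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SelectorDemo {
    // Summarize a page using the three lookup styles the article lists.
    static String summarize(String html) {
        Document doc = Jsoup.parse(html);
        String city = doc.getElementById("city").text();   // by id
        Elements days = doc.getElementsByClass("day");     // by class
        Elements items = doc.select("ul > li");            // CSS selector
        return city + ", " + days.size() + " days, first: " + items.first().text();
    }

    public static void main(String[] args) {
        String html = "<div id='city'>Beijing</div>"
                    + "<ul><li class='day'>Mon</li><li class='day'>Tue</li></ul>";
        System.out.println(summarize(html));  // -> Beijing, 2 days, first: Mon
    }
}
```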

Jsoup -- 网络爬虫解析器

痞子三分冷 · Submitted on 2020-01-03 07:49:53
You need the jsoup-1.8.1.jar package. jsoup is a Java HTML parser that can directly parse a URL or HTML text. It provides a very convenient API for extracting and manipulating data through the DOM, CSS, and jQuery-like methods. It fetches and parses pages quickly and is recommended. Its main features:

1. Parse HTML from a URL, a file, or a string;
2. Find and extract data using DOM traversal or CSS selectors;
3. Manipulate HTML elements, attributes, and text.

Sample code:

```java
package cn.ysh.studio.crawler.jsoup;

import java.io.IOException;
import org.jsoup.Jsoup;

/**
 * Fetching page content with Jsoup
 * @author www.yshjava.cn
 */
public class JsoupTest {
    public static void main(String[] args) throws IOException {
        // Target page
        String url = "http://www.yshjava.cn";
        // Connect to the target page with Jsoup, execute the request,
        // and read the server's response body
        String html = Jsoup.connect(url).execute().body();
        // Print the page content
        System.out
```
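A small addition to the sample above: `execute().body()` returns the raw HTML as a string, which can then be parsed into a Document for selector-based extraction. A sketch (the URL is the sample's own; the helper name is invented):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class JsoupFetch {
    // Parse raw HTML (as returned by execute().body()) into a Document.
    static Document parseBody(String body, String baseUri) {
        return Jsoup.parse(body, baseUri);
    }

    public static void main(String[] args) throws IOException {
        String url = "http://www.yshjava.cn";
        // execute() exposes the raw response before parsing;
        // get() is a shortcut that fetches and parses in one step.
        String html = Jsoup.connect(url).timeout(60000).execute().body();
        Document doc = parseBody(html, url);
        System.out.println(doc.title());
    }
}
```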

Extract HTML Table ( span ) tags using Jsoup in Java

守給你的承諾、 · Submitted on 2020-01-03 05:24:09
Question: I am trying to extract the td name and the span class. In the sample code, I want to extract the a href within the first td "accessory" and the span tag in the second td. I want to print:

Mouse, is-present, yes
KeyBoard, No
Dual-Monitor, is-present, Yes

With the Java code below, I get: Mouse Yes Keyboard No Dual-Monitor Yes. How do I get the span class name?

HTML code:

```html
<tr>
    <td class="" width="1%" style="padding:0px;"> </td>
    <td class="">
        <a href="/accessory">Mouse</a>
    </td>
    <td class=
```
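The span's class name is available through `Element.attr("class")`. A sketch against a reconstructed row (the markup is assumed, since the question's HTML is cut off):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SpanClassDemo {
    // Return "linkText, spanClass, spanText" for one table row.
    // Assumes the row contains both a td > a and a td > span.
    static String describeRow(Element row) {
        Element link = row.selectFirst("td a");
        Element span = row.selectFirst("td span");
        return link.text() + ", " + span.attr("class") + ", " + span.text();
    }

    public static void main(String[] args) {
        String html = "<table><tr>"
                + "<td><a href='/accessory'>Mouse</a></td>"
                + "<td><span class='is-present'>Yes</span></td>"
                + "</tr></table>";
        Document doc = Jsoup.parse(html);
        System.out.println(describeRow(doc.selectFirst("tr")));
        // -> Mouse, is-present, Yes
    }
}
```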

How to parse a page with multiple tables

こ雲淡風輕ζ · Submitted on 2020-01-03 04:52:06
Question: Any idea how to scrape a web page with multiple tables? I am connecting to the web page. This is one table, but on the same web page there are multiple tables. I also can't figure out how to read the table... XML:

```html
<p><a href="/fantasy_news/feature/?ID=49818"><strong>Top 300 Overall Fantasy Rankings</strong></a></p>
<div class="storyStats">
<table>
<thead>
<tr>
<th>RANK</th>
<th>CENTRES</th>
<th>TEAM</th>
<th>POS</th>
<th>GP</th>
<th>G</th>
<th>A</th>
<th>PTS</th>
<th>+/-</th>
<th>PIM</th>
<th
```
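One way to handle several tables is to select them all and walk each row's cells. A sketch using the storyStats wrapper from the snippet (the column names come from the excerpt; the data row is invented):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayList;
import java.util.List;

public class MultiTableDemo {
    // Read every table under div.storyStats: one list of cell texts per row.
    static List<List<String>> readTables(Document doc) {
        List<List<String>> rows = new ArrayList<>();
        for (Element table : doc.select("div.storyStats table")) {
            for (Element tr : table.select("tr")) {
                List<String> cells = new ArrayList<>();
                for (Element cell : tr.select("th, td")) {
                    cells.add(cell.text());
                }
                rows.add(cells);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<div class='storyStats'><table>"
                + "<tr><th>RANK</th><th>TEAM</th></tr>"
                + "<tr><td>1</td><td>EDM</td></tr>"
                + "</table></div>";
        for (List<String> row : readTables(Jsoup.parse(html))) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```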

Extract the thread head and thread reply from a forum

无人久伴 · Submitted on 2020-01-03 04:46:04
Question: I want to extract only the views and replies of the user, and the title of the thread head, from a forum. With this code, when you supply a URL it returns everything. I want only the thread heading, which is defined in the title tag, and the user replies, which sit between the div content tags. Please explain how to extract them and how to print the result to a txt file.

```java
package extract;

import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.*;

public class TestJsoup {
    public void SimpleParse() {
        try {
            Document
```
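A sketch of one possible approach, assuming the heading lives in `<title>` and each reply in a `<div class="content">` as the question describes; the file name thread.txt is invented:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

public class ForumExtract {
    // Thread heading from the <title> tag.
    static String threadTitle(Document doc) {
        return doc.title();
    }

    // One string per reply div (class name assumed from the question).
    static List<String> replies(Document doc) {
        List<String> out = new ArrayList<>();
        for (Element div : doc.select("div.content")) {
            out.add(div.text());
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>Help with Jsoup</title></head><body>"
                + "<div class='content'>First reply</div>"
                + "<div class='content'>Second reply</div></body></html>";
        Document doc = Jsoup.parse(html);
        // Write the heading and each reply to a plain text file.
        try (PrintWriter out = new PrintWriter("thread.txt", "UTF-8")) {
            out.println(threadTitle(doc));
            for (String reply : replies(doc)) {
                out.println(reply);
            }
        }
    }
}
```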

Find most frequent words on a webpage (using Jsoup)?

无人久伴 · Submitted on 2020-01-02 23:14:44
Question: In my project I have to count the most frequent words in a Wikipedia article. I found Jsoup for parsing the HTML, but that still leaves the problem of word frequency. Is there a function in Jsoup that counts the frequency of words, or any other way to find which words are most frequent on a webpage using Jsoup? Thanks.

Answer 1: Yes, you can use Jsoup to get the text from the webpage, like this:

```java
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
String text = doc.body().text();
```
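Jsoup itself has no word-frequency function; once `body().text()` returns the plain text, the counting is ordinary Java. A sketch continuing the answer (the URL is the answer's own):

```java
import org.jsoup.Jsoup;
import java.util.HashMap;
import java.util.Map;

public class WordFreq {
    // Count case-insensitive word occurrences in a block of text.
    static Map<String, Integer> frequencies(String text) {
        Map<String, Integer> freq = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                freq.merge(word, 1, Integer::sum);
            }
        }
        return freq;
    }

    public static void main(String[] args) throws Exception {
        String text = Jsoup.connect("http://en.wikipedia.org/").get().body().text();
        // Print the ten most frequent words.
        frequencies(text).entrySet().stream()
                .sorted((a, b) -> b.getValue() - a.getValue())
                .limit(10)
                .forEach(e -> System.out.println(e.getKey() + " " + e.getValue()));
    }
}
```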

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503 (google scholar ban?)

别等时光非礼了梦想. · Submitted on 2020-01-02 13:56:13
Question: I am working on a crawler and have to extract data from 200-300 links on Google Scholar. I have a working parser that gets data from the pages (each page shows 1-10 people's profiles as results of my query; I extract the proper links, go to the next page, and repeat). While running my program I hit this error:

```
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503, URL=https://ipv4.google.com/sorry/IndexRedirect?continue=https://scholar.google.pl/citations%3Fmauthors
```
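A 503 behind a /sorry/ redirect usually means Google has rate-limited the crawler, so the main remedy is to space requests out; once a CAPTCHA is being served, backoff alone may not be enough. A sketch of pausing and retrying (the method names and delay values are invented):

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.IOException;

public class PoliteFetcher {
    // Delay before retry attempt n: 5 s, 10 s, 20 s, ... (exponential backoff).
    static long backoffDelay(int attempt) {
        return 5000L << (attempt - 1);
    }

    // Fetch a page, backing off and retrying while the server refuses us.
    static Document fetchWithBackoff(String url, int maxTries)
            throws IOException, InterruptedException {
        for (int attempt = 1; ; attempt++) {
            Connection.Response res = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0")  // look like a browser
                    .ignoreHttpErrors(true)    // inspect the 503 instead of throwing
                    .timeout(60000)
                    .execute();
            if (res.statusCode() == 200) {
                return res.parse();
            }
            if (attempt >= maxTries) {
                throw new IOException("giving up, last status " + res.statusCode());
            }
            Thread.sleep(backoffDelay(attempt));
        }
    }
}
```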

JSoup.clean() is not preserving relative URLs

浪尽此生 · Submitted on 2020-01-02 08:40:08
Question: I have tried:

```java
Whitelist.relaxed();
Whitelist.relaxed().preserveRelativeLinks(true);
Whitelist.relaxed().addProtocols("a", "href", "#", "/", "http", "https", "mailto", "ftp");
Whitelist.relaxed().addProtocols("a", "href", "#", "/", "http", "https", "mailto", "ftp").preserveRelativeLinks(true);
```

None of them work: when I try to clean a relative URL, like `<a href="/test.xhtml">test</a>`, the href attribute is removed (`<a>test</a>`). I am using JSoup 1.8.2. Any ideas?

Answer 1: The problem most likely stems from
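Jsoup.clean has an overload that also takes a base URI, and preserveRelativeLinks(true) only takes effect when that base URI is supplied; without it, relative hrefs fail the protocol check and are stripped. A sketch of the likely fix:

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanDemo {
    // preserveRelativeLinks(true) only works when clean() is also given a
    // base URI; otherwise relative hrefs are dropped by the protocol check.
    static String cleanKeepRelative(String html, String baseUri) {
        return Jsoup.clean(html, baseUri,
                Whitelist.relaxed().preserveRelativeLinks(true));
    }

    public static void main(String[] args) {
        String html = "<a href=\"/test.xhtml\">test</a>";
        System.out.println(cleanKeepRelative(html, "http://example.com/"));
    }
}
```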