jsoup | 易学教程

JSOUP parsing HTML get class inside class

Read more about JSOUP parsing HTML get class inside class

Question: I am developing an Android application using JSOUP for parsing HTML. I have the following HTML: <div class='wrapper'> <div style='margin:7px;'> <div class='box' style='height:595px'> <div class='boxtitlebox'> <div class='boxtitle'><h4>13 RECENT CHORDS</h4></div><div class='clear'></div> </div> <div class='listitem'><a href='http://www.chordfrenzy.com/chord/9742/ungu-apa-sih-maumu-kord-lirik-lagu'> <div class='subtitle'>Chord Ungu</div> <div class='title'>Apa Sih Maumu</div> </a></div> <div class=

Scraping XML with JSoup

Read more about Scraping XML with JSoup

Question: I'm trying to scrape an RSS feed located here. At the moment I'm just trying to wrap my head around JSoup, so the following code is merely a proof of concept (or an attempt at one, at least). public static void grabShakers(String url) throws IOException { doc = Jsoup.connect(url).get(); desc = doc.select("title"); links = doc.select("link"); price = doc.select("span.price"); } It grabs the title of each item perfectly. The output of each link is simply ten repeated closing link tags and it never

Extracting “hidden” HTML with Jsoup

Read more about Extracting “hidden” HTML with Jsoup

Question: I am trying to get at HTML data that does not appear in the source document but can be exposed, for example, by "inspect element" in Google Chrome. Example page: http://assignment.uspto.gov/#/search?q=9000000&sort=patAssignorEarliestExDate%20desc%2C%20id%20desc&synonyms=false There are a number of div elements containing assignment data for U.S. Patent No. 9,000,000 that appear below the line <script async="async" type="text/javascript" src="https://components.uspto.gov/js/ais/2-2-assignment

JSoup Searching for element

Read more about JSoup Searching for element

Question: I was wondering if someone could help me navigate an HTML page with jsoup. Probably the biggest issue I am having is using the .data() function. I am trying to pull the current weather that appears when you Google search "weather". Right now my code looks like: try{ Connection formPage = Jsoup.connect("https://www.google.com/search?q=weather&oq=weather&aqs=chrome..69i57j69i61j69i60j0l3.3806j0j7&sourceid=chrome&ie=UTF-8"); formPage.timeout(1000) .data("action", "wob_t") //.data("q", "Calgary") .method

File format for storing html parser rules

Read more about File format for storing html parser rules

Question: I'm using Jsoup to parse a page whose structure changes over time. For now the parsing config is written in Java, so I have to issue a new build each time the rules are modified. Is there some sort of JSON- or XML-based markup language I could use to store the parsing config in an external file? Answer 1: You can try Clojure; it can represent your config data and call Jsoup to do the parsing. Answer 2: Options include XPath and CSS selector syntax. The latter is supported by Jsoup. Source: https:/
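One way to act on Answer 2 is to keep the CSS selectors in a plain `.properties` file and load them at runtime, so a rule change needs no rebuild. The sketch below is only an illustration under stated assumptions: the class name `SelectorRules`, the field names `title` and `artist`, and the selectors themselves are hypothetical (the class names are borrowed from the chord listing snippet earlier on this page). It uses only the JDK; each loaded selector string would then be passed to Jsoup's `Document.select(...)`.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper: reads "field = CSS selector" pairs from .properties text.
final class SelectorRules {

    // Parses the text with java.util.Properties and returns field -> selector,
    // sorted by field name. Lines starting with '#' are treated as comments.
    static Map<String, String> load(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        Map<String, String> rules = new TreeMap<>();
        for (String field : props.stringPropertyNames()) {
            rules.put(field, props.getProperty(field).trim());
        }
        return rules;
    }

    public static void main(String[] args) {
        // In practice this text would come from an external file;
        // these rules are made up for illustration.
        String config = String.join("\n",
                "# field name = CSS selector",
                "title = div.listitem div.title",
                "artist = div.listitem div.subtitle");
        load(config).forEach((field, selector) ->
                System.out.println(field + " -> " + selector));
    }
}
```

A properties file is a deliberately simple choice here; JSON or XML would work just as well if the rules need nesting.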

java.nio.charset.UnsupportedCharsetException: X-MAC-ROMAN in Jsoup getting a webpage

Read more about java.nio.charset.UnsupportedCharsetException: X-MAC-ROMAN in Jsoup getting a webpage

Question: I have Document document = Jsoup.connect(link).get(); and sometimes, for some URLs, I get an exception: Exception in thread "main" java.nio.charset.UnsupportedCharsetException: X-MAC-ROMAN at java.nio.charset.Charset.forName(Unknown Source) at org.jsoup.helper.DataUtil.parseByteData(DataUtil.java:86) at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:469) at org.jsoup.helper.HttpConnection.get(HttpConnection.java:147) I have a catch block as: catch (IOException e1) I
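A note on the stack trace above: `UnsupportedCharsetException` is an unchecked exception (a subclass of `IllegalArgumentException`), so a `catch (IOException e1)` block will not catch it. One possible workaround, sketched here with JDK classes only, is to check the declared charset name yourself and fall back to a default. The class name `CharsetFallback` and the choice of UTF-8 as the fallback are assumptions for illustration, not something the original post states.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: maps a declared charset name to a usable Charset.
final class CharsetFallback {

    // Returns the named charset if this JVM supports it, otherwise UTF-8
    // (the fallback is an assumption; pick whatever suits your pages).
    static Charset resolve(String declaredName) {
        if (declaredName == null) {
            return StandardCharsets.UTF_8;
        }
        String name = declaredName.trim();
        try {
            if (Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalCharsetNameException e) {
            // The name contains illegal characters or is empty; use the fallback.
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        // Whether "X-MAC-ROMAN" resolves depends on the JVM in use.
        System.out.println(resolve("X-MAC-ROMAN"));
        System.out.println(resolve("ISO-8859-1"));
    }
}
```

With a charset resolved this way, the page bytes could be fetched separately and handed to one of Jsoup's `Jsoup.parse` overloads that accept an explicit charset name, instead of letting the connection guess.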

JSoup with userAgent prevent redirects

Read more about JSoup with userAgent prevent redirects

Question: I used JSoup for my web crawler: Connection con = Jsoup.connect("http://t.co/uySIPVNfgP"); Document doc = con.get(); String u = doc.baseUri(); The above gives the redirected URL as the base URI. But with a User Agent set as follows: con.userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6"); it does not follow the redirect. As far as I know, some websites do not allow their content to be crawled without a User Agent. How can I solve this? Answer 1:

HTML table id and class id

Read more about HTML table id and class id

Question: How can I find the table ID of the large table at the following URL: http://en.wikipedia.org/wiki/States_and_territories_of_India I was able to see the classes wikitable sortable jquery-tablesorter. This is the table that lists the states of India. I was able to confirm from Firebug that this table (wikitable sortable jquery-tablesorter) contains the list of states. How can I get the ID of that table? What is the CSS equivalent to get all the names in that table? I want to get only the
