jsoup

JSOUP parsing HTML get class inside class

这一生的挚爱 submitted on 2020-01-06 08:12:03
Question: I am developing an Android application that uses jsoup to parse HTML. I have the following HTML: <div class='wrapper'> <div style='margin:7px;'> <div class='box' style='height:595px'> <div class='boxtitlebox'> <div class='boxtitle'><h4>13 RECENT CHORDS</h4></div><div class='clear'></div> </div> <div class='listitem'><a href='http://www.chordfrenzy.com/chord/9742/ungu-apa-sih-maumu-kord-lirik-lagu'> <div class='subtitle'>Chord Ungu</div> <div class='title'>Apa Sih Maumu</div> </a></div> <div class=
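The excerpt is cut off before the actual question, so the following is only a sketch of one way to reach a class nested inside another class with jsoup's CSS selectors. It assumes the goal is to read the subtitle and title of every listitem inside the box; the class name NestedClassSelect and the method chordTitles are illustrative, not from the original post.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class NestedClassSelect {

    // Returns "subtitle - title" for every div.listitem found inside div.box.
    static List<String> chordTitles(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        // A space in a selector means "descendant": listitem anywhere below box.
        for (Element item : doc.select("div.box div.listitem")) {
            String subtitle = item.select("div.subtitle").text();
            String title = item.select("div.title").text();
            result.add(subtitle + " - " + title);
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened copy of the markup quoted in the question.
        String html = "<div class='box'>"
                + "<div class='listitem'>"
                + "<a href='http://www.chordfrenzy.com/chord/9742/ungu-apa-sih-maumu-kord-lirik-lagu'>"
                + "<div class='subtitle'>Chord Ungu</div>"
                + "<div class='title'>Apa Sih Maumu</div>"
                + "</a></div></div>";
        System.out.println(chordTitles(html));
    }
}
```

The link target can be read in the same loop with item.select("a").attr("href").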

Scraping XML with JSoup

亡梦爱人 submitted on 2020-01-06 07:59:09
Question: I'm trying to scrape an RSS feed located here. At the moment I'm just trying to wrap my head around jsoup, so the following code is merely a proof of concept (or an attempt at one, at least). public static void grabShakers(String url) throws IOException { doc = Jsoup.connect(url).get(); desc = doc.select("title"); links = doc.select("link"); price = doc.select("span.price"); } It grabs the title of each item perfectly. The output of each link is simply ten repeated closing link tags and it never
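A likely cause, stated here as an assumption because the excerpt is truncated: jsoup's default HTML parser treats <link> as an empty element, so the URL text ends up outside the tag. Parsing the feed with jsoup's XML parser keeps the text inside <link>. The sketch below uses a made-up two-item feed and the hypothetical names RssLinks and itemLinks.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

import java.util.ArrayList;
import java.util.List;

public class RssLinks {

    // Parses RSS as XML (not HTML) and returns the link text of each <item>.
    static List<String> itemLinks(String xml) {
        // Parser.xmlParser() disables HTML-specific rules such as void <link> tags.
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        List<String> links = new ArrayList<>();
        for (Element link : doc.select("item > link")) {
            links.add(link.text());
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample data invented for illustration; example.com is a placeholder.
        String feed = "<rss><channel>"
                + "<item><title>First</title><link>http://example.com/1</link></item>"
                + "<item><title>Second</title><link>http://example.com/2</link></item>"
                + "</channel></rss>";
        System.out.println(itemLinks(feed));
    }
}
```

For a live feed, the same parser can be set on the connection with Jsoup.connect(url).parser(Parser.xmlParser()).get().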

Extracting “hidden” HTML with Jsoup

感情迁移 submitted on 2020-01-06 07:31:40
Question: I am trying to get at HTML data that does not appear in the source document but can be exposed, for example, by "inspect element" in Google Chrome. Example page: http://assignment.uspto.gov/#/search?q=9000000&sort=patAssignorEarliestExDate%20desc%2C%20id%20desc&synonyms=false There are a number of div elements containing assignment data for U.S. Patent No. 9,000,000 that appear below the line <script async="async" type="text/javascript" src="https://components.uspto.gov/js/ais/2-2-assignment

JSoup Searching for element

爷,独闯天下 submitted on 2020-01-06 05:33:05
Question: I was wondering if someone could help me navigate an HTML page with jsoup. Probably the biggest issue I am having is using the .data() function. I am trying to pull the current weather shown when you search Google for "weather". Right now my code looks like: try{ Connection formPage = Jsoup.connect("https://www.google.com/search?q=weather&oq=weather&aqs=chrome..69i57j69i61j69i60j0l3.3806j0j7&sourceid=chrome&ie=UTF-8"); formPage.timeout(1000) .data("action", "wob_t") //.data("q", "Calgary") .method

File format for storing html parser rules

时光毁灭记忆、已成空白 submitted on 2020-01-06 04:22:04
Question: I'm using jsoup to parse a page whose structure changes over time. For now the parsing config is written in Java, so I have to issue a new build each time the rules are modified. Is there some sort of JSON- or XML-based markup language I could use to store the parsing config in an external file?
Answer 1: You can try Clojure: it can represent your config as data and call jsoup to do the parsing.
Answer 2: Options include XPath and CSS selector syntax. The latter is supported by jsoup.
Source: https:/
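One way to act on the CSS selector suggestion, sketched under assumptions: keep one selector per field in a plain .properties file and apply each with jsoup at runtime. The field names, the selectors, and the names SelectorConfig and extract are hypothetical; the answers do not prescribe a file format.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

public class SelectorConfig {

    // Applies every "field = css-selector" rule in configText to the HTML
    // and returns the matched text per field.
    static Map<String, String> extract(String html, String configText) {
        Properties rules = new Properties();
        try {
            // In real use this would be a FileReader on the external rules file.
            rules.load(new StringReader(configText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Document doc = Jsoup.parse(html);
        Map<String, String> values = new TreeMap<>();
        for (String field : rules.stringPropertyNames()) {
            values.put(field, doc.select(rules.getProperty(field)).text());
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical rules and page, for illustration only.
        String config = "title = h1.headline\nprice = span.price\n";
        String html = "<h1 class='headline'>Widget</h1><span class='price'>9.99</span>";
        System.out.println(extract(html, config));
    }
}
```

Changing a rule then means editing the file rather than rebuilding. Backslashes in a .properties value are escape characters, so selectors containing them would need doubling.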

java.nio.charset.UnsupportedCharsetException: X-MAC-ROMAN in Jsoup getting a webpage

久未见 submitted on 2020-01-06 04:04:30
Question: I have Document document = Jsoup.connect(link).get(); and sometimes, for some URLs, I get an exception: Exception in thread "main" java.nio.charset.UnsupportedCharsetException: X-MAC-ROMAN at java.nio.charset.Charset.forName(Unknown Source) at org.jsoup.helper.DataUtil.parseByteData(DataUtil.java:86) at org.jsoup.helper.HttpConnection$Response.parse(HttpConnection.java:469) at org.jsoup.helper.HttpConnection.get(HttpConnection.java:147) I have a catch block as: catch (IOException e1) I
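UnsupportedCharsetException extends IllegalArgumentException, so it is unchecked and a catch (IOException e1) block will not intercept it. Besides catching it explicitly, one option is to check the declared charset before decoding and fall back to a default. The helper below is a minimal JDK-only sketch; the names CharsetFallback and resolve are illustrative, and choosing UTF-8 as the fallback is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

public class CharsetFallback {

    // Returns the named charset if this JVM supports it, otherwise the fallback.
    static Charset resolve(String declaredName, Charset fallback) {
        if (declaredName == null) {
            return fallback;
        }
        try {
            return Charset.isSupported(declaredName)
                    ? Charset.forName(declaredName)
                    : fallback;
        } catch (IllegalCharsetNameException e) {
            // The name itself is malformed (for example, an empty string).
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Whether X-MAC-ROMAN resolves depends on the JVM's installed charsets.
        System.out.println(resolve("X-MAC-ROMAN", StandardCharsets.UTF_8));
        System.out.println(resolve("UTF-8", StandardCharsets.ISO_8859_1));
    }
}
```

The resolved name could then be passed to Jsoup.parse(InputStream in, String charsetName, String baseUri) after downloading the page bytes separately, so that jsoup does not look up the unsupported charset itself.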

JSoup with userAgent prevent redirects

…衆ロ難τιáo~ submitted on 2020-01-06 01:24:28
Question: I use jsoup for my web crawler: Connection con = Jsoup.connect("http://t.co/uySIPVNfgP"); Document doc = con.get(); String u = doc.baseUri(); This gives the redirected URL as the base URI. But with a user agent set as follows: con.userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6"); it does not follow the redirect. As far as I know, some websites do not allow their content to be crawled without a user agent. How can I solve this?
Answer 1:

HTML table id and class id

送分小仙女□ submitted on 2020-01-05 16:55:52
Question: How can I find the table id of the large table at the following URL: http://en.wikipedia.org/wiki/States_and_territories_of_India I was able to see the classes wikitable sortable jquery-tablesorter. This is the table that lists the states of India. I was able to confirm with Firebug that this table (wikitable sortable jquery-tablesorter) contains the list of states. How can I get the ID of that table? What is the equivalent CSS selector to get all the names in that table? I want to get only the
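A table does not need an id to be selected; chaining its classes in a CSS selector is enough. The sketch below assumes the names sit in the first td of each row and that jquery-tablesorter is added by script in the browser, so it may be absent from the HTML jsoup receives. Both are assumptions about the page, and the sample rows are made up. WikiTableNames and firstColumn are illustrative names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class WikiTableNames {

    // Returns the text of the first <td> in each row of the first
    // table that carries both the "wikitable" and "sortable" classes.
    static List<String> firstColumn(String html) {
        Document doc = Jsoup.parse(html);
        List<String> names = new ArrayList<>();
        Element table = doc.selectFirst("table.wikitable.sortable");
        if (table == null) {
            return names;
        }
        // table.id() is an empty string when the table has no id attribute.
        for (Element row : table.select("tr")) {
            Element cell = row.selectFirst("td");
            if (cell != null) { // header rows contain only <th>
                names.add(cell.text());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Tiny stand-in for the real page; rows are sample data.
        String html = "<table class='wikitable sortable'>"
                + "<tr><th>Name</th><th>Capital</th></tr>"
                + "<tr><td>Goa</td><td>Panaji</td></tr>"
                + "<tr><td>Kerala</td><td>Thiruvananthapuram</td></tr>"
                + "</table>";
        System.out.println(firstColumn(html));
    }
}
```

If the real table keeps its names in a different column or in th cells, only the inner selector needs to change.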
